Zhejiang University, Xiao Zhonghua (Richard Xiao): Corpus Linguistics, Session 3 (Foreign Language Learning)


Data capture and corpus markup
Corpus Linguistics
Richard Xiao
lancsxiaoz@googlemail.com

Outline of the session
• Lecture
  – Data capture
  – Some e-text archives
  – Copyright in corpus creation
  – Corpus markup
• Lab
  – MLCT
  – WST WebGetter
  – Some transcribing tools

Data to be collected
• Like other decisions in corpus creation (e.g. balance, representativeness, size), the kind of data to be collected also depends on your research questions
  – If you wish to compare British English and American English, you will need to collect spoken and/or written data produced by native speakers of the two regional varieties of English
  – If you are interested in how Chinese speakers acquire English as a second language, you will then need to collect the English data produced by Chinese learners to create a learner corpus
  – If you are interested in how the English language has evolved over centuries, you will need to collect samples of English produced in different historical periods to build a historical or diachronic corpus

Data capture
• Having developed an understanding of the type of data you need to collect, and having made sure that no ready-made corpus of such material exists, you'll need to capture the data
• Data digitalisation
  – Machine-readability is a de facto feature of a modern corpus
• Text must be rendered machine-readable
  – Keyboarding
  – OCR (Optical Character Recognition) scanning
  – Transcribing audio/video recording
• Existing electronic data is preferred over paper-based materials
  – The Web as an important source of machine-readable data for many languages
  – Converting other file formats such as HTML, Word, PDF into plain text format
• The World-Wide-Web is an important source of electronic text archives

Some useful data sources
• Oxford Text Archive
  – Oldest text archive - thousands of texts (and many well-known corpora) in more than 25 different languages
• Project Gutenberg
  – First producer of free electronic books
  – 28,000 e-books!
• Digital collections of university libraries
• Corpus4u electronic text archives

Copyright in corpus creation
• A corpus consisting entirely of copyright-free old texts is not useful in the study of contemporary language
• Copyright is a major issue in data collection if you are to publish or make your corpus publicly available
• The samples taken under the convention of ‘fair dealing’ in copyright law are so small as to jeopardize any claim of balance or representativeness
• There is as yet no satisfactory solution to the issue of copyright in corpus creation
• Tips for copyright issues
  – Usually easier to obtain permission for samples than for full texts
  – Easier for smaller samples than for larger ones
  – If you show that you are acting in good faith, and only small samples will be used in non-profit-making research, copyright holders are typically pleased to grant you permission
  – You don't need to worry about copyright if you build a corpus for your private use!

Corpus markup
• System of standard codes inserted into a document stored in electronic form to provide information about the text itself and govern formatting, printing and other processes
  – Describing the document (“metadata” like source, name, author, date, etc)
  – Marking boundaries for paragraphs, sentences, and words, omissions etc
  – Displaying markup (font, font size, positioning)

Example of markup
[Figure: a marked-up example with its start tag and end tag labelled]

Why markup?
• Markup recovers contextual information of sampled texts which are taken out of context
• Markup allows for a broader range of research questions to be addressed by providing extra information such as text types, sociolinguistic variables, structural organization
• Markup allows corpus builders to insert editorial comments during the corpus building process
• Pre-processing written texts, and particularly transcribing spoken data, also involves markup (e.g. pause, paralinguistic features etc)

Markup schemes
• The extra markup information must be kept separate from the textual data in a corpus
• Markup schemes
  – COCOA
  – TEI (Text Encoding Initiative)
  – CES (Corpus Encoding Standard)

COCOA reference
• One of the earliest markup schemes
• Consisting of a set of attribute names and values enclosed in angled brackets
  – e.g. <A WILLIAM SHAKESPEARE>
    • attribute name = A (author)
    • attribute value = WILLIAM SHAKESPEARE
• Only encoding a limited set of features such as authors, titles and dates
• Giving way to more modern schemes

TEI guidelines
• Sponsored by three major academic associations concerned with humanities computing
  – The Association for Computational Linguistics (ACL)
  – The Association for Literary and Linguistic Computing (ALLC)
  – The Association for Computers and the Humanities (ACH)
• Aiming to facilitate data exchange by standardizing the markup or encoding of information stored in electronic form
• Each individual text is a document consisting of a header and a body, which are in turn composed of different elements
• TEI corpus header has 4 principal elements
  – A file description (fileDesc): a full bibliographic description
  – An encoding description (encodingDesc): relationship between an electronic text and its source or sources
  – A text profile (profileDesc): a detailed description of non-bibliographic aspects of a text
  – A revision history (revisionDesc): a record of changes to a file
• Only fileDesc is required to be TEI-compliant
  – The other three elements are optional
• Tags can be nested, i.e. an element can appear inside another element

[Figure: the BNC header]

TEI guidelines
• Markup languages adopted by the TEI
  – SGML (Standard Generalized Markup Language)
  – XML (eXtensible Markup Language)
• Current version of TEI: P5 guidelines
• See the TEI of
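
The four-part header described under the TEI guidelines can be illustrated with a minimal sketch. It uses Python's standard xml.etree.ElementTree module to build a header skeleton. The function name build_header and all placeholder text are hypothetical, chosen only for illustration. The children of fileDesc (titleStmt, publicationStmt, sourceDesc) follow TEI P5 and are not named on the slides. The output is not validated against any TEI schema, and the three optional elements are left empty, so treat it as a picture of the nesting rather than a compliant header.

```python
import xml.etree.ElementTree as ET


def build_header(title, source_note, include_optional=True):
    """Build a skeleton TEI-style header (illustrative, not schema-validated)."""
    header = ET.Element("teiHeader")

    # fileDesc: the only element the slides describe as required.
    file_desc = ET.SubElement(header, "fileDesc")
    title_stmt = ET.SubElement(file_desc, "titleStmt")
    ET.SubElement(title_stmt, "title").text = title
    publication = ET.SubElement(file_desc, "publicationStmt")
    ET.SubElement(publication, "p").text = "Unpublished sample"  # placeholder
    source_desc = ET.SubElement(file_desc, "sourceDesc")
    ET.SubElement(source_desc, "p").text = source_note

    # The other three principal elements are optional; left empty here.
    if include_optional:
        for name in ("encodingDesc", "profileDesc", "revisionDesc"):
            ET.SubElement(header, name)
    return header


header = build_header("Sample text", "Hypothetical source")
print(ET.tostring(header, encoding="unicode"))
```

Calling build_header with include_optional=False yields a header containing fileDesc alone, which mirrors the point that the other three elements may be omitted. The nesting of title inside titleStmt inside fileDesc is an instance of one element appearing inside another.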
