

Bilingual Parallel Corpora and Language Engineering

Harold Somers
Department of Language Engineering, UMIST, Manchester, England
harold@ccl.umist.ac.uk

1. Introduction

The use of corpora has become an important issue in Language Engineering (LE). In this paper we will be considering a specific type of corpus, the bilingual parallel corpus. By “parallel corpus”, we mean a text which is available in two (or more) languages: it may be an original text and its translation, or it may be a text which has been written by a consortium of authors in a variety of languages, and then published in various language versions. A corpus of this type of text is sometimes called a “comparable corpus”, though this term is also used (confusingly) for a corpus of similar but not necessarily equivalent texts. Another term sometimes found is “bitext”, due to Brian Harris (1988).

Parallel corpora are a valuable source of a kind of linguistic metaknowledge, which forms the basis of techniques such as tokenization, POS-tagging, morphological and syntactic analysis, which in turn can be used to develop LE applications. This paper focuses on problems (and solutions) related to the extraction of linguistic meta-knowledge from parallel corpora.

2. “First, catch your corpus”

The first requirement for knowledge extraction from bilingual corpora is, rather obviously, a parallel corpus. Fully annotated aligned multilingual parallel corpora in a number of languages are becoming increasingly widely available through various coordinated international efforts. A visit to any of a number of websites devoted to corpora in general and bilingual corpora in particular reveals a long list of such collections. The W3C website at Essex University (cl) is a good starting point. Nevertheless, even though the number of collections is ever increasing, the number of different languages featured is still rather small. Also, some of the collections are relatively unfocused in terms of subject matter. In either case there may be a problem of coverage for a particular need. In this case, you might need to attempt to locate and analyse your corpus from scratch. So we begin by considering some ways of automatically locating parallel texts, and some issues involved in retrieving and storing such data.

2.1. Locating parallel corpora automatically

Although English is overwhelmingly
the lingua franca of the World Wide Web, a great number of websites have parallel material in several languages. These evidently provide an instant source of parallel texts, if they can be located and successfully aligned.

Interesting work on automatically identifying and locating parallel corpora has been initiated by Resnik (1998, 1999). The idea is first of all to find likely candidate pairs of texts using such “tricks” as searching for sites which seem to have parallel “anchors” (see below), often accompanied by images of flags, or pairs of filenames which differ only in the identification of a language, e.g. with alternative directories in the paths, or suffixes such as .en and .fr. These candidates are then evaluated by comparing, in a very simplistic manner, their content: since they are usually HTML documents, it is usually quite easy to align the HTML mark-up (heading and subheading identifiers, for example), and to compare the amount of text between each anchor. In this way, we get a rough map of the structures of the two documents. These can then be compared using a variety of more or less sophisticated techniques which may or may not include the kinds of linguistic methods used in the alignment of known parallel texts – see next section.

Flexibility in mark-up conventions can undermine this technique, however. For example, Figure 1 shows parallel English and French pages (written by the current author) with minor differences in mark-up and content.

Figure 1. HTML versions of parallel web pages. Notice differences in capitalization in the tags, order of elements in the BODY tag, and textual differences, e.g. an additional LI item in the French version.

English version:

    <HTML><HEAD><TITLE>ATLAS Symposium</TITLE></HEAD>
    <BODY bgcolor=ffffff text=115511 LINK=004080 vLINK=0040800>
    <center><img src=”...” alt=logo height=145 width=184>
    <h1>Arabic Translation and Localisation Symposium<p>Symposium sur la Traduction et la Localisation en Arabe<br><img src=arabatlas.gif alt=arabic></h1>...</center>
    <p>It is one of the five official languages of the United Nations, it has 260 million native speakers, and is used as a second language by a further 1.3 billion people. ...
    <center>
    <li>Arabic corpus processing
    <li>Development of Arabic resources
    <li>Web tools for Arabic

French version:

    <HTML><HEAD><TITLE>Symposium ATLAS</TITLE></HEAD>
    <BODY TEXT=#115511 BGCOLOR=#FFFFFF LINK=#004080 VLINK=#048000>
    <CENTER><IMG SRC=”... ALT=logo HEIGHT=145 WIDTH=184>
    <H1>Symposium sur la Traduction et la Localisation en Arabe<P>Arabic Translation and Localisation Symposium<BR><IMG SRC=arabatlas.gif></h1>...</CENTER>
    <p>L'une des cinq langues officielles de l'ONU est l'<B>arabe</B>, la langue maternelle de 260 millions de locuteurs, qu'utilisent environ 1.3 milliards de musulmans comme deuxième langue. ...
    <CENTER>
    <LI>les standards de codage des caractères arabes</LI>
    <LI>le traitement des corpus en arabe</LI>
    <LI>le développement des ressources pour l'arabe</LI>
    <LI>les outils Internet pour l'arabe</LI>

2.2. Storage and encoding

Having located a suitable parallel corpus, there remain a number of aspects to consider before the process of linguistic knowledge extraction can begin. One, which should not be ignored, is the issue of determining the legal position with respect to the text: even though the
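
The candidate-finding and structure-comparison heuristics described in section 2.1 can be illustrated with a minimal Python sketch. This is not Resnik's implementation: the function and class names, the restriction to the suffixes .en and .fr, and the use of difflib's SequenceMatcher as the comparison measure are all assumptions made here for illustration. The sketch pairs URLs that differ only in a language suffix, reduces each HTML page to a sequence of tag names with the amount of text following each tag, and scores how similar two tag sequences are.

```python
import re
from difflib import SequenceMatcher
from html.parser import HTMLParser

# Assumption: only the two suffixes mentioned in the text are recognised.
LANG_SUFFIX = re.compile(r"\.(en|fr)(?=\.|$)")


def candidate_pairs(urls):
    """Pair URLs that are identical once the language suffix is masked."""
    groups = {}
    for url in urls:
        match = LANG_SUFFIX.search(url)
        if match:
            # Replace the language suffix with a placeholder to form the key.
            key = url[:match.start()] + ".*" + url[match.end():]
            groups.setdefault(key, {})[match.group(1)] = url
    return [(g["en"], g["fr"]) for g in groups.values()
            if "en" in g and "fr" in g]


class StructureMap(HTMLParser):
    """Collect [tag, text_length] items: a rough map of a page's structure."""

    def __init__(self):
        super().__init__()
        self.items = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <H1> and <h1> compare equal.
        self.items.append([tag, 0])

    def handle_data(self, data):
        # Attribute text to the most recent start tag, ignoring whitespace.
        if self.items:
            self.items[-1][1] += len(data.strip())


def structure(html):
    """Return the list of [tag, text_length] items for an HTML string."""
    parser = StructureMap()
    parser.feed(html)
    parser.close()
    return parser.items


def structural_similarity(html_a, html_b):
    """Score in [0, 1] comparing the two pages' start-tag sequences."""
    tags_a = [tag for tag, _ in structure(html_a)]
    tags_b = [tag for tag, _ in structure(html_b)]
    return SequenceMatcher(None, tags_a, tags_b).ratio()
```

Because tag names are lowercased and attributes are ignored, differences in capitalization and attribute order of the kind shown in Figure 1 do not affect the score, whereas an extra LI item lowers it slightly. The text lengths returned by structure() are not used in the score here; a fuller comparison might also weigh them, since parallel passages tend to be of roughly proportional length.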
