PCFG models of linguistic tree representations


Mark Johnson*
Brown University

* Department of Cognitive and Linguistic Sciences, Box 1978, Providence, RI 02912

© 1998 Association for Computational Linguistics
Computational Linguistics Volume 24, Number 4

The kinds of tree representations used in a treebank corpus can have a dramatic effect on performance of a parser based on the PCFG estimated from that corpus, causing the estimated likelihood of a tree to differ substantially from its frequency in the training corpus. This paper points out that the Penn II treebank representations are of the kind predicted to have such an effect, and describes a simple node relabeling transformation that improves a treebank PCFG-based parser's average precision and recall by around 8%, or approximately half of the performance difference between a simple PCFG model and the best broad-coverage parsers available today. This performance variation comes about because any PCFG, and hence the corpus of trees from which the PCFG is induced, embodies independence assumptions about the distribution of words and phrases. The particular independence assumptions implicit in a tree representation can be studied theoretically and investigated empirically by means of a tree transformation/detransformation process.

1. Introduction

Probabilistic context-free grammars (PCFGs) provide simple statistical models of natural languages. The relative frequency estimator provides a straightforward way of inducing these grammars from treebank corpora, and a broad-coverage parsing system can be obtained by using a parser to find a maximum-likelihood parse tree for the input string with respect to such a treebank grammar. PCFG parsing systems often perform as well as other simple broad-coverage parsing systems for predicting tree structure from part-of-speech (POS) tag sequences (Charniak 1996). While PCFG models do not perform as well as models that are sensitive to a wider range of dependencies (Collins 1996), their simplicity makes them straightforward to analyze both theoretically and empirically. Moreover, since more sophisticated systems can be viewed as refinements of the basic PCFG model (Charniak 1997), it seems reasonable to first attempt to better understand the properties of PCFG models themselves.

It is well known that natural language exhibits dependencies that context-free grammars (CFGs) cannot describe (Culy 1985; Shieber 1985). But the statistical independence assumptions embodied in a particular PCFG description of a particular natural language construction are in general much stronger than the requirement that the construction be generated by a CFG. We show below that the PCFG extension of what seems to be an adequate CFG description of PP attachment constructions performs no better than PCFG models estimated from non-CFG accounts of the same constructions.

More specifically, this paper studies the effect of varying the tree structure representation of PP modification from both a theoretical and an empirical point of view. It compares PCFG models induced from treebanks using several different tree representations, including the representation used in the Penn II treebank corpora (Marcus, Santorini, and Marcinkiewicz 1993) and the Chomsky adjunction representation now standardly assumed in generative linguistics.

One of the weaknesses of a PCFG model is that it is insensitive to nonlocal relationships between nodes. If these relationships are significant then a PCFG will be a poor language model. Indeed, the sense in which the set of trees generated by a CFG is context free is precisely that the label on a node completely characterizes the relationships between the subtree dominated by the node and the nodes that properly dominate this subtree.

Roughly speaking, the more nodes in the trees of the training corpus, the stronger the independence assumptions in the PCFG language model induced from those trees. For example, a PCFG induced from a corpus of completely flat trees (i.e., consisting of the root node immediately dominating a string of terminals) generates precisely the strings of the training corpus with likelihoods equal to their relative frequencies in that corpus. Thus the location and labeling on the nonroot nonterminal nodes determine how a PCFG induced from a treebank generalizes from that training data. Generally, one might expect that the fewer the nodes in the training corpus trees, the weaker the independence assumptions in the induced language model. For this reason, a flat tree representation of PP modification is investigated here as well.

A second method of relaxing the independence assumptions implicit in a PCFG is to encode more information in each node's label. Here the intuition is that the label on a node is a communication channel that conveys information between the subtree dominated by the node and the part of the tree not dominated by this node, so all other things being equal, appending to the node's label additional information about the context in which the node appears should make the independence assumptions implicit in the PCFG model weaker. The effect of adding a particularly simple kind of contextual information (the category of the node's parent) is also studied in this paper.

Whether either of these two PCFG models outperforms a PCFG induced from the original treebank is a separate question. We face a classical bias versus variance dilemma here (Geman, Bienenstock, and Doursat 1992): as the independence assumptions implicit in the PCFG model are weakened, the number of parameters that must be estimated (i.e., the number of productions) increases. Thus while moving to a class of models with weaker independence assumptions permits us to more accurately describe a wider class of distributions (i.e., it reduces the bias implicit in the estimator), in general our esti
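
The two ideas introduced above, relative frequency estimation of a PCFG from a treebank and relabeling each node with its parent's category, can be illustrated with a small sketch. This is not the paper's implementation. The tuple encoding of trees, the function names, the "^" separator, and the toy trees are all assumptions made for illustration. The sketch also relabels every nonroot nonterminal, including preterminals, which may differ from the exact scheme used in the paper.

```python
from collections import Counter

# Assumed encoding: a tree is a (label, children) tuple, where children is a
# tuple whose items are either subtrees or terminal strings.

def productions(tree):
    """Yield one (lhs, rhs) production for every nonterminal node."""
    label, children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    yield (label, rhs)
    for child in children:
        if not isinstance(child, str):
            yield from productions(child)

def estimate_pcfg(treebank):
    """Relative frequency estimate: P(A -> rhs) = count(A -> rhs) / count(A)."""
    rule_counts = Counter(p for tree in treebank for p in productions(tree))
    lhs_counts = Counter()
    for (lhs, _), n in rule_counts.items():
        lhs_counts[lhs] += n
    return {rule: n / lhs_counts[rule[0]] for rule, n in rule_counts.items()}

def tree_probability(tree, pcfg):
    """Product of the probabilities of the productions used in the tree."""
    prob = 1.0
    for rule in productions(tree):
        prob *= pcfg.get(rule, 0.0)
    return prob

def annotate_parents(tree, parent=None):
    """Transformation: relabel each nonroot nonterminal as 'label^parent'."""
    label, children = tree
    new_label = label if parent is None else f"{label}^{parent}"
    new_children = tuple(
        c if isinstance(c, str) else annotate_parents(c, label)
        for c in children
    )
    return (new_label, new_children)

def strip_parents(tree):
    """Detransformation: drop the '^parent' suffix from every label."""
    label, children = tree
    new_children = tuple(
        c if isinstance(c, str) else strip_parents(c) for c in children
    )
    return (label.split("^")[0], new_children)

# Hypothetical toy data: three flat trees and one small nested tree.
flat_bank = [("S", ("a", "b")), ("S", ("a", "b")), ("S", ("c",))]
nested = ("S", (("NP", ("she",)),
                ("VP", (("V", ("saw",)), ("NP", ("stars",))))))

flat_pcfg = estimate_pcfg(flat_bank)
plain_pcfg = estimate_pcfg([nested])
annotated_pcfg = estimate_pcfg([annotate_parents(nested)])
```

With the flat trees, `tree_probability(flat_bank[0], flat_pcfg)` equals the relative frequency of the string "a b" in the toy corpus (two out of three), matching the flat-tree observation in the text. In the nested example, the plain grammar pools both NP expansions under one label, while the annotated grammar keeps separate distributions for NP^S and NP^VP at the cost of more productions to estimate.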
