On the Statistical Consistency of DOP Estimators

Detlef Prescher, Remko Scha, Khalil Sima’an and Andreas Zollmann
Institute for Logic, Language and Computation (ILLC)
Universiteit van Amsterdam

Abstract

A statistical estimator attempts to guess an unknown probability distribution by analyzing a sample from this distribution. One desirable property of an estimator is that its guess is increasingly likely to get arbitrarily close to the actual distribution as the sample size increases. This property is called consistency. Data Oriented Parsing (DOP) employs all fragments of the trees in a training treebank, including the full parse-trees themselves, as the rewrite rules of a probabilistic tree-substitution grammar. Since the most popular DOP-estimator (DOP1) was shown to be inconsistent, there is an outstanding theoretical question concerning the possibility of DOP-estimators with reasonable statistical properties. This question constitutes the topic of the current paper. First, we show that, contrary to common wisdom, any unbiased estimator for DOP is futile because it will not generalize over the training treebank. Subsequently, we show that a consistent estimator that generalizes over the treebank should involve a local smoothing technique. This exposes the relation between DOP and existing memory-based models that work with full memory and an analogical function such as k-nearest neighbor, which is known to implement backoff smoothing. Finally, we present a new consistent backoff-based estimator for DOP and discuss how it combines the memory-based preference for the longest match with the probabilistic preference for the most frequent match.

1 Introduction

Probabilistic systems for syntactic disambiguation are usually based on non-redundant, linguistically inspired grammars. The simplest of these (such as Probabilistic Context-Free Grammars) treat the different grammar rules as statistically independent. Since this results in sub-optimal predictions, more recent models allow the probabilities of rules to be conditioned on their local context, e.g. on the labels of parent- and sister-nodes, head-POS-tags or head-words (Black, Jelinek, Lafferty, Magerman, Mercer and Roukos 1993, Collins 1997, Johnson 1998, Charniak 2000, Collins and Duffy 2002).

Data-Oriented Parsing (DOP) (Scha 1990, Bod 1995) constitutes an alternative approach to this problem. DOP systems employ unconditional probabilities, but they achieve a fairly strong expressive power because they drop the assumption of non-redundancy. DOP treats all (arbitrarily large, and possibly overlapping) subtrees of an arbitrarily large treebank as accessible elements of a person’s past linguistic experience, and assumes that all of them are potentially relevant for subsequent disambiguation decisions.

DOP is thus reminiscent of non-probabilistic classification techniques like k-nearest-neighbor and other forms of Memory-Based Learning (Stanfill and Waltz 1986, Daelemans 1995), which analyze new input by matching it with a store of earlier instances. As emphasized in (Scha 1990), DOP should build a synthesis between the memory-based approach and the statistical approach: “(1) The analogy between the input sentence and the corpus should be constructed in the simplest possible way, i.e.: the number of constructions that is used to reconstruct the sentence should be as small as possible. (2) More frequent constructions should be preferred above less frequent ones.”

Non-probabilistic DOP systems which explicitly implement the first of these desiderata were developed only recently (De Pauw 2000a, De Pauw 2000b, Bod 2000b). Most work on DOP has employed Stochastic Tree Substitution Grammars, and has assumed that the preference for the simplest analysis will be a natural side-effect of choosing the right probability assignments. Various methods for estimating the subtree probabilities have been proposed for the DOP model. The motivation for these proposals, however, has been largely heuristic or practical. A fundamental question has thus remained unanswered: Is it possible to use statistically well-motivated estimators for treebank-grammars with such a large and redundant parameter space?

Background. Negative results about some DOP estimators have been established already. Johnson (2002) investigated the DOP1 estimator (Bod 1995) which estimates the substitution probability of a subtree for a non-terminal node directly as the relative frequency of this subtree among all corpus subtrees with the same root-label. Johnson showed that this estimator is inconsistent and biased. Bonnema et al. (1999) discuss Maximum Likelihood Estimation (MLE) and conclude that it cannot be used for DOP. They provide an example where the Maximum-Likelihood Estimator completely overfits the probability model to the training treebank, i.e. it assigns zero probability to all parses which do not occur in the treebank.
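For concreteness, the DOP1 estimator just described can be written out explicitly; the notation below is ours and serves only as an illustration of the definition given in the cited work. Writing $C(t)$ for the frequency of a subtree $t$ in the training treebank and $r(t)$ for its root label, DOP1 assigns the substitution probability

$$ p_{\mathrm{DOP1}}(t) \;=\; \frac{C(t)}{\sum_{t'\,:\,r(t')=r(t)} C(t')} . $$

In the usual statistical sense, an estimator $p_n$ computed from a treebank of $n$ trees is unbiased if its expected value equals the true distribution, $E[p_n] = p$, and consistent if, for every $\varepsilon > 0$, the probability that $p_n$ deviates from $p$ by more than $\varepsilon$ (under some fixed distance between distributions) tends to zero as $n \to \infty$. Johnson's result is that the relative-frequency assignment above satisfies neither requirement.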
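One way to see the overfitting behaviour noted by Bonnema et al. (the sketch below is ours, not a reproduction of their example) is the following. Because every full parse tree of the treebank is itself one of the available fragments, an all-fragments STSG can express the empirical distribution over the treebank's parses, and the likelihood of a treebank $t_1, \dots, t_n$,

$$ L(p) \;=\; \prod_{i=1}^{n} p(t_i), $$

is maximized, over all distributions on parse trees, precisely by that empirical distribution. A maximum-likelihood fit can therefore simply give each full-tree fragment its relative frequency and every other fragment with the same root label probability zero; this reproduces the treebank exactly and assigns zero probability to every parse that does not occur in it.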
It should be noted that implemented systems rarely deploy the full DOP model. Limitations of space and time usually require that the subtree-set is restricted to subtrees which satisfy certain criteria (maximum depth, minimal/maximal number of terminal/non-terminal leaves, etc.). Thus, an implemented system may weaken or lose some properties of the full DOP model. In order to understand the essential features of DOP, however, it is interesting to investigate the unrestricted model.

Preview. This paper investigates the possibility of consistent, non-overfitting estimators for the unrestricted DOP model. (We thus exclude subtree selection on the basis of a priori criteria.) The paper strengthens the existing negative results about DOP estimators. It proves that any unbiased estimator for the DOP model will yield a probabi