SIAMJ.COMPUT.Vol.18,No.6,pp.1245-1262,December1989(C)1989SocietyforIndustrialandAppliedMathematics011SIMPLEFASTALGORITHMSFORTHEEDITINGDISTANCEBETWEENTREESANDRELATEDPROBLEMS*KAIZHONGZHANGANDDENNISSHASHA$Abstract.Orderedlabeledtreesaretreesinwhichtheleft-to-rightorderamongsiblingsis.significant.Thedistancebetweentwoorderedtreesisconsideredtobetheweightednumberofeditoperations(insert,delete,andmodify)totransformonetreetoanother.Theproblemofapproximatetreematchingisalsoconsidered.Specifically,algorithmsaredesignedtoanswerthefollowingkindsofquestions:1.Whatisthedistancebetweentwotrees?2.WhatistheminimumdistancebetweenTandTwhenzeroormoresubtreescanberemovedfromT23.Letthepruningofatreeatnodenmeanremovingallthedescendantsofnoden.Theanalogousquestionforpruningsasforsubtreesisanswered.AdynamicprogrammingalgorithmispresentedtosolvethethreequestionsinsequentialtimeO(ITllxIT2lxmin(depth(Tt),leaves(T))xmin(depth(T2),leaves(T2)))andspaceO(Ir,xlT21)comparedwitho(IT,IIT=Ix(depth(T)):x(depth(T2)))forthebestpreviouspublishedalgorithmduetoTai[J.Assoc.Comput.Mach.,26(1979),pp.422-433].Further,thealgorithmpresentedherecanbeparallelizedtogivetimeO(1T[/1T=I).Keywords,trees,editingdistance,parallelalgorithm,dynamicprogramming,patternrecognitionAMS(MOS)subjectclassifications.68P05,68Q25,68Q20,68R101.Motivation.1.1.Applications.Orderedlabeledtreesaretreeswhosenodesarelabeledandinwhichtheleft-to-rightorderamongsiblingsissignificant.Assuchtheycanrepresentgrammarparses,imagedescriptions,andmanyotherphenomena.Comparingsuchtreesisawaytocomparescenes,parses,andsoon.Asanexample,considerthesecondarystructurecomparisonproblemforRNA.BecauseRNAisasinglestrandofnucleotides,itfoldsbackontoitselfintoashapethatistopologicallyatree(calleditssecondarystructure).Eachnodeofthistreecontainsseveralnucleotides.Nodeshavecolorfullabelssuchasbulgeandhairpin.Variousresearchers[ALKBO],[BSSBWD],[DD]haveobservedthatthesecondarystructureinfluencestranslationrates(fromRNAtoproteins).Becausedifferentsequen-cescanproducesimilarsecondarystructuresIDA],[SKI,comparisonsamongsecon-darystructuresarenecessarytounderstandingthecomparativefunctionalityofdifferentRNAs.PreviousmethodsforcomparingmultiplesecondarystructuresofRNAmoleculesrepresentthetreestructuresasparenthesizedstrings[$88].Thesehavebeenrecentlyconvertedtousingourtreedistancealgorithms.Currentlyweareimplementingapackagecontainingalgorithmsdescribedinthispaperandsomeotherrelatedalgorithms.ApreliminaryversionofthepackageisbeingusedattheNationalCancerInstitutefortheRNAcomparisonproblem.1.2.Algorithmicapproach.Thetreedistanceproblemisharderthanthestringdistanceproblem.Intuitively,hereiswhy.Inthestringcase,ifSl[i]S2[j],thenthe*ReceivedbytheeditorsAugust5,1987;acceptedforpublication(inrevisedform)February12,1989.ThisworkwaspartiallysupportedbytheNationalScienceFoundationundergrantnumberDCR8501611andbytheOfficeofNavalResearchundergrantnumberN00014-85-K-0046.CourantInstituteofMathematicalSciences,NewYorkUniversity,251MercerStreet,NewYork,NewYork10012(zhang@csd2.nyu.edu).Presentaddress,DepartmentofComputerScience,MiddlesexCollege,TheUniversityofWesternOntario,London,Ontario,CanadaN6A5B7.tCourantInstituteofMathematicalSciences,NewYorkUniversity,251MercerStreet,NewYork,NewYork,10012(shasha@nyu.edu).12451246K.Z.ZHANGANDD.SHASHAdistancebetweenSl[1..i-1]andSz[1..j--1]isthesameasbetweenSl[1..i]and$2[1..j].Themaindifficultyinthetreecaseisthatpreservingancestorrelationshipsinthemappingbetweentreespreventstheanalogousimplicationfromholding.Byintroducingthedistancebetweenorderedforestsandcarefuleliminationofcertainsubtree-to-subtreedistancecalculationsweareabletoimprovethetimeandspaceofbestpreviouspublishedalgorithm[T].Notethattheimprovementofspaceforthisproblemisextremelyimportantinpracticalapplications.Besidesimprovingonthetimeandspaceofthebestpreviousalgorithm[T],ouralgorithmisfarsimplertounderstandandtoimplement.Instyle,itresemblesalgorithmsforcomputingthedistancebetweenstrings.Infact,thestringdistancealgorithmisaspecialcaseofouralgorithmwhentheinputisastring.2.Definitions.2.1.Editoperationsandeditingdistancebetweentrees.Letusconsiderthreekindsofoperations.Changingnodenmeanschangingthelabelonn.Deletinganodenmeansmakingthechildrenofnbecomethechildrenoftheparentofnandthenremovingn.Insertingisthecomplementofdelete.Thismeansthatinsertingnasthechildofn’willmakentheparentofaconsecutivesubsequenceofthecurrentchildrenofn’.Figs.1-3illustratetheseeditingoperations.T1T2(a--b)FG.(bA)FG.2T1(Ab)FG.3TREEEDITINGDISTANCE1247(1)Change.Tochangeonenodelabeltoanother.(2)Delete.Todeleteanode.(Allchildrenofthedeletednodebbecomechildrenoftheparenta.)(3)Insert.Toinsertanode.(Aconsecutivesequenceofsiblingsamongthechildrenofabecomethechildrenofb.)Following[WF]and[T],werepresentaneditoperationasapair(a,b)(A,A),sometimeswrittenasab,whereaiseitherAoralabelofanodeintreeT1andbiseitherAoralabelofanodeintreeT2.WecallabachangeoperationifaAandbA;adeleteoperationifbA;andaninsertoperationifaA.Sincemanynodesmayhavethesamelabel,thisnotationispotentiallyambiguous.It