DataDrivenApproachesforLarge-scaleTaxonomyConstructionJiaqingLiang,FudanUniversityYanghuaXiao,FudanUniversityPreliminaries•is-a•is-a,(is-a-subtype-oforis-a-subclass-of).•Thisdefineswhichobjectsareclassifiedbywhichclass.•Forexample,FordExploreris-a-subclass-of4-WheelDriveCar,whichinturnis-a-subclass-ofCar•Hypernym&hyponym,conceptandentity•appleisAfruit,orhyponym(apple,fruit)•fruitisapple’shypernym/concept(superclass)•appleisfruit’shyponym/entity(subclass)•Herethe`entity`maybea`sub-concept`•Taxonomy•Theadditionofisarelationshipscreatesataxonomy:atree-likestructure•Wesimplycallanodeinthetaxonomy(entityorconcept)aterm,itisawordoraphrase.From:•Understandaninstance•iphoneisAsmartphoneenablesmachinetounderstandthesearchintentofiphone(i.e.smartphone).•Entityrecommendation•galaxys4isAsmartphonefurtherallowstorecommendtherelatedkeywordgalaxys4•Manyapplications•machinetranslation•queryexpansion•documentclassification•datacleaning•entityresolution•informationintegrationDataDrivenvsHandCrafted•Manuallyconstructedknowledgegraph•Examples:WordNet,Cyc•Size:Small(Hugehumancost)•Quality:Almostperfect(Eachrelationischeckedbyexpects)•Auto-constructedknowledgegraph•Automaticallyextractedfromhugewebcorpus•Examples:Probase,WikiTaxonomy,etc•Size:Huge(Fromhugecorpus)•Quality:Good(Theaccuracycan’treach100%)•Becauseofthehugesize,therearemanywrongfactsProbase•Aweb-scaletaxonomyderivedfromwebpagesbyHearstlinguisticpatterns•“…famousbasketballplayerssuchasMichaelJordan…”•domesticanimalssuchascatsanddogs...•Chinaisadevelopingcountry.•Lifeisaboxofchocolate.•10Mterms,and16MisArelations•ProbabilisticknowledgebasepoliticianspeoplepresidentsGeorgeW.Bush,0.0117BillClinton,0.0106GeorgeH.W.Bush,0.0063HillaryClinton,0.0054BillClinton,0.057GeorgeH.W.Bush,0.021GeorgeW.Bush,0.019•End-to-end•DomainspecificCompletion•Collaborativefilteringbasedcompletion•TransitivityinferencebasedcompletionCorrection•GraphstructurebasedcorrectionCost:CostlyHumanEffortsQuality:MissingdataQuality:WrongdataMissingisArelationships•ManyvalidisArelationshipsaremissinginthetaxonomy•“bigUKsupermarket”hasnohypernymsinProbase•Datasparsity,therelationshipbetween“bigUKsupermarket”and“publicplace”rarelyappearsexplicitly•“stevejobs”doesnotconnecttotheconcept“billionaire”•Commonsense,itistooobvioustobementionedintexts•MissingisArelationshipsbreaktheinferenceErrorsinautomaticallyconstructedlexicaltaxonomies•WrongisArelationsinProbase:•Errorsincorpus•“…makeParissuchasexcitingcity…”•leadsto'excitingcity’isA‘Paris’•Errorsmadebyinformationextractionalgorithms•Howtodetecterrorsinautomaticallyconstructedlexicaltaxonomies?ChallengesCharacteristicsofdata-driventaxonomiesandchallenges•Web-scale.•TheyusuallycontainmillionsoftermsandtensofmillionsofisArelationships.•Itisagreatchallengeforthescalabilityofsolutions.•Noise.•SomeexistingisArelationshipsarewrong,andmisleading.•InProbase,“germany”isA“latinamericancountry”.Wemightinferthat“france”isA“latinamericancountry”too.•Howtopreventtheinferencefromthenoisyrelationships?•Ambiguity.•AlexicaltaxonomysuchasProbasedoesnotdistinguishthedifferentsensesofaterm.•Forexample,“apple”hasbothhypernymsof“company”and“fruit”inProbase.Wecannotuse“apple”isA“company”toinfer“pear”isA“company.”•Ingeneral,themultiplesensesmaketheinferenceoftrulymissinghypernymsmoredifficult.FindMissingisAviaTransitivityTransitivityintaxonomies•OneofthemostimportantpropertiesoftheisArelationship:transitivity.•Inhuman-craftedtaxonomies,transitivityistakenforgranted•Example1IsEinsteinascientist?•Indata-drivenlexicaltaxonomies,transitivitydoesnotalwayshold•Example2IsEinsteinaprofession?•Example3Isacarseatapieceoffurniture?Inadata-drivenlexicaltaxonomy,whenthetransitivityholds?Ifwecandetermineinwhichcasestransitivityhold,wecangeneratemanymissingisArelations.Challenges•Itisnotatrivialtasktotellwhethertransitivityholdsinadata-drivenlexicaltaxonomy•Naiveapproach:enforcewordsensedisambiguation,justasWordNetdoes•Performingwordsensedisambiguationiscostlyinahugelexicaltaxonomy•Dividingthemeaningofawordintofiniteanddiscretesensesisnotalwayspossible•chairincludesofficechair,bench,stool,carseat,etc.Problemstatementandbasicidea•Problemstatement:•Input:foragiventripleA,B,CinProbasesatisfyingthathyponym(A,B)andhyponym(B,C)•Output:judgewhetherhyponym(A,C)iscorrectornot•Ouridea•Asupervisedbinaryclassificationproblem•Ourworks:•HowtobuildtheLabeleddataset?•HowtodesigneffectiveFeatures?Constructionofthelabeleddataset•Objective•Collectgroundtruthsabouttransitivity•Source•WordNetcontainshypernym-hyponymrelationsamongsynsets.•Example•ThewordtankhastwosynsetsinWordNet.•tank1=storagetank,tank2=armytank.•InWordNet,•hyponym(watertank,tank1),hyponym(tank1,vessel),hyponym(tank2,militaryvehicle)•Thenwatertank,tank,vesselispositive,watertank,tank,militaryvehicleisnegative.•hyponym(watertank,vessel)holdsbecausethetworelationsusethesames