1半结构化文本挖掘杨建武Email:yangjianwu@icst.pku.edu.cn第十四章:北京大学计算机科学技术研究所文本挖掘技术(2009)2Text-centricXMLretrieval¾DocumentsmarkedupasXMLE.g.,assemblymanuals,journalissues…¾QueriesareuserinformationneedsE.g.,givemetheSection(element)ofthedocumentthattellsmehowtochangeabrakelightBookChaptersSectionsSubsectionsWorldWideWebThisisonlyonlyanothertolookoneletoshowtheneedanlaaoutstructureofandmoreadocumentandsoasstoitdoenotnecessarytextastructureddocumenthaveretrievalonthewebisanitimportanttopicoftoday’sresearchitissuestomakeselastsentence..3ConceptualmodelStructureddocumentsContent+structureInvertedfile+structureindextf,idf,…Matchingcontent+structurePresentationofrelatedcomponentsDocumentsQueryDocumentrepresentationRetrievalresultsQueryrepresentationIndexingFormulationRetrievalfunctionRelevancefeedback4Approaches…vectorspacemodelextendingDBmodelBooleanmodelnaturallanguageprocessingcognitivemodelontologyparameterestimationtuningsmoothingfusionphrasetermstatisticscollectionstatisticscomponentstatisticsproximitysearchlogisticregressionbeliefmodelrelevancefeedbackdivergencefromrandomnessmachinelearningprobabilisticmodelBayesiannetworklanguagemodel5elementlanguagemodelcollectionlanguagemodelsmoothingparameterλelementscoreelementsizeelementscorearticlescorequeryexpansionwithblindfeedbackignoreelementswith≤20termshighvalueofλleadstoincreaseinsizeofretrievedelementsresultswithλ=0.9,0.5and0.2similarrankelement(UniversityofAmsterdam,INEX2003)Languagemodel6Vectorspacemodelarticleindexabstractindexsectionindexsub-sectionindexparagraphindexRSVnormalisedRSVRSVnormalisedRSVRSVnormalisedRSVRSVnormalisedRSVRSVnormalisedRSVmergetfandidfasforfixedandnon-nestedretrievalunits(IBMHaifa,INEX2003)7VectorspacesandXML¾Vectorspaces–tried+testedframeworkforkeywordretrievalOther“bagofwords”applicationsintext:classification,clustering…¾Fortext-centricXMLretrieval,canwemakeuseofvectorspaceideas?¾Challenge:capturethestructureofanXMLdocumentinthevectorspace.8VectorspacesandXML¾Forinstance,distinguishbetweenthefollowingtwocasesBookTitleAuthorBillGatesMicrosoftBookTitleAuthorBillWulfThePearlyGates9Content-richXML:representationBookTitleAuthorBillMicrosoftBookTitleAuthorWulfPearlyGatesGatesTheBillLexiconterms.10EncodingtheGatesdifferently¾Whataretheaxesofthevectorspace?¾Intextretrieval,therewouldbeasingleaxisforGates¾Herewemustseparateoutthetwooccurrences,underAuthorandTitle¾Thus,axesmustrepresentnotonlyterms,butsomethingabouttheirpositioninanXMLtree11Queries¾Beforeaddressingthis,letusconsiderthekindsofquerieswewanttohandleBookTitleMicrosoftBookTitleAuthorGatesBill12Subtreesandstructure¾Considerallsubtreesofthedocumentthatincludeatleastonelexiconterm:BookTitleAuthorBillMicrosoftGatesBillMicrosoftGatesTitleMicrosoftAuthorBillAuthorGatesBookTitleMicrosoftBillBookAuthorGatese.g.…13Structuralterms:docs+queries¾Calleachoftheresulting(8+,inthepreviousslide)subtreesastructuralterm¾Createoneaxisinthevectorspaceforeachdistinctstructuralterm¾Eachdocumentbecomesavectorinthespaceofstructuralterms¾AquerytreecanlikewisebefactoredintostructuraltermsAndrepresentedasavectorAllowsweightingportionsofthequery14StructuraltermsWeight¾Weightsbasedonfrequenciesfornumberofoccurrences(justaswehadtf)¾Alltheusualissueswithterms(stemming?Casefolding?)remain15Exampleoftfweighting¾Herethestructuraltermscontainingtoorbewouldhavemoreweightthanthosethatdon’tPlayActTobeornottobePlayActbePlayActorPlayActnotPlayActtoHowmanyaxesarethereinthisexample?16Down-weighting¾Forthedocontheleft:inastructuraltermrootedatthenodePlay,shouldn’tHamlethaveahighertfweightthanYorick?¾Idea:multiplytfcontributionofatermtoanodeklevelsupbyγk,forsomeγ1.PlayActAlaspoorYorickSceneTitleHamlet17Down-weightingexample,γ=0.8¾Forthedoconthepreviousslide,thetfofHamletismultipliedby0.8Yorickismultipliedby0.64inanystructuraltermrootedatPlay.18Thenumberofstructuralterms¾Canbehuge!¾Impractical(不切实际的)tobuildavectorspaceindexwithsomanydimensions¾Willexaminepragmatic(注重实效的)solutionstothisshortly;fornow,continuetobelieve…Alright,howhuge,really?19Restrictstructuralterms?¾Dependingontheapplication,wemayrestrictthestructuraltermsE.g.,mayneverwanttoreturnaTitlenode,onlyBookorPlaynodesSodon’tenumerate/index/retrieve/scorestructuraltermsrootedatsomenodes¾TwosolutionsQuery-timematerializationofaxesRestrictthekindsofsubtreestoamanageableset20Query-timematerialization¾Insteadofenumeratingallstructuraltermsofalldocs(andthequery),enumerateonlyforthequeryThelatterishopefullyasmallset¾Now,we’rereducedtocheckingwhichstructuralterm(s)fromthequerymatchasubtreeofanydocument¾Thisistreepatternmatching:givenatexttreeandapatterntree,findmatchesExceptwehavemanytexttreesOurtreesarelabeledandweighted21Example¾HereweseekadocwithHamletinthetitle¾Onfindingthematchwecomputethecosinesimilarityscore¾Afterallmatchesarefound,rankbysortingQuery=HamletTitlePlayActAlaspoorYorickSceneText=HamletTitle22(Stillinfeasible(不可实行的))¾AdocwithYoricksomewhereinit:¾Query=¾Willgettoit…YorickTitle23Restrictingthesubtrees¾Enumeratingallstructuralterms(subtrees)isprohibitiv