文本挖掘技术14-XML

整理文档很辛苦,赏杯茶钱您下走!

免费阅读已结束,点击下载阅读编辑剩下 ...

阅读已结束,您可以下载文档离线阅读编辑

资源描述

1半结构化文本挖掘杨建武Email:yangjianwu@icst.pku.edu.cn第十四章:北京大学计算机科学技术研究所文本挖掘技术(2009)2Text-centricXMLretrieval¾DocumentsmarkedupasXML™E.g.,assemblymanuals,journalissues…¾Queriesareuserinformationneeds™E.g.,givemetheSection(element)ofthedocumentthattellsmehowtochangeabrakelightBookChaptersSectionsSubsectionsWorldWideWebThisisonlyonlyanothertolookoneletoshowtheneedanlaaoutstructureofandmoreadocumentandsoasstoitdoenotnecessarytextastructureddocumenthaveretrievalonthewebisanitimportanttopicoftoday’sresearchitissuestomakeselastsentence..3ConceptualmodelStructureddocumentsContent+structureInvertedfile+structureindextf,idf,…Matchingcontent+structurePresentationofrelatedcomponentsDocumentsQueryDocumentrepresentationRetrievalresultsQueryrepresentationIndexingFormulationRetrievalfunctionRelevancefeedback4Approaches…vectorspacemodelextendingDBmodelBooleanmodelnaturallanguageprocessingcognitivemodelontologyparameterestimationtuningsmoothingfusionphrasetermstatisticscollectionstatisticscomponentstatisticsproximitysearchlogisticregressionbeliefmodelrelevancefeedbackdivergencefromrandomnessmachinelearningprobabilisticmodelBayesiannetworklanguagemodel5elementlanguagemodelcollectionlanguagemodelsmoothingparameterλelementscoreelementsizeelementscorearticlescorequeryexpansionwithblindfeedbackignoreelementswith≤20termshighvalueofλleadstoincreaseinsizeofretrievedelementsresultswithλ=0.9,0.5and0.2similarrankelement(UniversityofAmsterdam,INEX2003)Languagemodel6Vectorspacemodelarticleindexabstractindexsectionindexsub-sectionindexparagraphindexRSVnormalisedRSVRSVnormalisedRSVRSVnormalisedRSVRSVnormalisedRSVRSVnormalisedRSVmergetfandidfasforfixedandnon-nestedretrievalunits(IBMHaifa,INEX2003)7VectorspacesandXML¾Vectorspaces–tried+testedframeworkforkeywordretrieval™Other“bagofwords”applicationsintext:classification,clustering…¾Fortext-centricXMLretrieval,canwemakeuseofvectorspaceideas?¾Challenge:capturethestructureofanXMLdocumentinthevectorspace.8VectorspacesandXML¾Forinstance,distinguishbetweenthefollowingtwocasesBookTitleAuthorBillGatesMicrosoftBookTitleAuthorBillWulfThePearlyGates9Content-richXML:representationBookTitleAuthorBillMicrosoftBookTitleAuthorWulfPearlyGatesGatesTheBillLexiconterms.10EncodingtheGatesdifferently¾Whataretheaxesofthevectorspace?¾Intextretrieval,therewouldbeasingleaxisforGates¾Herewemustseparateoutthetwooccurrences,underAuthorandTitle¾Thus,axesmustrepresentnotonlyterms,butsomethingabouttheirpositioninanXMLtree11Queries¾Beforeaddressingthis,letusconsiderthekindsofquerieswewanttohandleBookTitleMicrosoftBookTitleAuthorGatesBill12Subtreesandstructure¾Considerallsubtreesofthedocumentthatincludeatleastonelexiconterm:BookTitleAuthorBillMicrosoftGatesBillMicrosoftGatesTitleMicrosoftAuthorBillAuthorGatesBookTitleMicrosoftBillBookAuthorGatese.g.…13Structuralterms:docs+queries¾Calleachoftheresulting(8+,inthepreviousslide)subtreesastructuralterm¾Createoneaxisinthevectorspaceforeachdistinctstructuralterm¾Eachdocumentbecomesavectorinthespaceofstructuralterms¾Aquerytreecanlikewisebefactoredintostructuralterms™Andrepresentedasavector™Allowsweightingportionsofthequery14StructuraltermsWeight¾Weightsbasedonfrequenciesfornumberofoccurrences(justaswehadtf)¾Alltheusualissueswithterms(stemming?Casefolding?)remain15Exampleoftfweighting¾Herethestructuraltermscontainingtoorbewouldhavemoreweightthanthosethatdon’tPlayActTobeornottobePlayActbePlayActorPlayActnotPlayActtoHowmanyaxesarethereinthisexample?16Down-weighting¾Forthedocontheleft:inastructuraltermrootedatthenodePlay,shouldn’tHamlethaveahighertfweightthanYorick?¾Idea:multiplytfcontributionofatermtoanodeklevelsupbyγk,forsomeγ1.PlayActAlaspoorYorickSceneTitleHamlet17Down-weightingexample,γ=0.8¾Forthedoconthepreviousslide,thetfof™Hamletismultipliedby0.8™Yorickismultipliedby0.64inanystructuraltermrootedatPlay.18Thenumberofstructuralterms¾Canbehuge!¾Impractical(不切实际的)tobuildavectorspaceindexwithsomanydimensions¾Willexaminepragmatic(注重实效的)solutionstothisshortly;fornow,continuetobelieve…Alright,howhuge,really?19Restrictstructuralterms?¾Dependingontheapplication,wemayrestrictthestructuralterms™E.g.,mayneverwanttoreturnaTitlenode,onlyBookorPlaynodes™Sodon’tenumerate/index/retrieve/scorestructuraltermsrootedatsomenodes¾Twosolutions™Query-timematerializationofaxes™Restrictthekindsofsubtreestoamanageableset20Query-timematerialization¾Insteadofenumeratingallstructuraltermsofalldocs(andthequery),enumerateonlyforthequery™Thelatterishopefullyasmallset¾Now,we’rereducedtocheckingwhichstructuralterm(s)fromthequerymatchasubtreeofanydocument¾Thisistreepatternmatching:givenatexttreeandapatterntree,findmatches™Exceptwehavemanytexttrees™Ourtreesarelabeledandweighted21Example¾HereweseekadocwithHamletinthetitle¾Onfindingthematchwecomputethecosinesimilarityscore¾Afterallmatchesarefound,rankbysortingQuery=HamletTitlePlayActAlaspoorYorickSceneText=HamletTitle22(Stillinfeasible(不可实行的))¾AdocwithYoricksomewhereinit:¾Query=¾Willgettoit…YorickTitle23Restrictingthesubtrees¾Enumeratingallstructuralterms(subtrees)isprohibitiv

1 / 62
下载文档,编辑使用

©2015-2020 m.777doc.com 三七文档.

备案号:鲁ICP备2024069028号-1 客服联系 QQ:2149211541

×
保存成功