TextMining14-XML

1. Semi-Structured Text Mining
   Chapter 14 of Text Mining Techniques (Spring 2012)
   Yang Jianwu (杨建武), Email: yangjw@pku.edu.cn
   Institute of Computer Science and Technology, Peking University

2. Text-centric XML retrieval
   Documents marked up as XML
     E.g., assembly manuals, journal issues
   Queries are user information needs
     E.g., give me the Section (element) of the document that tells me how to change a brake light
   [Figure: a document hierarchy labelled Book, Chapters, Sections, Subsections and World Wide Web; the body is filler text illustrating document structure]

3. Conceptual model
   Pipeline labels: Documents, Query, Indexing, Formulation, Document representation, Query representation, Retrieval function, Retrieval results, Relevance feedback
   Annotations for structured retrieval:
     Structured documents
     Content + structure
     Inverted file + structure index
     tf, idf, …
     Matching content + structure
     Presentation of related components

4. Approaches …
   Models: vector space model, probabilistic model, Bayesian network, language model, extending DB model, Boolean model, natural language processing, cognitive model, ontology, logistic regression, belief model, divergence from randomness, machine learning
   Techniques: parameter estimation, tuning, smoothing, fusion, relevance feedback, proximity search
   Statistics: phrase, term statistics, collection statistics, component statistics

5. Language model (University of Amsterdam, INEX 2003)
   Components: element language model, collection language model, smoothing parameter λ
   Ranking: element score, element size and article score are combined to rank each element
   Query expansion with blind feedback
   Ignore small elements (threshold: 20 terms)
   A high value of λ leads to an increase in the size of retrieved elements
   Results with λ = 0.9, 0.5 and 0.2 are similar

6. Vector space model (IBM Haifa, INEX 2003)
   Separate indexes: article index, abstract index, section index, sub-section index, paragraph index
   Each index produces an RSV, which is converted to a normalised RSV
   The normalised RSVs are merged
   tf and idf as for fixed and non-nested retrieval units

7. Vector spaces and XML
   Vector spaces: a tried and tested framework for keyword retrieval
   Other "bag of words" applications in text: classification, clustering
   For text-centric XML retrieval, can we make use of vector space ideas?
   Challenge: capture the structure of an XML document in the vector space.

8. Vector spaces and XML
   For instance, distinguish between the following two cases:
   [Figure: two documents, one containing "Bill Gates" and "Microsoft", the other containing "Bill Wulf" and "The Pearly Gates"]

9. Content-rich XML: representation
   [Figure: tree representation of the two documents, with leaf terms Bill, Gates, Microsoft and Bill, Wulf, The, Pearly, Gates]

10. Encoding the Gates differently
    What are the axes of the vector space?
    In text retrieval, there would be a single axis for Gates
    Here we must separate out the two occurrences, under Author and Title
    Thus, axes must represent not only terms, but something about their position in an XML tree

11. Queries
    Before addressing this, let us consider the kinds of queries we want to handle
    [Figure: query trees over the terms Microsoft, Gates, Bill]

12. Subtrees and structure
    Consider all subtrees of the document that include at least one lexicon term:
    [Figure: subtrees over the terms Bill, Microsoft, Gates]

13. Structural terms: docs + queries
    Call each of the resulting (8+, in the previous slide) subtrees a structural term
    Create one axis in the vector space for each distinct structural term
    Each document becomes a vector in the space of structural terms
    A query tree can likewise be factored into structural terms
      And represented as a vector
      Allows weighting portions of the query

14. Structural terms: weights
    Weights based on frequencies for number of occurrences (just as we had tf)
    All the usual issues with terms (stemming? case folding?) remain

15. Example of tf weighting
    Here the structural terms containing "to" or "be" would have more weight than those that don't
    [Figure: document tree whose leaf text reads "To be or not to be"]

16. Down-weighting
    For the doc on the left: in a structural term rooted at the node Play, shouldn't Hamlet have a higher tf weight than Yorick?
    Idea: multiply the tf contribution of a term to a node k levels up by α^k, for some α < 1.
    [Figure: Play tree containing "Hamlet" one level below Play and "Alas poor Yorick" two levels below]

17. Down-weighting example, α = 0.8
    For the doc on the previous slide, the tf of
      Hamlet is multiplied by 0.8
      Yorick is multiplied by 0.64
    in any structural term rooted at Play.

18. The number of structural terms
    Can be huge!
    Impractical to build a vector space index with so many dimensions
    Will examine pragmatic solutions to this shortly; for now, continue to believe

19. Restrict structural terms?
    Depending on the application, we may restrict the structural terms
    E.g., may never want to return a Title node, only Book or Play nodes
    So don't enumerate/index/retrieve/score structural terms rooted at some nodes
    Two solutions:
      Query-time materialization of axes
      Restrict the kinds of subtrees to a manageable set

20. Query-time materialization
    Here we seek a doc with Hamlet in the title
    On finding the match we compute the cosine similarity score
    After all matches are found, rank by sorting
    [Figure: a query tree with "Hamlet" beside the Play tree containing "Hamlet" and "Alas poor Yorick"]
    Instead of enumerating all structural terms of all docs (and the query), enumerate only for the query

21. Restricting the subtrees
    Enumerating all structural terms (subtrees) is prohibitive for indexing
      Most subtrees may never be used in processing any query
    Can we get away with indexing a restricted class of subtrees?
      Ideally focus on subtrees likely to arise in queries
    Only paths including a lexicon term (IBM Haifa)

22. Example of a retrieval step
    [Figure: Match = (figure not preserved)]

23. XQuery

24. XQuery
    SQL for XML
    Usage scenarios
      Human-readable documents
      Data-oriented documents
      Mixed documents (e.g., patient records)
    Relies on
      XPath
      XML Schema datatypes

25. XQuery
    The principal forms of XQuery expressions are:
      path expressions
      element constructors
      FLWR ("flower") expressions
      list expressions
      conditional expressions
      quantified expressions
      datatype expressions
    Evaluated with respect to a context

26. FLWR
    FOR $p IN document(bib.xml)//publish
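
The language-model approach listed earlier (an element language model smoothed with a collection language model) can be illustrated with a small sketch. The slides do not preserve the scoring formula, so the linear mixture below, the choice that the smoothing parameter weights the element model, and the names element_score and lam are all assumptions made for illustration, not the University of Amsterdam implementation.

```python
import math
from collections import Counter


def element_score(query, element_terms, collection_terms, lam=0.5):
    """Log-likelihood of the query under a smoothed element language model.

    Assumed form (not taken from the slides):
        P(t) = lam * P(t | element) + (1 - lam) * P(t | collection)
    """
    elem = Counter(element_terms)
    coll = Counter(collection_terms)
    n_elem, n_coll = len(element_terms), len(collection_terms)
    score = 0.0
    for term in query:
        p_elem = elem[term] / n_elem if n_elem else 0.0
        p_coll = coll[term] / n_coll if n_coll else 0.0
        p = lam * p_elem + (1.0 - lam) * p_coll
        if p == 0.0:
            # Term unseen in both the element and the collection.
            return float("-inf")
        score += math.log(p)
    return score


if __name__ == "__main__":
    collection = "change brake light check engine oil brake".split()
    section_a = "change brake light".split()
    section_b = "check engine oil".split()
    query = ["brake", "light"]
    for name, terms in (("section_a", section_a), ("section_b", section_b)):
        print(name, element_score(query, terms, collection, lam=0.5))
```

With lam set to 0 every element receives the same collection-only score, so the parameter controls how much the element's own text matters.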

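The structural-term vector space with down-weighting described above can be sketched as follows. This is a minimal illustration under stated assumptions: it uses the restricted class of structural terms (paths ending in a lexicon term) rather than all subtrees, it ignores text that follows a child element, and the tag names Act and Scene, the sample document, and the function names structural_terms and cosine are hypothetical.

```python
import math
import xml.etree.ElementTree as ET
from collections import Counter


def structural_terms(xml_text, alpha=0.8):
    """Map (tag path, term) pairs to down-weighted tf values.

    A term found in an element contributes alpha**k to every path
    rooted k levels above that element (k = 0 for the element itself).
    """
    weights = Counter()

    def visit(node, ancestors):
        path = ancestors + [node.tag]
        for word in (node.text or "").lower().split():
            for start in range(len(path)):
                k = len(path) - 1 - start  # levels between root of path and the text
                weights[("/".join(path[start:]), word)] += alpha ** k
        for child in node:
            visit(child, path)

    visit(ET.fromstring(xml_text), [])
    return weights


def cosine(u, v):
    """Cosine similarity between two sparse structural-term vectors."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical document shaped like the Hamlet example in the slides.
DOC = "<Play><Title>Hamlet</Title><Act><Scene>Alas poor Yorick</Scene></Act></Play>"

if __name__ == "__main__":
    doc_vec = structural_terms(DOC)
    query_vec = structural_terms("<Title>Hamlet</Title>")
    print(doc_vec[("Play/Title", "hamlet")])
    print(doc_vec[("Play/Act/Scene", "yorick")])
    print(cosine(query_vec, doc_vec))
```

Factoring the query through the same function mirrors query-time materialization: only the query's structural terms need to be matched against a document, and a document with Hamlet outside the title shares no axis with the query.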