Using linear algebra for intelligent information r

Using Linear Algebra for Intelligent Information Retrieval

Michael W. Berry† and Susan T. Dumais‡

Computer Science Department, CS-94-270, December 1994

Abstract. Currently, most approaches to retrieving textual materials from scientific databases depend on a lexical match between words in users' requests and those in or assigned to documents in a database. Because of the tremendous diversity in the words people use to describe the same document, lexical methods are necessarily incomplete and imprecise. Using the singular value decomposition (SVD), one can take advantage of the implicit higher-order structure in the association of terms with documents by determining the SVD of large sparse term by document matrices. Terms and documents represented by 200-300 of the largest singular vectors are then matched against user queries. We call this retrieval method Latent Semantic Indexing (LSI) because the subspace represents important associative relationships between terms and documents that are not evident in individual documents. LSI is a completely automatic yet intelligent indexing method, widely applicable, and a promising way to improve users' access to many kinds of textual materials, or to documents and services for which textual descriptions are available. A survey of the computational requirements for managing LSI-encoded databases as well as current and future applications of LSI is presented.

Key words. indexing, information, latent, matrices, retrieval, semantic, singular value decomposition, sparse, updating

AMS(MOS) subject classifications. 15A18, 15A48, 65F15, 65F50, 68P20

1. Introduction. Typically, information is retrieved by literally matching terms in documents with those of a query. However, lexical matching methods can be inaccurate when they are used to match a user's query. Since there are usually many ways to express a given concept (synonymy), the literal terms in a user's query may not match those of a relevant document. In addition, most words have multiple meanings (polysemy), so terms in a user's query will literally match terms in irrelevant documents. A better approach would allow users to retrieve information on the basis of a conceptual topic or meaning of a document.

Latent Semantic Indexing (LSI) [4] tries to overcome the problems of lexical matching by using statistically derived conceptual indices instead of individual words for retrieval. LSI assumes that there is some underlying or latent structure in word usage that is partially obscured by variability in word choice. A truncated singular value decomposition (SVD) [14] is used to estimate the structure in word usage across documents. Retrieval is then performed using the database of singular values and vectors obtained from the truncated SVD. Performance data shows that these statistically derived vectors are more robust indicators of meaning than individual terms.

A number of software tools have been developed to perform operations such as parsing document texts, creating a term by document matrix, computing the truncated SVD of this matrix, creating the LSI database of singular values and vectors for retrieval, matching user queries to documents, and adding new terms or documents to an existing LSI database [4, 23]. The bulk of LSI processing time is spent in computing the truncated SVD of the large sparse term by document matrices.

Section 2 is a review of basic concepts needed to understand LSI. Section 3 uses a constructive example to illustrate how LSI represents terms and documents in the same semantic space, how a query is represented, how additional documents are added (or folded-in), and how SVD-updating represents additional documents. In Section 4, an algorithm for SVD-updating is discussed along with a comparison to the folding-in process with regard to robustness of query matching and computational complexity. Section 5 surveys promising applications of LSI along with parameter estimation problems that arise with its use.

This research was supported by the National Science Foundation under grant Nos. NSF-CDA-9115428 and NSF-ASC-92-03004. Submitted to SIAM Review.
† Department of Computer Science, 107 Ayres Hall, University of Tennessee, Knoxville, TN 37996-1301, berry@cs.utk.edu.
‡ Information Science Research Group, Bellcore, 445 South Street, Room 2L-371, Morristown, NJ 07962-1910, std@bellcore.com.

2. Background. The singular value decomposition is commonly used in the solution of unconstrained linear least squares problems, matrix rank estimation, and canonical correlation analysis [2]. Given an $m \times n$ matrix $A$, where without loss of generality $m \ge n$ and $\mathrm{rank}(A) = r$, the singular value decomposition of $A$, denoted by SVD($A$), is defined as

$$A = U \Sigma V^T \tag{1}$$

where $U^T U = V^T V = I_n$ and $\Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_n)$, $\sigma_i > 0$ for $1 \le i \le r$, $\sigma_j = 0$ for $j \ge r+1$. The first $r$ columns of the orthogonal matrices $U$ and $V$ define the orthonormal eigenvectors associated with the $r$ nonzero eigenvalues of $AA^T$ and $A^T A$, respectively. The columns of $U$ and $V$ are referred to as the left and right singular vectors, respectively, and the singular values of $A$ are defined as the diagonal elements of $\Sigma$, which are the nonnegative square roots of the $n$ eigenvalues of $AA^T$ [14].

The following two theorems illustrate how the SVD can reveal important information about the structure of a matrix.

Theorem 2.1. Let the SVD of $A$ be given by Equation (1) and $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_r > \sigma_{r+1} = \cdots = \sigma_n = 0$, and let $R(A)$ and $N(A)$ denote the range and null space of $A$, respectively. Then,
1. rank property: $\mathrm{rank}(A) = r$, $N(A) \equiv \mathrm{span}\{v_{r+1}, \ldots, v_n\}$, and $R(A) \equiv \mathrm{span}\{u_1, \ldots, u_r\}$, where $U = [u_1\ u_2\ \cdots\ u_m]$ and $V = [v_1\ v_2\ \cdots\ v_n]$.
2. dyadic decomposition: $A = \sum_{i=1}^{r} u_i \cdot \sigma_i \cdot v_i^T$.
3. norms: $\|A\|_F^2 = \sigma_1^2 + \cdots + \sigma_r^2$, and $\|A\|_2^2 = \sigma_1^2$.

Proof. See [14].

Theorem 2.2. [Ecka
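
The properties listed in Theorem 2.1 can be checked numerically. The following is a minimal sketch, not taken from the report: it assumes NumPy is available, and the 5 x 4 term by document matrix and all variable names are made up for illustration. The fourth column is deliberately the sum of the first two, so the rank is 3 rather than 4.

```python
import numpy as np

# Hypothetical term-by-document counts (rows: terms, columns: documents).
# Column 4 equals column 1 + column 2, so rank(A) = 3.
A = np.array([
    [1.0, 0.0, 0.0, 1.0],
    [0.0, 1.0, 0.0, 1.0],
    [1.0, 1.0, 0.0, 2.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 2.0, 0.0],
])
m, n = A.shape

# Thin SVD: U is m x n, s holds sigma_1 >= ... >= sigma_n, Vt is V^T.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Numerical rank r: count the singular values above a small tolerance.
tol = max(m, n) * np.finfo(float).eps * s[0]
r = int(np.sum(s > tol))

# Orthonormality of the singular vectors: U^T U = V^T V = I_n.
assert np.allclose(U.T @ U, np.eye(n))
assert np.allclose(Vt @ Vt.T, np.eye(n))

# Rank property: v_{r+1}, ..., v_n lie in the null space of A.
null_basis = Vt[r:].T
assert np.allclose(A @ null_basis, 0.0)

# Dyadic decomposition: A is the sum of r rank-one terms sigma_i * u_i * v_i^T.
A_dyadic = sum(s[i] * np.outer(U[:, i], Vt[i]) for i in range(r))
assert np.allclose(A, A_dyadic)

# Norms: ||A||_F^2 is the sum of squared singular values, ||A||_2 is sigma_1.
fro_sq = np.linalg.norm(A, "fro") ** 2
two_norm = np.linalg.norm(A, 2)
assert np.isclose(fro_sq, np.sum(s[:r] ** 2))
assert np.isclose(two_norm, s[0])

# The squared singular values match the eigenvalues of A^T A (sorted descending).
eigvals = np.linalg.eigvalsh(A.T @ A)[::-1]
assert np.allclose(s ** 2, eigvals)

print("rank:", r)
print("singular values:", s)
```

The tolerance used to decide which singular values count as zero is one common convention (the largest singular value scaled by machine epsilon and the larger matrix dimension); other thresholds are possible. The matrices the report discusses are large and sparse, so a dense SVD like this one is only suitable for small illustrations.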
