1文本特征提取技术(2)杨建武Email:yangjianwu@icst.pku.edu.cn第二章:北京大学计算机科学技术研究所文本挖掘技术(2010春)2主要内容分词与词性标注文档模型与相似度计算布尔模型向量空间模型概率模型特征变换隐语义分析(LSA)3文档模型与相似度计算4文档模型布尔模型向量空间模型概率模型5布尔模型建立在经典的集合论和布尔代数的基础上每个词在一篇文档中是否出现,对应权值为0或1文档检索布尔逻辑运算D=1,1,1,0,1,1,0Q=1,0,1,0,0,1,1retrievaldatabasearchitecturecomputertextmanagementinformation6布尔模型优点:简单、易理解、简洁的形式化。缺点:准确匹配,信息需求的能力表达不足。扩展布尔模型p-norm模型[Saltonetal.1983]7向量空间模型(VSM)向量空间模型中将文档表达为向量空间中的一个矢量或一个点空间维(坐标轴)词权重?相似度?T3T1T2D1=2T1+3T2+5T3D2=3T1+7T2+T3Q=0T1+0T2+2T373258TermWeightsThewordsofatextarenotequallyindicativeofitsmeaning“Mostscientiststhinkthatbutterfliesusethepositionofthesunintheskyasakindofcompass(指南针)thatallowsthemtodeterminewhichwayisnorth.”Important:butterflies,north,sun,scientists,compassUnimportant:most,think,kind,sky,Termweightsreflectthe(estimated)importanceofeachterm9Termfrequency(tf)Themoreoftenawordoccursinadocument,thebetterthattermisindescribingwhatthedocumentisaboutOftennormalized,e.g.bythelengthofthedocument10Inversedocumentfrequency(idf)Termsthatappearinmanydocumentsinthecollectionarenotveryusefulfordistinguishingarelevantdocumentfromanon-relevantoneDocumentfrequency(df):numberofdocumentscontainingthetermTFIDF:TF*IDF11SimilarityMeasureasimilaritymeasurecanrepresentthesimilaritybetweentwodocuments,twoqueries,oronedocumentandonequeryitispossibletoranktheretrieveddocumentsintheorderofpresumedimportanceAsimilaritymeasureisafunctionwhichcomputesthedegreeofsimilaritybetweenapairoftextobjectsTherearealargenumberofsimilaritymeasuresproposedintheliterature,becausethebestsimilaritymeasuredoesn'texist(yet!)12基于VSM的相关度计算方法欧氏距离向量内积向量夹角余弦T3T1T2D1=2T1+3T2+5T3D2=3T1+7T2+T3Q=0T1+0T2+2T37325Example:D1=2T1+3T2+5T3D2=3T1+7T2+T3Q=0T1+0T2+2T3•IsD1orD2moresimilartoQ?•Howtomeasurethedegreeofsimilarity?Distance?Angle?Projection?13欧氏距离tkkkyxyxyxDis12)(||),(WeightedD1=2T1+3T2+5T3D2=3T1+7T2+T3sim(D1,D2)=(1+16+16)=33=5.74空间两点的欧氏距离:该方法很少使用?14向量内积相似度Binary:D=1,1,1,0,1,1,0Q=1,0,1,0,0,1,1sim(D,Q)=3retrievaldatabasearchitecturecomputertextmanagementinformationWeighted:D1=2T1+3T2+5T3D2=3T1+7T2+T3Q=0T1+0T2+2T3sim(D1,Q)=2*0+3*0+5*2=10sim(D2,Q)=3*0+7*0+1*2=2tkkkyxyxyxSim1)(),(?15余弦相似度Q=0T1+0T2+2T3D1=2T1+3T2+5T3CosSim(D1,Q)=10/(38*4)=0.81D2=3T1+7T2+T3CosSim(D2,Q)=2/(59*4)=0.13t3t1t2D1D2QtktktkkkyxyxkkyxyxyxSim12121)(||||),(两个向量间夹角的余弦值作为文档间的相似度余弦相似度与内积相似度的差别?Innerproductnormalizedbythevectorlengths?Q=0T1+0T2+2T3D1=2T1+3T2+5T3CosSim(D1,Q)=10/(38*4)=0.81D2=3T1+7T2+T3CosSim(D2,Q)=2/(59*4)=0.13?16余弦相似度||||||||),(yyxxyxyxyxSimtkxkxxxx12||'效率:大量计算两两文档间相似都时,为降低计算量,先对文档进行向量进行单位化tkkkyxyxSim1)(),(''17余弦相似度的应用SmartretrievalsystemG.Saltoned.TheSMARTRetrievalSystem—ExperimentsinAutomaticDocumentRetrievalEnglewoodCliff,NJ:PrenticeHallInc.;1971G.Salton,M.J.McGillIntroductiontoModernInformationRetrievalMcGraw-HillBookCo.NewYork,NY,USA1986tkdktkqktkdkqk)()()(),(18余弦相似度的应用G.Salton,C.Buckley,Termweightingapproachesinautomatictextretrieval.InformationProcessingandManagement24(5):513–523,198819余弦相似度的应用20余弦相似度的应用21D1=2T1+3T2+5T3Sim(D1,Q)=10/(38+4-10)=10/32=0.31D2=3T1+7T2+T3Sim(D2,Q)=2/(59+4-2)=2/61=0.04Q=0T1+0T2+2T3DifferencebetweenJaccardandCosSine?tktkkktktkkkyxyxyxkkyxSim111221)()(),(JaccardCoefficient:Jaccard相似度22基于向量内积的几种方法的对比tktkkktktkkkyxyxyxkk111221)()(InnerProduct:Cosine:Jaccard:tktktkkkyxyxkk11221)(tkkkyx1)(23VSMSummaryBasedonoccurrencefrequenciesonlyConsiderbothlocalandglobaloccurrencefrequenciesAdvantagessimplicityabletohandleweightedtermseasytomodifytermvectorsDisadvantagesassumptionoftermindependencelackthecontrolofBooleanmodel(e.g.requiringatermtoappearinadocument)24R表示相关文档集;表示R的补集;表示文档dj与查询q相关的概率;表示文档dj与查询q不相关的概率;文档概率模型R(|)jPRd(|)jPRd(|)(,)(|)jjjPRdsimdqPRd文档dj与查询q的相似度定义为:根据贝叶斯定理有根据贝叶斯定理有)|()|(~)()|()()|()|()|(),(RdpRdpRpRdpRpRdpdRpdRpqdsimjjjjjjj)()()|()|(BPAPABPBAP25文档概率模型独立性假设:P(AB)=P(A)P(B)当且仅当A与B相互独立对一篇文档而言,若文档中的各个词相互独立,则有P(dj)=P(k1)…P(kt)假设各词独立,则(Ki:所有的词))|()|(~),(RdpRdpqdsimjjj))|(())|(())|(())|((0)(1)(0)(1)(jijijijidgidgidgidgiRkpRkpRkpRkp26文档概率模型)|()|(~),(RdpRdpqdsimjjj))|(())|(())|(())|((0)(1)(0)(1)(jijijijidgidgidgidgiRkpRkpRkpRkp))|(/)|(())|(())|(/)|(())|((1)(1)(1)(1)(jijijijidgiidgidgiidgiRkpRkpRkpRkpRkpRkp))|(())|(())|(())|(())|(())|((1)(1)(1)(1)(RkpRkpRkpRkpRkpRkpidgidgiidgidgijijijiji27V:相关文档子集文档数,Vi:包含索引词ki的相关文档数。文档概率模型(|)iiVPkRV(|)iiinVPkRNV2).不相关文档中的索引词ki的分布可以通过文档集中索引词的分布来估计ni表示包含索引词ki的文档数,N表示集合中的文档总数。取对数,在相同背景下,忽略恒定不变的因子:取对数,在相同背景下,忽略恒定不变的因子:{0,1},{0,1}ijiqww词权重词权重tiiiiiijiqjRkpRkpRkpRkpwwqdsim1)|()|(1log)|(1)|(log~),(1).用相关文档中索引词ki的分布来估计:28N表示集合中的文档总数;V:相关文档子集文档数;ni表示包含索引词ki的文档数;Vi:包含索引词ki的相关文档数。文档概率模型)|()|(1log)|(1)|(logRkpRkpRkpRkpwiiiii)1)()(log(/1/logiiiiVnVNVVVV)()()(logiiiiiiVnVnVNVVVtiiijiqj~),(29文档概率模型))(()(log)(rnrRrnRNrassignediwVRVirN表示集合中的文档总数;R:相关文档子集文档数;n表示包含索引词ki