1文本挖掘技术杨建武Email:yangjw@pku.edu.cn复习北京大学计算机科学技术研究所文本挖掘技术(2012春)2课程主要内容第一章:引言(2学时)第二章:文本特征提取技术(4学时)第三章:文本检索技术(6学时)第四章:文本自动分类技术(3学时)第五章:文本自动聚类技术(3学时)第六章:话题检测追踪技术(3学时)第七章:文本过滤技术(3学时)第八章:关联分析技术(1学时)第九章:文档自动摘要技术(2学时)第十章:信息抽取(3学时)第十一章:智能问答(QA)技术(3学时)第十二章:文本情感分析技术(3学时)第十三章:Ontology技术(3学时)第十四章:半结构化文本挖掘方法(1.5学时)第十五章:文本挖掘工具与应用(1.5学时)3文本挖掘的概念“文本挖掘”TextMining,TextDataMining,KnowledgeDiscoveryinText,KnowledgeDiscoveryinTextualData(bases)Textminingmainlyisaboutsomehowextractingtheinformationandknowledgefromtext对KDD定义进行扩展,文本挖掘是从大量文本数据中抽取隐含的,未知的,可能有用的信息。4文本挖掘模型结构示意图文本源文本结构分析分词文本分析名字识别日期处理数字处理用户界面结果展示用户浏览检索结果词性标注特征提取特征词及权重关键词摘要特定信息抽取分类聚类过滤检索TDT5语言理解系统文本分句词法分析/分词词性标注短语分析句法分析语义分析语篇分析理解一种语言另一种语言跨语言处理语篇结构/命题网络语义结构„„句法及句法功能结构短语结构词性序列标准化词序列句子序列自然形态面向不同应用有不同的形式6文档模型布尔模型向量空间模型概率模型7向量空间模型(VSM)向量空间模型中将文档表达为一个矢量,看作向量空间中的一个点Example:D1=2T1+3T2+5T3D2=3T1+7T2+T3Q=0T1+0T2+2T3T3T1T2D1=2T1+3T2+5T3D2=3T1+7T2+T3Q=0T1+0T2+2T37325•IsD1orD2moresimilartoQ?•Howtomeasurethedegreeofsimilarity?Distance?Angle?Projection?8Termfrequency(tf)Themoreoftenawordoccursinadocument,thebetterthattermisindescribingwhatthedocumentisaboutOftennormalized,e.g.bythelengthofthedocument9Inversedocumentfrequency(idf)Termsthatappearinmanydocumentsinthecollectionarenotveryusefulfordistinguishingarelevantdocumentfromanon-relevantoneDocumentfrequency(df):numberofdocumentscontainingthetermTFIDF:TF*IDF10基于VSM的相关度计算方法基于向量空间模型的常用方法向量内积向量夹角余弦T3T1T2D1=2T1+3T2+5T3D2=3T1+7T2+T3Q=0T1+0T2+2T3732511向量内积相似度Binary:D=1,1,1,0,1,1,0Q=1,0,1,0,0,1,1sim(D,Q)=3WeightedD1=2T1+3T2+5T3D2=3T1+7T2+T3Q=0T1+0T2+2T3sim(D1,Q)=2*0+3*0+5*2=10sim(D2,Q)=3*0+7*0+1*2=2tkkkyxyxyxSim1)(),(12余弦相似度Q=0T1+0T2+2T3D1=2T1+3T2+5T3CosSim(D1,Q)=10/(38*4)=0.81D2=3T1+7T2+T3CosSim(D2,Q)=2/(59*4)=0.132t3t1t2D1D2Q1tktktkkkyxyxkkyxyxyxSim12121)(||||),(两个向量间夹角的余弦值作为文档间的相似度余弦相似度与内积相似度的差别?Innerproductnormalizedbythevectorlengths13词频矩阵表示文档词频的词频矩阵d1d2d3d4d5d6t132285356915320t236190765713370t325331604822126t430140702011635词频矩阵:矩阵表示一组文档行对应关键词t,列对应文档d向量将每一个文档视为空间的一个向量向量值反映单词t与文档d的关联度14文档特征分析实例文档集(9个文档)c1HumanmachineinterfaceforLabABCcomputerapplicationc2Asurveyofuseropinionofcomputersystemresponsetimec3TheEPSuserinterfacemanagementsystemc4SystemandhumansystemengineeringtestingofEPSc5Relationsofuser-perceivedresponsetimetoerrormeasurementm1Thegenerationofrandom,binary,unorderedtreesm2Theintersectiongraphofpathsintreesm3GraphminorsIV:Widthsoftreesandwell-quasi-orderingm4Graphminors:Asurvey分词,选择特征词,过滤常用词15文档向量化c1c2c3c4c5m1m2m3m4human100100000interface101000000computer110000000user011010000system011200000response010010000time010010000EPS001100000survey010000001trees000001110graph000000111minors000000011文档向量化(取12个特征词)16文档间相似度||||),(yxyxyxSimtktktkkkyxyxkk12121)(文档间相似度点积余弦tkkkyxyxyxSim1)(),(AAMT),(),(),(),(jjiijijiMMMM17文档间相似度(余弦)c1c2c3c4c5m1m2M3m4c110.240.290.2400000c20.2410.410.330.710000.24c30.290.4110.610.290000c40.240.330.61100000c500.710.29010000m10000010.710.580m2000000.7110.820.41m3000000.580.8210.67m400.2400000.410.671||||),(yxyxyxSimtktktkkkyxyxkk12121)(18查询:相关度||||),(yxyxyxSimQuery:Human-ComputerInteractiontktktkkkyxyxkk12121)(c1c2c3c4c5m1m2m3m4human100100000interface101000000computer110000000user011010000system011200000response010010000time010010000EPS001100000survey010000001trees000001110graph000000111minors000000011c1c2c3c4c5m1m2m3m4Query0.820.2800.3500000qT=10100000000019查询结果c1HumanmachineinterfaceforLabABCcomputerapplicationc2Asurveyofuseropinionofcomputersystemresponsetimec3TheEPSuserinterfacemanagementsystemc4SystemandhumansystemengineeringtestingofEPSc5Relationsofuser-perceivedresponsetimetoerrormeasurementm1Thegenerationofrandom,binary,unorderedtreesm2Theintersectiongraphofpathsintreesm3GraphminorsIV:Widthsoftreesandwell-quasi-orderingm4Graphminors:Asurveyc1c2c3c4c5m1m2m3m4Query0.820.2800.3500000Query:Human-ComputerInteraction20词频矩阵词频矩阵(12*9)(12个特征维,9个文档)(t=12,d=9)100100000101000000110000000011010000011200000010010000010010000001100000010000001000001110000000111000000011A=21SVD分解降维VUATkkkkRowsofUk=terms;RowsofVk=documents22SVD变换空间的相似度|)(|||))((),()()(iTkkiTkTkiVSqVSqDQsimq’kT=0.46-0.07Vkc1c2c3c4c5m1m2m3m40.200.610.460.540.280.000.020.020.08-0.060.17-0.13-0.230.110.190.440.620.53c1c2c3c4c5m1m2m3m40.990.930.990.980.90-0.01-0.09-0.110.04),(iDQsim3.342.54SkkTTkUqqTkkSVA变换函数:TkU23检索结果对比qT(101000000000)0.82c1–humaninterfacecom0.35c4–systemhumansystemeps0.28c2–surveyusercomputersystemresponsetime0.00c3–epsuserinterfacesystem0.00c5–userresponsetime0.00m1–trees0.00m2–graphtrees0.00m3–graphminorstrees0.00m4–graphminorssurveyq’kT(0.46-0.07)0.99c3–epsuserinterfacesystem0.99c1–humaninterfacecomputer0.98c4–systemhumansystemeps0.93c2–surveyusercomputersystemresponsetime0.90c5–userresponsetime0.04m4–graphminorssurvey-0.01m1–trees-0.09m2–graphtrees-0.11m3–graphminorstreesQuery:Human-ComputerInteraction24检索质量的评价{relevant}={A,B,C,D,E,F,G,H,I,J}=10{retrieved}={B,D,F,W,Y}=5{relevant}∩{retrieved}={B,D,F}=3查准率