Hierarchical Bayesian Clustering and Its Application to Text Mining

摘要 (Abstract)

Hierarchical Bayesian Clustering and Its Application to Text Mining
Jiang Ning (Computer Software and Theory)
Supervisor: Shi Zhongzhi

As the volume of information on the Internet keeps growing, retrieving and classifying information quickly and accurately through text mining has become an increasingly urgent requirement, with broad application prospects and practical value. This thesis gives a broad and in-depth treatment of cluster analysis, an important method in text data mining. Starting from an analysis of the characteristics of high-dimensional feature spaces, of which text data is representative, the thesis studies the clustering of text data mainly from a probabilistic perspective, and in particular with Bayesian methods. The work concentrates on the following areas:

1) Probabilistic hierarchical clustering based on changes in document information content. Following ideas from information theory, the text clustering process is analysed in terms of changes in document information content, and the regularities that information content exhibits during hierarchical clustering are studied. This leads to a clustering algorithm based on an information-content model. A further analysis of the algorithm with Bayesian methods shows that the probabilistic interpretation of information-content clustering is the log-likelihood ratio of Bayesian models.

2) Application of Bayesian model selection to cluster analysis. Building on the probabilistic interpretation of that algorithm and starting from the problem domain, the thesis discusses the random process that generates the feature sequence of a document and gives a concrete physical model. It also gives a fairly comprehensive introduction to and summary of model selection in cluster analysis, especially mixture-model methods, and discusses the key techniques one by one. On this basis a Bayesian posterior model is given and combined with the physical model, yielding a hierarchical clustering algorithm that uses the Bayesian posterior probability model. In tests on real text data the algorithm achieved high clustering accuracy.

3) Evaluating clustering accuracy in unsupervised learning. Unlike classification, cluster analysis has no generally accepted standard for objectively evaluating clustering results. The thesis uses average accuracy to evaluate clustering algorithms. To that end it discusses in depth how the PA and NA measures reflect recall and precision in unsupervised learning, and identifies the intrinsic relationship between these measures and recall and precision.

4) Feature reduction in high-dimensional feature spaces. Feature reduction can greatly speed up clustering while having little effect on clustering accuracy. The final part of the thesis discusses an efficient feature similarity measure based on the joint probability of features, applies it to feature clustering, and runs experiments on the algorithms covered in the thesis, with satisfactory results. Notably, some algorithms achieved substantially higher accuracy when clustering on the reduced feature set.

Keywords: text mining, hierarchical clustering, information entropy, model selection, mixture models, Bayesian posterior model, Bayesian estimation, average accuracy, PA/NA, feature clustering

ABSTRACT

Hierarchical Bayesian Clustering and its Application to Text Mining
Jiang Ning (Computer Software and Theory)
Supervised by Professor Shi Zhongzhi

With the rapid growth of information on the Internet, industry increasingly demands advanced information retrieval techniques of high performance and high accuracy, which have the potential to change the way people use the Internet. Text clustering, or unsupervised text classification, is a primary method used in information retrieval. The method has been receiving increasing attention from the community, as it does not need manually classified text for training and is therefore more suitable for large-scale Internet text classification tasks. This thesis discusses text clustering techniques in depth.

The thesis investigates text clustering from a probabilistic point of view, with emphasis on Bayesian approaches. The content is organised into the following sections:

1) Probabilistic hierarchical clustering based on document information quantity. From an information-theoretic angle, we study latent relations between document information quantity and document classification. A hierarchical text clustering algorithm is proposed based on document information quantity. Theoretical analysis shows that the performance of this algorithm can be explained using the log-likelihood ratio of probabilistic models.

2) Bayesian posterior model selection and its application to clustering analysis. Model selection has been shown to be an efficient technique for clustering analysis. The thesis introduces a new model selection approach, Bayesian posterior model selection, which greatly reduces the computational complexity of model selection when used with mixture models and improves the accuracy of Chinese text clustering. Two Bayesian estimation techniques, Maximum Likelihood Estimation and Conditional Expectation Estimation, are compared in this context. A hierarchical clustering algorithm for text clustering based on this new approach is proposed. High accuracy has been achieved in experiments on real-world text clustering.

3) Comparing classification accuracy of unsupervised learning. Unlike supervised classification, there is no commonly accepted method of comparing clustering algorithms. In the thesis, the Average Accuracy criterion is adopted to compare algorithms, and an in-depth discussion of recall and precision in the context of unsupervised learning (PA, NA) is given.

4) Feature clustering on high-dimensional feature space. Feature clustering can remarkably reduce the overhead of clustering analysis on high-dimensional data sets without a significant effect on accuracy. A new probabilistic feature similarity measure is presented in this thesis. The measure allows for efficient feature clustering on extremely large-scale data sets, because its time complexity is independent of the number of documents.

Keywords: Text Mining, Hierarchical Clustering, Information Entropy, Model Selection, Mixture Models, Bayesian Posterior Model, Bayesian Estimation, Average Accuracy, PA/NA, Feature Clustering

目录 (Contents)

摘要 (Abstract)
ABSTRACT
目录 (Contents)
Chapter 1 Introduction
  1.1 Research Background and Significance
    1.1.1 Text Data Mining
    1.1.2 Web Content Mining
  1.2 Analysis of the State of Research in China and Abroad
    1.2.1 Research Methods and Techniques in Cluster Analysis
      1.2.1.1 Data Representation
      1.2.1.2 Similarity Measures for Clustering
      1.2.1.3 Search Algorithms for Clustering
