Undergraduate Graduation Project

Title: Automatic Web Page Topic Clustering and Classification: A Study of Automatic Web Page Topic Classification Methods

Abstract

With the development of technology, the volume of information circulating on the Internet keeps increasing, and the rapid expansion of the Internet has caused Web information to grow explosively. This growth in information-processing demand poses a major challenge for text classification technology. For classification tasks on large-scale data, the scalability of algorithms and the classification of Web pages are current research hotspots in automatic document classification.

This thesis studies methods for automatic Web page topic classification. It first introduces the current mainstream text classification techniques and analyzes the advantages and disadvantages of the methods available at each step. Because no unified corpus currently exists, the thesis constructs a corpus for the IT domain. The main work includes extracting news articles, selecting the training set, and computing feature weights, together with an analysis of the characteristics of the IT-domain text classification model.

Each mainstream text classification technique has its own strengths and weaknesses. Based on the characteristics of the IT-domain corpus, and on an analysis that combines the advantages and disadvantages of the methods at each step with the characteristics of the IT-domain text classification model, the thesis proposes a new method: a combined classifier built from naive Bayes and a support vector machine. The algorithm is then validated.

Keywords: automatic Web page topic; text classification; IT-domain corpus; combined classifier

Contents

Abstract
Chapter 1: Introduction
  1.1 Research Background and Significance
  1.2 Current State of Research in China and Abroad
  1.3 Research Content of This Thesis
Chapter 2: Text Classification Techniques
  2.1 The General Process of Text Classification
  2.2 Text Representation
    2.2.1 Text Preprocessing
    2.2.2 Text Representation Models
  2.3 Common Text Classification Algorithms
    2.3.1 The Naive Bayes Algorithm
    2.3.2 Support Vector Machine (SVM)
    2.3.3 K-Nearest-Neighbor Classifier (KNN)
Chapter 3: The IT-Domain Text Classification Model
  3.1 The IT-Domain Text Classification Model
  3.2 Design of the IT Corpus
  3.3 Determining Feature Weights
  3.4 A Combined Classifier of Naive Bayes and Support Vector Machine
Chapter 4: Text Classification System Experiments and Analysis of Results
  4.1 The System Is Divided into Four Modules
  4.2 The Classification Process
  4.3 Experimental Results: Classification Performance of SVM and NB on the IT Corpus
  4.4 Experimental Results of the Combined Classifier
Chapter 5: Summary and Outlook
  5.1 Summary