Security Level:
Classified Index:
U.D.C:
Code: 10075
No.: 20091328

A Dissertation for the Degree of M. Engineering

Research on Identifying Review Spam for Product Reviews

Candidate: Liu Lijia
Supervisor: Prof. Yuan Fang
Academic Degree Applied: Master of Engineering
Specialty: Computer Applied Technology
University: Hebei University
Date of Oral Examination: June, 2012

Abstract

In recent years, with the rapid development of the Internet, the way people express opinions and communicate with one another has changed. In the field of product reviews, people increasingly like to post their opinions about products on shopping websites. These user-posted opinions contain a wealth of useful information. At the same time, they are also filled with useless and untruthful spam, and the presence of this spam degrades the quality of product review mining. This thesis studies review spam identification for Chinese product reviews. The main work is as follows.

First, based on an analysis of review spam in Chinese product reviews, spam reviews are divided into two categories, useless reviews and untruthful reviews, and a different identification method is proposed for each category according to its characteristics.

The identification of useless reviews is treated as a binary classification problem. Four important classification features are used: product feature words, evaluative sentences about non-product information, questions, and hyperlinks. In addition, the information gain method is used to extract a further set of features automatically, and the two sets together represent the review text. The values of these features convert each review into a vector, and a classification method based on Logistic regression then divides the reviews into normal reviews and useless reviews, which completes the identification of useless reviews.

For the identification of untruthful reviews, word order is taken into account and a 2-gram model is used to represent the review text. To avoid zero probabilities when building the language model, the model is smoothed with the Katz smoothing method. Finally, the KL divergence between each pair of language models is computed; if its value is below a given threshold, the review is considered untruthful.

Experimental results show that the methods proposed in this thesis can effectively identify the useless reviews and untruthful reviews that exist in product reviews.

Keywords: review spam; Logistic regression; 2-gram model; Katz smoothing; KL divergence

Contents

Chapter 1 Introduction
1.1 Research Background and Significance
1.2 Research Status and Analysis
1.2.1 Research Status of Review Spam Identification
1.2.2 Main Problems in Review Spam Identification
1.3 Main Research Content and Thesis Organization
1.3.1 Main Research Content
1.3.2 Thesis Organization
1.4 Chapter Summary
Chapter 2 Background Knowledge
2.1 Characteristics of Reviews in the Product Review Domain
2.2 Definition of Review Spam in the Product Review Domain
2.3 Feature Extraction Methods
2.4 Language Models
2.4.1 Basic Concepts
2.4.2 Smoothing Methods for Language Models
2.5 Introduction to Similarity Measures
2.6 Chapter Summary
Chapter 3 Useless Review Identification Based on Information Gain and Logistic Regression
3.1 Construction of Classification Features