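Chapter 12: Text Sentiment Analysis Techniques
Yang Jianwu (杨建武)
Email: yangjw@pku.edu.cn
Institute of Computer Science and Technology, Peking University
Text Mining Techniques (Spring 2012)

The concept of affective computing
Affective computing uses computer technology to automatically analyze the sentiment orientation, and its strength, contained in objects such as text, images, audio, or video.
• Examples: positive or negative, like or dislike, happiness or sadness, anger and fear.
Categories of affective computing
• Subjectivity
  – subjective, objective, and neutral
• Orientation
  – positive (commendatory), negative (derogatory), and neutral

Applications of affective computing
• 81% of Internet users (or 60% of Americans) have done online research on a product at least once.
• Among readers of online reviews of restaurants, hotels, and various services (e.g., travel agencies or doctors), between 73% and 87% report that reviews had a significant influence on their purchase.
• Consumers report being willing to pay from 20% to 99% more for a 5-star-rated item than for a 4-star-rated item (the variance stems from what type of item or service is considered).
• Businesses and organizations: product and service benchmarking, market intelligence.
  – Businesses spend a huge amount of money to find consumer sentiments and opinions: consultants, surveys, focus groups, etc.
• Individuals: interested in others' opinions when
  – purchasing a product or using a service,
  – finding opinions on political topics.
• Ad placement: placing ads in user-generated content.
  – Place an ad when someone praises a product.
  – Place an ad from a competitor when someone criticizes a product.
• Opinion retrieval/search: providing general search for opinions.

Challenges
• Determine whether a document or a portion of it (e.g., a paragraph or statement) is subjective.
  – Example: "the battery lasts 2 hours" vs. "the battery only lasts 2 hours"
• The difficulty lies in the richness of human language use. Example:
  1. This is a great camera.
  2. A great amount of money was spent for promoting this camera.
  3. One might think this is a great camera. Well think again, because .....
• A single keyword ("great") is used here to convey three different opinions: positive, neutral, and negative, respectively.

Text sentiment computing
• Sentiment orientation of words or phrases
• Sentiment orientation of documents and sentences
• Opinion mining
  – Feature-based opinion mining
  – Comparative opinion mining

Sentiment orientation of words
• Opinion words or phrases (also called polar words, opinion-bearing words, etc.). E.g.,
  – Positive: beautiful, wonderful, good, amazing
  – Negative: bad, poor, terrible
• Important to note:
  – Some opinion words are context independent (e.g., good).
  – Some are context dependent (e.g., long).
• Three main ways to compile such a list:
  – Manual approach: not a bad idea, since it is only a one-time effort
  – Corpus-based approaches
  – Dictionary-based approaches
• 1997: Hatzivassiloglou et al. computed the sentiment orientation of adjectives from the semantic constraints imposed by conjunctions.
• 2002: Turney et al. proposed using search engine queries to compute the pointwise mutual information (PMI) between words: AltaVista's NEAR operator with "excellent" and "poor".
• 2003: Turney et al. further proposed computing the semantic orientation of words based on latent semantic analysis (LSA).
• 2004: Kamps et al. proposed a WordNet-based method that classifies a word by computing its semantic distance to "good" and to "bad".

Labeling the sentiment orientation of words
nice, handsome, terrible, comfortable, painful, expensive, fun, scenic, slow

SO-PMI
• Measuring Praise and Criticism: Inference of Semantic Orientation from Association (Turney 2003)
• SO-PMI (Semantic Orientation from Pointwise Mutual Information)

Dictionary-based methods
• Typically use WordNet's synsets and hierarchies to acquire opinion words.
  – Start with a small seed set of opinion words.
  – Use the set to search for synonyms and antonyms in WordNet (Hu and Liu, KDD-04; Kim and Hovy, COLING-04).
  – Manual inspection may be used afterward.
• Use additional information (e.g., glosses) from WordNet and learning (Andreevskaia and Bergler, EACL-06) (Esuli and Sebastiani, CIKM-05).
• Weakness of the approach: it does not find context-dependent opinion words,
  – e.g., small, long, fast.
• Chinese resources: HowNet, Tongyici Cilin (同义词词林)

Document-level sentiment orientation analysis
• Classify documents (e.g., reviews) based on the overall sentiments expressed by opinion holders (authors):
  – positive, negative, and (possibly) neutral.
• Similar to, but different from, topic-based text classification.
  – In topic-based text classification, topic words are important.
  – In sentiment classification, sentiment words are more important, e.g., great, excellent, horrible, bad, worst, etc.
• 2003: Turney used the average orientation of the words appearing in a review to represent the orientation of the whole review.
• 2003: Dave et al. used word orientation to represent document orientation, taking the strength of each word's orientation into account.
• 2002: Bo Pang et al. were the first to introduce machine learning methods into sentiment analysis, using Naïve Bayes, Maximum Entropy, SVM, and other classifiers to classify sentiment automatically at the document level. (The authors collected labeled movie reviews from IMDB.)
• 2004: Bo Pang et al. further proposed judging the subjectivity of the sentences in a document by combining machine learning with minimum cuts in graphs.
• 2005: Bo Pang et al. extended their work further, using machine learning to rate movie reviews on a 3-level or 4-level scale.

Unsupervised review classification (Turney, ACL-02)
• Data: reviews from epinions.com on automobiles, banks, movies, and travel destinations.
• The approach: three steps.
• Step 1:
  – Part-of-speech tagging.
  – Extract two consecutive words (two-word phrases) from reviews if their tags conform to some given patterns, e.g., (1) JJ, (2) NN.
• Step 2: Estimate the semantic orientation (SO) of the extracted phrases.
  – Use pointwise mutual information:
    PMI(word1, word2) = log2( P(word1 & word2) / (P(word1) P(word2)) )
  – Semantic orientation (SO):
    SO(phrase) = PMI(phrase, "excellent") - PMI(phrase, "poor")
  – Use the AltaVista NEAR operator to search for the number of hits needed to compute PMI and SO.
• Step 3: Compute the average SO of all phrases.
  – Classify the review as recommended if the average SO is positive, not recommended otherwise.
• Final classification accuracy:
  – automobiles: 84%
  – banks: 80%
  – movies: 65.83%
  – travel destinations: 70.53%

Sentiment classification using machine learning methods (Pang et al., EMNLP-02)
• This paper directly applied several machine learning techniques to classify movie reviews into positive and negative.
• Three classification techniques were tried:
  – Naïve Bayes
  – Maximum entropy
  – Support vector machine
• Pre-processing settings: negation tag, unigram (single words), bigram, POS tag, position.
• SVM: the best accuracy, 83% (unigram).

Sentence-level sentiment orientation analysis
• Document-level sentiment classification is too coarse for most applications.
• Much of the work on sentence-level sentiment analysis focuses on identifying subjective sentences in news articles.
  – Classification: objective and subjective.
  – All techniques use some form of machine learning.
  – E.g.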