Shanghai Jiao Tong University
Master's Thesis

Research on the Application of Data Mining Methods Based on Bayesian Theory in Email Classification

Name: Li Shaoyou (李少猷)
Degree applied for: Master
Major: Management Science and Engineering
Advisor: Shen Huizhang (沈惠璋)
Date: 2007-01-01

RESEARCH ON THE APPLICATION OF DATA MINING METHODS BASED ON BAYESIAN THEORY IN EMAIL CLASSIFICATION

ABSTRACT

As human society enters the information era, email plays an increasingly important role in business and daily life. Email gives us a fast, cheap, and convenient communication channel, but it is also used to send messages to people who do not want to receive them. Such email is called junk mail, or spam. Spam volume has grown rapidly in recent years and has brought many technical and social problems, which have recently become severe enough that they must be confronted.

Many anti-spam technologies have arisen in response. Anti-spam technology, or email filtering, is in essence email classification. Email classification began with static systems based on simple rules and gradually developed to use data mining: the system learns, identifies, and judges the content and behavior of spam, then dynamically generates and adjusts its classification rules so that email is classified intelligently. Email classification is an application of data mining that is studied in both academia and industry. Considering storage space, response speed, and computational complexity, Bayesian approaches are the most important mainstream technology in this area.

This thesis takes the perspective of knowledge discovery in databases. It starts from the selection of target data, data preprocessing, and data transformation; then discusses the models and relationships that data mining can produce; and finally analyzes, explains, and assesses the predictions and performance of different models based on Bayesian theory.

The thesis examines data mining methods based on Bayesian theory in concrete detail. First, we build an email classification model and explore its basic assumptions and categories. Then we discuss methods for extracting and selecting email features, focusing on document frequency and information gain. Finally, we compare three classification algorithms based on Bayesian theory under different feature extraction methods, different criteria for feature importance, and different types of features. We also examine the effectiveness of supervised training.

This research yields a framework for application systems built on different Bayesian approaches. We walk through the methods and details of feature extraction and selection, supervised training, classifier design, performance evaluation, and feedback as they relate to machine learning and data mining. Although the research targets the specific field of email classification, the data mining methods it applies to text are generally applicable. Classification models can be applied in many fields, such as credit risk assessment, fraud detection, and even price forecasting in the securities market. With this wide range of applications in view, the thesis provides a generally applicable data mining framework based on the Bayesian approach.

KEYWORDS: Email classification; Machine learning; Statistical learning; Data mining; Bayesian methods

[…] 2007111 […]

1.1 […]

1.1.1 […] Internet […] Mobile […] Email […] Spam […]

1.1.2 […] 1994412 […] Perl […] 6000 […] Usenet […] Postini […] 200611611390%117020062005112006361.53%63.97%2.440.919.33,20051017.252.08 […] 1996 […] 2004 […] Paul Wouters [1] […]

Figure 1-1 Statistical analysis of Paul Wouters's personal spam, 1996-2004

1.2 […] Data Mining […] Knowledge Discovery in Databases (KDD) […] Yahoo […] Google […]

1.3 […] 200610 […]

1.3.1 […] Internet Service Provider (ISP) […] IP […] Blacklist […] Whitelist […] Block List […] IP […] ISP […] Reverse DNS lookup […] Microsoft Sender ID […] IP […] DNS servers […] keyword-based […] feature-based […] HTML […] hyperlink […] Microsoft Outlook […] SpamAssassin [2] […]

1.3.2 […] Cohen [3] […] RIPPER […] cluster algorithms […] decision trees […] support vector machines […] Bayesian methods […] artificial neural networks […] Jrennie […] ifile [4] […] Sahami et al. [5] […] 100% […] spam precision […] 98.3% […] spam recall […] Androutsopoulos et al. [6] […] 100% […] 99.99% […] Paul [7] […] 99.75% […] 0.03% […] Gary Robinson [8] […] Paul […] Robinson […] Spambayes [9] […] Bogofilter [10] […] Sahami […] Paul […] Robinson […]

1.3.3 […] Public Key Infrastructure (PKI) […] IBM, Yahoo […]

1.4 […]

1.4.1 […]

1.4.2 […]

2.1 […]

Figure 2-1 Classifying emails into two categories

Figure 2-2 Classifying emails into three categories

2.2 […] 1 […] 2 […] 3 […]

2.3 […] VSM (Vector Space Model) […] Salton [11] […] M […] S […] L […] m […] M […] t_1, t_2, …, t_n […] m […] (w_1, w_2, …, w_n) […] (p_1, p_2, …, p_n) […] w_1, w_2, …, w_n […] t_1, t_2, …, t_n […] m […] p_1, p_2, …, p_n […]

w_i (i = 1, 2, …, n) […] 0-1 […]: w_i = 1 (t_i ∈ m), w_i = 0 (t_i ∉ m) […] binary […] t_i […] w_i […] [-1, 1] […] w_i […] t_i […] m […] w_i = 0 […] t_i […] w_i […]

[…] [3]

\[ \mathrm{TFIDF}(t_i, m) = N(t_i, m) \cdot \mathrm{IDF}(t_i) \tag{2-1} \]

\[ \mathrm{TF}(t_i, m) = N(t_i, m) \tag{2-2} \]

\[ \mathrm{IDF}(t_i) = \log\left(N / n_{t_i}\right) \tag{2-3} \]

\[ w_i = \mathrm{TFIDF}(t_i, m), \quad t_i \in m \tag{2-4} \]

\[ w_i = (-1) \cdot \mathrm{IDF}(t_i), \quad t_i \notin m \tag{2-5} \]

Here N is the number of emails in M, n_{t_i} is the number of emails in M that contain t_i, and N(t_i, m) is the number of occurrences of t_i in m. […] IDF […] \log(N / n_{t_i} + 0.01) […] w_i […] IDF(t_i) […] t_i […] TF(t_i) […] m […] t_i […] m […]

3.1 […]

3.1.1 […] Google […]

3.1.2 […] RFC822 [12] […] RFC2045 [13] […] Multipurpose Internet Mail Extensions (MIME) […] (header) […] (body) […] (MTA, […]) […] MTA […] MUA […] MS Outlook […] MTA […] MUA […] RFC822 header fields […]

Return-Path: BEA_Edu@edmchina.bea.com--
Received: from mx.sjtu.edu.cn (mx.sjtu.edu.cn [202.112.26.52])
    by mail.sjtu.edu.cn (Postfix) with ESMTP id 6258528AF
    for lishaoyou@sjtu.edu.cn; Fri, 16 Dec 2005 20:08:03 +0800 (BEIST)
Received: from mta2.primary.ddc.dartmail.net (mta2.primary.ddc.dartmail.net [146.82.220.232])
    by mx.sjtu.edu.c