ISSN1000-9825,CODENRUXUEWE-mail:jos@iscas.ac.cnJournalofSoftware,Vol.17,No.9,September2006,pp.1848−1859:+86-10-62562563©2006byJournalofSoftware.Allrightsreserved.∗1,1+,1,21(,410073)2(,410073)AdvancesinMachineLearningBasedTextCategorizationSUJin-Shu1,ZHANGBo-Feng1+,XUXin1,21(SchoolofComputer,NationalUniversityofDefenseTechnology,Changsha410073,China)2(SchoolofMechantronicsEngineeringandAutomation,NationalUniversityofDefenseTechnology,Changsha410073,China)+Correspondingauthor:Phn:+86-731-4513504,E-mail:bfzhang@nudt.edu.cnSuJS,ZhangBF,XuX.Advancesinmachinelearningbasedtextcategorization.JournalofSoftware,2006,17(9):1848−1859.:Inrecentyears,therehavebeenextensivestudiesandrapidprogressesinautomatictextcategorization,whichisoneofthehotspotsandkeytechniquesintheinformationretrievalanddataminingfield.Highlightingthestate-of-artchallengingissuesandresearchtrendsforcontentinformationprocessingofInternetandothercomplexapplications,thispaperpresentsasurveyontheup-to-datedevelopmentintextcategorizationbasedonmachinelearning,includingmodel,algorithmandevaluation.Itispointedoutthatproblemssuchasnonlinearity,skeweddatadistribution,labelingbottleneck,hierarchicalcategorization,scalabilityofalgorithmsandcategorizationofWebpagesarethekeyproblemstothestudyoftextcategorization.Possiblesolutionstotheseproblemsarealsodiscussedrespectively.Finally,somefuturedirectionsofresearcharegiven.Keywords:automatictextcategorization;machinelearning;dimensionalityreduction;kernelmethod;unlabeleddataset;skeweddataset;hierarchicalcategorization;large-scaletextcategorization;Webpagecategorization:,.,.Web,..:;;;;;;;;Web:TP181:A∗SupportedbytheNationalNaturalScienceFoundationofChinaunderGrantNos.90604006,60303012();theNationalResearchFoundationfortheDoctoralProgramofHigherEducationofChinaunderGrantNo.20049998027()Received2005-12-15;Accepted2006-04-03:1849,.,.,(textcategorization,TC),(label),..2090,,,[1].(representation)(effectiveness)3.Sebastiani[1],:(1)(term)(VSM)(selection)(extraction)(dimensionalityreduction),χ2,IG,MI,OR(LSI);(2),(inductiveconstruction);(3),(precision)(recall)(BEP)Fβ(F1)(accuracy),Reuters.,,.[1],.,,.1.2..1,BOW(bagofwords),,.(linguistics),.1.1VSM,.(n-gram),,tf×idf[1].,DeboleSTW,(χ2)[2];[3,4].tf×idf,SVM.VSM,.Bigi,dcP(ti,d)P(ti,c),i=1,…,|T|(T),.Kullback-Leibler(KLD),VSM[5].(normalization)VSM.Nunzio,,,[6];,,.BOW,,.,,,[7,8].,VSM,,.1850JournalofSoftwareVol.17,No.9,September20061.2,,3.,:(1),BNS(bi-normalseparation)[9];(2),(GI)[10];(3),,(differential)LSI[11,12].:(1),,((co-occurrence)[13]),(2[13,14][15]);(2)[16]WordNet[17],.,.[9−14,18,19]:,;,.:,.FormanYang.:BNS,χ2,IG;,[9,13,18,19].[1,10,20].,,.,(PFC)(THR);(MVS)(SPA).Soucy(PCS),4,MVSPCSSPAPFCTHR[21],.1.3,,()[22−24].,(multiple)(ensemblelearning);(SVM)(single).SVM.SVM,[24−28],(over-fitting),(generalization).:,SVM[28−32].[1],,[33,34][35,36]KLD[5][6](SECTILE)[26](1),.Bayesk-NN,,,.Wu,,.,[25];GIS(instanceset),(GI),GI,k-NN,[37];TsayGIS,,,[38];Tan(drag-pushing)Bayes[39];ChakrabartiSIMPLFisher,[24].,,,SVM.:1851,,(decisionoptimization)(coverageoptimization),.,,(MV(majorityvoting),W(weighted)MVWLC(weightedlinearcombination)[1,40]),;Bennett(reliabilityindicator);[41];XuZhangSVMRocchio,Rocchio,SVM,[42];,,(WMV),BaggingBoosting[43];Boosting,,SVM,AdaBoost.MHAdaBoost.MR[44].Table1Propertiesandeffectivenessformostofthecategorizationmodelsormethods1ModelormethodExamplesofalgorithmorImplementationCRHDBiBestrepteff.RemarkProbabilisticNaïveBayes(NB)0.773Easy,highlydependondatadistributionDecisiontree(DT)ID3,C4.5,CART0.794DecisionruleDL-ESC,SCAR,Ripper,Swap-10.823Oftenusedasbase-lines,relativelyweakRegressionLLSF,LR,RR[45]0.849EffectivebutcomputingcostlyOn-LineWinnow,Windrow-Hoff,etc.0.822LinearCentroid-BasedRocchio(andit’senhancements)0.799WeakerbutsimpleandefficientNeuralnetworksPerceptron,Classi,Nnet0.838NotwidelyusedTCInstance-Basedk-NN0.856InefficientinonlineclassificationSVMSVMlight,LibSVM[46,47]0.920StateofartseffectivenessMV,BaggingN/ANotwidelyusedandtestedyetEnsemblelearningWLC,DCS,ACC,adaboost0.878BoostingmethodseffectiveandpopularSTRIVE[41]0.875ComplexinclassifierconstructionEnsemblelearningSVMwithRocchioensemble[42]+0.019**ImprovementinasmallChinesecorpusMaximumentropyLi.KAZAMA[33,34]0.845EffectivebutnotwidelyusedFuzzyLiu,Widyantoro[35,36]0.892**OnlyaccuracyreportedTermprob.distri.KLDbased[5]0.671**BetterthanRocchiointhesametestBidimensionalHeuristicapproach[6]0.871NotextensivelyconfirmedMDandERbasedSECTILE[26]0.950**OnlytestedinaChinesecorpus,estimatedWu’sRefinementRocchio/NBrefined[25]0.9/0.926AlittlecomplexintrainingTsay’srefinementRocchiorefined[38]+0.018**Improvement,aChinesecorpusGener.instancesetGIS-RGIE-W[37]0.860Moreefficientthank-NNintestingDragpushingRCC,RNB[27,39]0.859EasyandcomputationallyefficientLineardiscri.proj.SIMPL[24]0.880**EstimatedformreporteddataLSkernel[48]WithSVM0.903NeedexpensivematrixprocessingWordseq.kernel[49]WithSVM0.915ComplexandtimespendingintrainingStringkernel[