Text Classification with scikit-learn

Categories: Data Mining, Machine Learning, Python
2014-04-13 20:53
Tags: 20newsgroups, text mining, Python, scikit, scipy

I could not find a unified benchmark in the text-mining papers, so I had to run the experiments myself. If anyone passing by knows of current benchmark classification results on 20newsgroups or another good public dataset (ideally results for all classes; whether all or only some of the features are used does not matter), please leave a comment. Many thanks!

Now for the main content. The 20newsgroups website offers three versions of the dataset; here we use the original one, 20news-19997.tar.gz.

The work breaks down into the following steps:

- Load the dataset
- Extract features
- Classification
    o Naive Bayes
    o KNN
    o SVM
- Clustering

Note: there is a reference example on the scipy website, but it is a bit messy and has bugs. In this post we go through it block by block.

Environment: Python 2.7 + Scipy (scikit-learn)

1. Loading the dataset

Download the dataset from 20news-19997.tar.gz, extract it into the scikit_learn_data folder, and load the data. See the code comments for details.

```python
# first extract the 20 news_group dataset to /scikit_learn_data
from sklearn.datasets import fetch_20newsgroups

# all categories
# newsgroup_train = fetch_20newsgroups(subset='train')

# part categories
categories = ['comp.graphics',
              'comp.os.ms-windows.misc',
              'comp.sys.ibm.pc.hardware',
              'comp.sys.mac.hardware',
              'comp.windows.x']
newsgroup_train = fetch_20newsgroups(subset='train', categories=categories)
# the feature-extraction code below also needs the test split
newsgroups_test = fetch_20newsgroups(subset='test', categories=categories)
```

We can check that the data loaded correctly:

```python
# print category names
from pprint import pprint
pprint(list(newsgroup_train.target_names))
```

Result:

    ['comp.graphics',
     'comp.os.ms-windows.misc',
     'comp.sys.ibm.pc.hardware',
     'comp.sys.mac.hardware',
     'comp.windows.x']

2. Extracting features

The newsgroup_train object we just loaded is a collection of documents. We need to extract features from them, such as term frequencies, using fit_transform.

Method 1. HashingVectorizer, with a fixed number of features

```python
# newsgroup_train.data is the original documents, but we need to extract the
# feature vectors in order to model the text data
from sklearn.feature_extraction.text import HashingVectorizer

# non_negative=True is the argument from the scikit-learn release used here;
# newer releases replaced it with alternate_sign=False
vectorizer = HashingVectorizer(stop_words='english', non_negative=True,
                               n_features=10000)
fea_train = vectorizer.fit_transform(newsgroup_train.data)
# HashingVectorizer is stateless, so transform is enough for the test split
fea_test = vectorizer.transform(newsgroups_test.data)

# return feature vector 'fea_train' [n_samples, n_features]
print('Size of fea_train:' + repr(fea_train.shape))
print('Size of fea_test:' + repr(fea_test.shape))
# 11314 documents, 130107 vectors for all categories
# percentage of non-zero entries in the training matrix
print('The average feature sparsity is {0:.3f}%'.format(
    fea_train.nnz / float(fea_train.shape[0] * fea_train.shape[1]) * 100))
```

Result:

    Size of fea_train:(2936, 10000)
    Size of fea_test:(1955, 10000)
    The average feature sparsity is 1.002%

Because we limit the representation to 10,000 terms, i.e. 10,000 feature dimensions, the share of non-zero entries (what the code calls sparsity) is not all that low. In practice, TfidfVectorizer yields tens of thousands of feature dimensions. Over all samples I counted more than 130,000 dimensions, which gives a very sparse matrix.

**************************************************

As the code comments above mention, TF-IDF extracts a different number of feature dimensions on train and test. How do we make them the same? There are two ways:

Method 2. CountVectorizer + TfidfTransformer

Let the two CountVectorizers share a vocabulary:

```python
#----------------------------------------------------
# first approach: CountVectorizer + TfidfTransformer
print('*************************\nCountVectorizer+TfidfTransformer\n*************************')
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

count_v1 = CountVectorizer(stop_words='english', max_df=0.5)
counts_train = count_v1.fit_transform(newsgroup_train.data)
print('the shape of train is ' + repr(counts_train.shape))

# count_v2 reuses the vocabulary learned on the training set,
# so train and test end up with the same columns
count_v2 = CountVectorizer(vocabulary=count_v1.vocabulary_)
counts_test = count_v2.fit_transform(newsgroups_test.data)
print('the shape of test is ' + repr(counts_test.shape))

tfidftransformer = TfidfTransformer()

# learn the IDF weights on the training counts only, then apply them to both
# splits; refitting on the test counts would weight the two splits differently
tfidf_train = tfidftransformer.fit_transform(counts_train)
tfidf_test = tfidftransformer.transform(counts_test)
```

Result:

    *************************
    CountVectorizer+TfidfTransformer
    *************************
    the shape of train is (2936, 66433)
    the shape of test is (1955, 66433)

Method 3. TfidfVectorizer

Let the two TfidfVectorizers share a vocabulary:

```python
# second approach: TfidfVectorizer
print('*************************\nTfidfVectorizer\n*************************')
from sklearn.feature_extraction.text import TfidfVectorizer

tv = TfidfVectorizer(sublinear_tf=True,
                     max_df=0.5,
                     stop_words='english')
tfidf_train_2 = tv.fit_transform(newsgroup_train.data)

# tv2 shares tv's vocabulary, so the column layout matches the training set.
# Caveat: tv2 re-estimates IDF on the test set and uses default settings
# (no sublinear_tf); tv.transform(newsgroups_test.data) would reuse the
# training-set weights instead.
tv2 = TfidfVectorizer(vocabulary=tv.vocabulary_)
tfidf_test_2 = tv2.fit_transform(newsgroups_test.data)
print('the shape of train is ' + repr(tfidf_train_2.shape))
print('the shape of test is ' + repr(tfidf_test_2.shape))

analyze = tv.build_analyzer()
# statistical features/terms
# (scikit-learn 1.0 and later name this get_feature_names_out())
tv.get_feature_names()
```

Result:

    *************************
    TfidfVectorizer
    *************************
    the shape of train is (2936, 66433)
    the shape of test is (1955, 66433)

In addition, sklearn ships a ready-made function for fetching the features, fetch_20newsgroups_vectorized.

Method 4. fetch_20newsgroups_vectorized

This method cannot pick out the features for just a few classes, though. It can only return the features for all 20 classes at once:

```python
print('*************************\nfetch_20newsgroups_vectorized\n*************************')
from sklearn.datasets import fetch_20newsgroups_vectorized

tfidf_train_3 = fetch_20newsgroups_vectorized(subset='train')
tfidf_test_3 = fetch_20newsgroups_vectorized(subset='test')
print('the shape of train is ' + repr(tfidf_train_3.data.shape))
print('the shape of test is ' + repr(tfidf_test_3.data.shape))
```

Result:

    *************************
    fetch_20newsgroups_vectorized
    *************************
    the shape of train is (11314, 130107)
    the shape of test is (7532, 130107)

3. Classification

3.1 Multinomial Naive Bayes Classifier

See the code and comments; no further explanation needed.

######################################################
# Multino