Text Classification and Clustering


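Page 1: Outline. Topics named include KNN.

Page 4: The classification task
- An instance $x \in X$, where $X$ is the instance space.
- A fixed set of categories $C = \{c_1, c_2, \ldots, c_n\}$.
- Goal: determine the category $c(x) \in C$ of $x$, where $c: X \to C$.

Page 5: Kinds of classification
- Binary (2 classes), multi-class (more than 2 classes), multi-label.
- Corpus mentioned: Reuters.

Page 6: Category codes A B C D E F G H I J K N O P Q R S U V X, with T subdivided into TB TD TE TF TG TH TJ TK TL TM TN TP TQ TS TU TV.

Page 8: [Figure: hierarchical classification example. Top-level categories (AI), (Programming), (HCI); leaf categories Multimedia, GUI, Garb.Coll., Semantics, ML, Planning. Keywords shown per leaf: planning, temporal, reasoning, plan, language, ... / programming, semantics, language, proof, ... / learning, intelligence, algorithm, reinforcement, network, ... / garbage, collection, memory, optimization, region, ... Test document: "planning language proof intelligence".]

Page 9: Evaluation measures, including F1.

Page 10: Contingency table
Let a be the number of documents in the class and assigned to it, b those assigned but not in the class, c those in the class but not assigned, and d those neither in the class nor assigned.
- precision $= a / (a + b)$
- recall $= a / (a + c)$
- fallout $= b / (b + d)$

Page 11: BEP and the F measure
- BEP (break-even point): the value at which precision equals recall, $p = r$.
- $F_\beta(r, p) = \dfrac{(\beta^2 + 1)\,p\,r}{\beta^2 p + r}$
- With $\beta = 1$: $F_1 = \dfrac{2\,p\,r}{p + r}$

Page 12: Macro-averaging and micro-averaging.

Page 13: TREC; CMU, Berkeley, Cornell.

Page 14: 863.

Page 15: 1992.

Page 19: Bayes' rule
$P(H \mid E) = \dfrac{P(E \mid H)\,P(H)}{P(E)}$
It follows from:
$P(H \mid E) = \dfrac{P(H \wedge E)}{P(E)}$
$P(E \mid H) = \dfrac{P(H \wedge E)}{P(H)}$
$P(H \wedge E) = P(E \mid H)\,P(H)$

Page 20: Bayesian classification
- Categories $\{c_1, c_2, \ldots, c_n\}$; $E$ is the description of the instance.
- $P(c_i \mid E) = \dfrac{P(c_i)\,P(E \mid c_i)}{P(E)}$
- $\sum_{i=1}^{n} P(c_i \mid E) = \sum_{i=1}^{n} \dfrac{P(c_i)\,P(E \mid c_i)}{P(E)} = 1$
- Hence $P(E) = \sum_{i=1}^{n} P(c_i)\,P(E \mid c_i)$

Page 21: Bayesian classification (cont.)
- Required: the priors $P(c_i)$ and the conditionals $P(E \mid c_i)$.
- $P(c_i)$ is estimated from the data: if $n_i$ of the examples in $D$ belong to $c_i$, then $P(c_i) = n_i / |D|$.
- The description is a conjunction of features, $E = e_1 \wedge e_2 \wedge \cdots \wedge e_m$, which makes $P(E \mid c_i)$ hard to estimate directly.

Page 22: The naive Bayes assumption
The features are taken to be conditionally independent given the class, so only $P(e_j \mid c_i)$ is needed:
$P(E \mid c_i) = P(e_1 \wedge e_2 \wedge \cdots \wedge e_m \mid c_i) = \prod_{j=1}^{m} P(e_j \mid c_i)$

Page 23: Naive Bayes for text (training)
Let V be the vocabulary of the training set D.
For each category $c_i \in C$:
  Let $D_i$ be the subset of D in category $c_i$.
  $P(c_i) = |D_i| / |D|$
  Let $n_i$ be the total number of word occurrences in $D_i$.
  For each word $w_j \in V$:
    Let $n_{ij}$ be the number of occurrences of $w_j$ in $D_i$.
    $P(w_j \mid c_i) = (n_{ij} + 1) / (n_i + |V|)$

Page 24: Naive Bayes for text (classification)
- Given a test document X containing n words, return
  $\operatorname{argmax}_{c_i \in C} \; P(c_i) \prod_{j=1}^{n} P(w_j \mid c_i)$
  where $w_j$ is the word at position j of X.

Page 25: Naive Bayes example
- $C = \{allergy, cold, well\}$
- $e_1$ = sneeze; $e_2$ = cough; $e_3$ = fever
- $E = \{sneeze, cough, \neg fever\}$

Prob              Well    Cold    Allergy
P(c_i)            0.9     0.05    0.05
P(sneeze | c_i)   0.1     0.9     0.9
P(cough | c_i)    0.1     0.8     0.7
P(fever | c_i)    0.01    0.7     0.4

Page 26: Naive Bayes example (cont.)
- P(well | E) = (0.9)(0.1)(0.1)(0.99) / P(E) = 0.0089 / P(E)
- P(cold | E) = (0.05)(0.9)(0.8)(0.3) / P(E) = 0.01 / P(E)
- P(allergy | E) = (0.05)(0.9)(0.7)(0.6) / P(E) = 0.019 / P(E)
- Most probable category: allergy
- P(E) = 0.0089 + 0.01 + 0.019 = 0.0379
- P(well | E) = 0.23
- P(cold | E) = 0.26
- P(allergy | E) = 0.50

Page 27: Play-tennis example: estimating P(x_i | C)

Outlook    Temperature  Humidity  Windy  Class
sunny      hot          high      false  N
sunny      hot          high      true   N
overcast   hot          high      false  P
rain       mild         high      false  P
rain       cool         normal    false  P
rain       cool         normal    true   N
overcast   cool         normal    true   P
sunny      mild         high      false  N
sunny      cool         normal    false  P
rain       mild         normal    false  P
sunny      mild         normal    true   P
overcast   mild         high      true   P
overcast   hot          normal    false  P
rain       mild         high      true   N

P(p) = 9/14, P(n) = 5/14

Page 28: Play-tennis example: conditional probabilities

outlook
  P(sunny | p) = 2/9      P(sunny | n) = 3/5
  P(overcast | p) = 4/9   P(overcast | n) = 0
  P(rain | p) = 3/9       P(rain | n) = 2/5
temperature
  P(hot | p) = 2/9        P(hot | n) = 2/5
  P(mild | p) = 4/9       P(mild | n) = 2/5
  P(cool | p) = 3/9       P(cool | n) = 1/5
humidity
  P(high | p) = 3/9       P(high | n) = 4/5
  P(normal | p) = 6/9     P(normal | n) = 1/5
windy
  P(true | p) = 3/9       P(true | n) = 3/5
  P(false | p) = 6/9      P(false | n) = 2/5

Page 29: Play-tennis example: classifying X
- Unseen sample X = (rain, hot, high, false)
- P(X | p) · P(p) = P(rain | p) · P(hot | p) · P(high | p) · P(false | p) · P(p) = 3/9 · 2/9 · 3/9 · 6/9 · 9/14 = 0.010582
- P(X | n) · P(n) = P(rain | n) · P(hot | n) · P(high | n) · P(false | n) · P(n) = 2/5 · 2/5 · 4/5 · 2/5 · 5/14 = 0.018286
- X is classified as class n.

Page 30: Joachims (1996)
- 20 newsgroups with 1000 articles each.
- 2/3 of the data used for training, 1/3 for testing.
- With 20 classes, random guessing gives 5% accuracy; the reported accuracy is 89%.

Page 33: K nearest neighbors (KNN).

Page 34: KNN
- Given the training set N and a new object y, find the most similar training objects:
  $sim_{MAX}(y) = \max_{x \in N} sim(x, y)$
  $A = \{x \in N \mid sim(x, y) = sim_{MAX}(y)\}$
- With k neighbors, A is the set of the k most similar objects.
- Let $n_1$, $n_2$ be the numbers of elements of A that belong to $c_1$, $c_2$. Then
  $p(c_1 \mid y) = \dfrac{n_1}{n_1 + n_2}$, $\quad p(c_2 \mid y) = \dfrac{n_2}{n_1 + n_2}$
- Assign y to whichever of $c_1$, $c_2$ has the larger estimate.

Page 35: kNN example: k = 1 gives class A; k = 4 gives class B; k = 10 gives class B.

Page 38: tf/idf weighting is mentioned.

Page 40: KNN compared with NB.

Page 42: Decision tree algorithms: CLS, ID3, C4.5, CART, Assistant.

Page 45: A decision tree as a disjunction of conjunctions
$(Outlook = Sunny \cap Humidity = Normal) \cup (Outlook = Overcast) \cup (Outlook = Rain \cap Wind = Weak)$

Page 47: An NP- complexity result is mentioned.

Page 50: Information gain
$Gain(S, A) = Entropy(S) - \sum_{v \in Values(A)} \dfrac{|S_v|}{|S|}\, Entropy(S_v)$

Page 51: Training data: the same 14-row play-tennis table as on page 27.

Page 52: Worked example: Gain(S, Wind)
$Values(Wind) = \{Weak, Strong\}$
$S = [9+, 5-]$
$S_{Weak} \leftarrow [6+, 2-]$
$S_{Strong} \leftarrow [3+, 3-]$
$Gain(S, Wind) = Entropy(S) - \sum_{v \in \{Weak, Strong\}} \dfrac{|S_v|}{|S|}\, Entropy(S_v)$
$= Entropy(S) - (8/14)\, Entropy(S_{Weak}) - (6/14)\, Entropy(S_{Strong})$
$= 0.940 - (8/14)\,0.811 - (6/14)\,1.00$
$= 0.048$

Page 53: Comparing Humidity and Wind
- S: [9+, 5-], E = 0.940. Humidity: high [3+, 4-], E = 0.985; normal [6+, 1-], E = 0.592.
  Gain(S, Humidity) = 0.940 - (7/14) 0.985 - (7/14) 0.592
- S: [9+, 5-], E = 0.940. Wind: weak [6+, 2-], E = 0.811; strong [3+, 3-], E = 1.000.
  Gain(S, Wind) = 0.940 - (8/14) 0.811 - (6/14) 1.000

Page 54: Choosing the root attribute
- Gain(S, Outlook) = 0.246
- Gain(S, Humidity) = 0.151
- Gain(S, Wind) = 0.048
- Gain(S, Temperature) = 0.029
- Outlook is selected.

Page 55: Partial tree after splitting on Outlook
- Root: D1, D2, ..., D14 [9+, 5-]
- Sunny: D1, D2, D8, D9, D11 [2+, 3-] (to be split further)
- Overcast: D3, D7, D12, D13 [4+, 0-] (leaf: Yes)
- Rain: D4, D5, D6, D10, D14 [3+, 2-] (to be split further)
$S_{sunny} = \{D1, D2, D8, D9, D11\}$
$Gain(S_{sunny}, Humidity) = 0.970 - (3/5)\,0.0 - (2/5)\,0.0 = 0.970$
$Gain(S_{sunny}, Temperature) = 0.970 - (2/5)\,0.0 - (2/5)\,1.0 - (1/5)\,0.0 = 0.570$
$Gain(S_{sunny}, Wind) = 0.970 - (2/5)\,1.0 - (3/5)\,0.918 = 0.019$

Page 56: The ID3 algorithm
ID3(Examples, target-attribute, Attributes):
1. Create a Root node for the tree.
2. A ← the attribute in Attributes that best classifies Examples; set the decision attribute of Root ← A.
3. For each possible value vi of A:
   a. Add a branch below Root for the test A = vi.
   b. Let Examples_vi be the subset of Examples that have value vi for A.
   c. If Examples_vi is empty, add a leaf with label = the most common value of target-attribute in Examples.
   d. Otherwise add the subtree ID3(Examples_vi, target-attribute, Attributes - {A}).
4. Return Root.

Page 57: C4.5, an extension of ID3.

Page 58: Overfitting.

Page 59: Pruning: forward pruning and backward pruning.

Page 62: Applications of text classification: Yahoo; spam filtering with categories {spam, not-spam}.

Page 63: Text Clustering

Page 68: [Figure: hierarchical clustering example as a taxonomy. animal splits into vertebrate (fish, reptile, amphib., mammal) and invertebrate (worm, insect, crustacean).]

Page 69: Agglomerative (bottom-up) vs. partitional (top-down) clustering.

Page 70: Hierarchical agglomerative clustering (HAC)
- Repeatedly pick the two most similar clusters $c_i$ and $c_j$ and replace them with the merged cluster $c_i \cup c_j$.

Page 71: Example with documents d1, d2, d3, d4, d5: merges produce {d1, d2}, {d4, d5}, and then {d3, d4, d5}.

Page 72: Cluster similarity
- Single Link
- Complete Link
- Group Average

Page 73: Similarity of two clusters $c_i$, $c_j$
- Single Link: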
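
The allergy / cold / well diagnosis example from the naive Bayes pages can be checked with a few lines of Python. This is a minimal sketch, not code from the slides: the names `priors`, `likelihoods` and `posterior` are made up for illustration, and the numbers are copied from the probability table of that example. Absent symptoms contribute $1 - P(e_j \mid c_i)$, as in the slide's use of 0.99, 0.3 and 0.6 for "no fever".

```python
# Priors P(c_i) and per-class symptom probabilities P(e_j | c_i),
# copied from the diagnosis example (names are illustrative only).
priors = {"well": 0.9, "cold": 0.05, "allergy": 0.05}
likelihoods = {
    "sneeze": {"well": 0.1, "cold": 0.9, "allergy": 0.9},
    "cough": {"well": 0.1, "cold": 0.8, "allergy": 0.7},
    "fever": {"well": 0.01, "cold": 0.7, "allergy": 0.4},
}


def posterior(evidence):
    """Return P(c | E) for every class c.

    `evidence` maps a symptom name to True (present) or False (absent).
    """
    scores = {}
    for c, prior in priors.items():
        score = prior
        for symptom, present in evidence.items():
            p = likelihoods[symptom][c]
            # Naive Bayes assumption: multiply per-feature probabilities.
            score *= p if present else 1.0 - p
        scores[c] = score
    total = sum(scores.values())  # P(E), the normalising constant
    return {c: s / total for c, s in scores.items()}


post = posterior({"sneeze": True, "cough": True, "fever": False})
best = max(post, key=post.get)
print(best, post)
```

Without intermediate rounding the posteriors come out near 0.23 (well), 0.28 (cold) and 0.49 (allergy). The slide's 0.26 and 0.50 result from rounding the unnormalised scores to 0.01 and 0.019 first; the winning class is allergy either way.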

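
The information-gain figures quoted for the play-tennis data (Outlook 0.246, Humidity 0.151, Wind 0.048, Temperature 0.029) can be reproduced from the 14-row table. The sketch below is an illustration under stated assumptions rather than the slides' own code: `ROWS`, `entropy` and `gain` are hypothetical names, and the Windy column's true/false values correspond to Wind = Strong/Weak in the gain pages.

```python
from collections import Counter
from math import log2

ATTRS = ["Outlook", "Temperature", "Humidity", "Windy"]

# The 14 play-tennis rows; the last field is the class (P or N).
ROWS = [
    ("sunny", "hot", "high", "false", "N"),
    ("sunny", "hot", "high", "true", "N"),
    ("overcast", "hot", "high", "false", "P"),
    ("rain", "mild", "high", "false", "P"),
    ("rain", "cool", "normal", "false", "P"),
    ("rain", "cool", "normal", "true", "N"),
    ("overcast", "cool", "normal", "true", "P"),
    ("sunny", "mild", "high", "false", "N"),
    ("sunny", "cool", "normal", "false", "P"),
    ("rain", "mild", "normal", "false", "P"),
    ("sunny", "mild", "normal", "true", "P"),
    ("overcast", "mild", "high", "true", "P"),
    ("overcast", "hot", "normal", "false", "P"),
    ("rain", "mild", "high", "true", "N"),
]


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def gain(rows, attr_index):
    """Information gain of splitting `rows` on the attribute at `attr_index`."""
    labels = [r[-1] for r in rows]
    remainder = 0.0
    for value in {r[attr_index] for r in rows}:
        subset = [r[-1] for r in rows if r[attr_index] == value]
        # Weight each branch's entropy by its share of the examples.
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(labels) - remainder


gains = {name: gain(ROWS, i) for i, name in enumerate(ATTRS)}
best = max(gains, key=gains.get)
for name, g in sorted(gains.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {g:.3f}")
```

Applying `gain` to only the five sunny rows gives the second-level comparison in the same way, which is how ID3 continues down each impure branch.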