Journal of Chongqing University, Vol. 33, No. 10, Oct. 2010

Article ID: 1000-582X(2010)10-110-08

Text knowledge management oriented adaptive Chinese word segmentation algorithms

FENG Yong, HE Xun, TANG Li, CHEN Xianyong, CHEN Zhen
(College of Computer Science, Chongqing University, Chongqing 400044, P. R. China)

Received: 2009-05-10
Funding: [...] (2008BB2183); [...] (DJIR10180006); [...] 211 [...] (S10218); [...] (20080440699); [...] (2008BAH37B04); [...] (ACA0700408)
Biography: FENG Yong (b. 1977) [...]; Tel: 13983980003; Email: fengyong@cqu.edu.cn

Abstract: To overcome the shortcomings of traditional dictionary-based matching segmentation in recognizing new words and handling special words, a text knowledge management oriented self-adaptive Chinese word segmentation algorithm (SACWSA) based on the 2-gram statistical model is presented. At the preprocessing stage, SACWSA applies finite state machine theory, a conjunction-based partition method and a divide-and-conquer strategy to split the input text into sub-sentences, which effectively reduces the complexity of the segmentation algorithm. At the word segmentation stage, the 2-gram statistical model is applied, combining partial (local) probability with overall (global) probability to segment the sub-sentences into words, which improves the recognition rate of new words and eliminates ambiguity. At the postprocessing stage, part-of-speech matching rules are established to further eliminate ambiguity in the 2-gram segmentation results. The main features of SACWSA are the use of divide and conquer to handle long sentences and long terms, and the combination of partial and overall probability to identify new words and resolve ambiguity. Experiments on text corpora from different fields show that SACWSA adapts accurately, efficiently and automatically to the text knowledge management requirements of different fields.

Key words: knowledge management; text processing; statistical methods; adaptive algorithms

CLC number: TP182    Document code: A

[...] [1] [...] [2-3] [...] [4] [...]

1 [...]

[...] 3 [...] [5]:

1.1 [...]

[...] [6-7] [...]

1.2 [...]

[...] W [...] S [...] S [...] W [...] Bayes [...] [8-9] [...] [10-11] [...]

1.3 [...]

[...] [12-16] [...]

2 [...]

2.1 [...]

1) [...]
2) [...]
3) [...]

2.2 N-gram [...]

[...] For a word sequence W = w_1 w_2 ... w_k, the probability P(W) is

$$P(W) = P(w_1 w_2 \cdots w_k) = P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_1 w_2) \cdots P(w_k \mid w_1 w_2 \cdots w_{k-1}) = \prod_{i=1}^{k} P(w_i \mid w_1 w_2 \cdots w_{i-1}) \tag{1}$$

[...] the k-th word w_k [...] the preceding k-1 words [...]. In the N-gram model, for w_1 w_2 ... w_k, P(W) [...] w_k [...] the preceding n-1 [...] N [...]

$$P(W) \approx P(w_1)\,P(w_2 \mid w_1) \prod_{i=3}^{k} P(w_i \mid w_{i-2} w_{i-1}) \tag{2}$$

$$P(w_i \mid w_{i-2} w_{i-1}) \approx \frac{\mathrm{count}(w_{i-2} w_{i-1} w_i)}{\mathrm{count}(w_{i-2} w_{i-1})} \tag{3}$$

where count(L) [...] L [...]. [...] 2 [...],

$$P(W) \approx \prod_{i=1}^{k} P(w_i \mid w_{i-1}) \tag{4}$$

[...] 2-gram [...] 2 [...]

2.3 N-gram [...]

N-gram [...] w_k [...]

1) [...] L(w_k, w_1 w_2 ... w_{k-1}) [...] P(w_k | w_1 w_2 ... w_{k-1}) [...]
2) [...] N-gram [...] N [...] N-gram [...] P(W) [...] [12] [...] 30 [...] 4 [...] N-gram [...]
3) [...] N-gram [...]

3 SACWSA [...]

[...] SACWSA (self-adaptive Chinese word segmentation algorithm) [...]

3.1 [...]

[...] 2-gram [...]

3.2 [...]

[...] 3 [...]

1) [...] 1 [...] FSA [...] Sogou (SOHU) [...] (30 [...]) [...] 3% [...] 1.4% [...] 292010: /9//2010/ [...]
2) [...] 2% [...] { [...] } [...]
3) [...] 2 [...] t(k) [...] 3 [...] 3 [...] 2 [...]

3.3 2-gram [...]

[...] 2-gram [...]:

[...]: s = s_1 s_2 ... s_n; s_i = c_{i1} c_{i2} ... c_{ij}, [...] c_{ij} [...];
[...]:
Step 1: [...] Hash [...]
Step 2: [...]
Step 3: [...]
[...]: s_i [...] w_{i1} w_{i2} ... w_{ik} [...] s ([...])

[...] 4 [...] 5 [...]

[Fig. 4 [...] A: 2-gram [...] ([...])]

[Fig. 5 [...] B: [...]]

3.4 [...]

1) [...] 《[...]》 [...] 1998 [...] 1 [...] 2-gram [...] 55000 [...] 12000 [...] 460000 [...] 26 [...]: n, t, s, f, m, q, b, r, v, a, z, d, p, c, u, y, e, o, i, l, j, h, k, g, x, w [...] (nr, ns, nt, nz) [...] 39 [...]
2) [...] P(W) [...] 0 [...] P(W) [...] Add-delta [17] [...]
3) [...] s_i = c_{i1} c_{i2} ... c_{ij} [...] c_1 c_2 ... c_n [...] 2 [...] 2 [...] 2 [...] 38 [...] {c_1 c_2, c_2 c_3, c_3 c_4, ..., c_{n-1} c_n} [...] P(c_1 c_2) [...] P(c_2 c_3) [...] c_1 c_2 [...] c_2 c_3 [...] 2 [...]
4) [...] s [...] + [...] 2 [...] (1) [...] (2) [...] 2 [...] 2 [...] 2 [...] 3 [...] 2 [...]

3.5 2-gram [...]

2-gram [...] 1 [...] 2 [...] 2 [...]: [...] 2-gram [...] (a + a, [...] + [...]) [...] + [...]: [...]

4 [...]

[...] CPU Intel Core 2 Duo 1.80 GHz, 1 G [...], Windows Server 2003 [...] SOGOU [...] 9 [...] IT [...] 1990 [...] 《[...]》 [...] 1998 [...] 1 [...] 3 [...] (3 [...]) [...] 20 [...]

4.1 [...]

1) SOGOU [...]: [...] (PCMA) [...]
[...]: [...] (PCMA) [...]

2) [...]: [...]
[...]: [...]
[...] ([...]): [...]

4.2 [...]

[...] ICTCLAS [...] 20 [...] Sogou [...] 218 [...] (P) = [...]; [...] (R) = [...] 28 [...] 8 [...] 1 [...]

Table 1 [...]

             P (Sogou)   R (Sogou)   P ([...])   R ([...])
SACWSA       0.96        0.956       0.97        0.968
ICTCLAS      0.96        0.95        0.96        0.952

[...] SACWSA [...] Sogou [...] ICTCLAS [...] ICTCLAS [...] 97% [...]

4.3 [...]

[...] 3 [...] 20 [...] 3000~5000 [...]

1) [...]
2) [...] 《[...]》 ([...]) [...] 2008 [...] 11 [...] 1 [...] 21 [...]
3) [...] 300/t [...] 100/t [...]

[...]: [...] (Poov) = [...]; [...] (Roov) = [...] 2 [...]

Table 2 [...]

         Poov (SACWSA)   Roov (SACWSA)   Poov (ICTCLAS)   Roov (ICTCLAS)
[...]    0.98            0.96            0.44             0.41
[...]    0.98            0.95            0.54             0.50
[...]    0.96            0.97            0.48             0.36

[...]: ICTCLAS [...] SACWSA [...] 3 [...] 2 [...]

4.4 [...]

[...] Internet [...] C [...] Java [...] VB [...] ICTCLAS [...] 800 k/s ~ 2 M/s [...] 500 k/s ~ 1 M/s [...] JAVA [...] 300~700 k/s [...] 300 k/s [...]

1) [...] N-gram [...] P(W) [...] SACWSA [...] 2 [...] SOGOU [...] 6 [...]

[Fig. 6: SACWSA [...]]
[...]

2) [...]

Table 3 [...] ([...]) /%

2027520485208 95.5
24277244862489 4.5
28276.52849028895

[...]

5 [...]

[...] 2 [...] SACWSA [...] 3 [...]

References:

[1] GAO J F, WU A D, LI M. Adaptive Chinese word segmentation[C]//Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics. [S.l.]: ACL 2004, 2004: 462-469.
[2] ZHANG M Y, LU Z D, ZOU C Y. A Chinese word segmentation based on language situation in processing ambiguous words[J]. Information Sciences, 2004, 162(3-4): 275-285.
[3] WANG X J, QIN Y, LIU W. A search-based Chinese word segmentation method[C]//Proceedings of the 16th International World Wide Web Conference, 2007: 1129-1130.
[4] WANG H S, CUI M M. A Chinese word segmentation based on machine learning[C]//Proceedings of the 1st International Workshop on Education Technology and Computer Science. [S.l.]: ETCS 2009, 2009, 2: 610-613.
[5] HONG C M, CHEN C M, CHIU C Y. Automatic extraction of new words based on Google News corpora for supporting lexicon-based Chinese word segmentation systems[J]. Expert Systems with Applications, 2009, 36(2): 3641-3651.
[6] ZENG D, WEI D H, CHAU M, et al. Chinese word segmentation for terrorism-related contents[J]. Lecture Notes in Computer Sc