An Adaptive Chinese Word Segmentation Algorithm for Text Knowledge Management


Journal of Chongqing University, Vol. 33, No. 10, Oct. 2010
Article ID: 1000-582X(2010)10-0111-08

FENG Yong, HE Xun, TANG Li, CHEN Xianyong, CHEN Zhen
(College of Computer Science, Chongqing University, Chongqing 400044, P. R. China)

Received: 2009-05-10
Grant numbers: 2008BB2183; DJIR10180006; 211 (S10218); 20080440699; 2008BAH37B04; ACA0700408
Author note: […] (1977) […]; Tel: 13983980003; Email: fengyong@cquedu.cn

Abstract: Traditional dictionary-based matching segmentation is weak at recognizing new words and handling special words. To address this, a self-adaptive Chinese word segmentation algorithm (SACWSA) oriented to text knowledge management is presented, based on a 2-gram statistical model. At the preprocessing stage, SACWSA applies finite state machine theory, a conjunction-based partition method and a divide-and-conquer strategy to split the input text into sub-sentences, which reduces the complexity of the segmentation algorithm. At the segmentation stage, the 2-gram statistical model is combined with local probability and global probability to segment each sub-sentence into words, which improves the recognition rate of new words and eliminates ambiguity. At the postprocessing stage, part-of-speech collocation rules are established to further eliminate ambiguity in the 2-gram segmentation results. The main features of SACWSA are its use of divide and conquer to handle long sentences and long terms, and its combination of local and global probability to identify new words and resolve ambiguity. Experiments on corpora from different fields show that SACWSA adapts accurately and efficiently to the text knowledge management requirements of different domains.

Key words: knowledge management; text processing; statistical methods; adaptive algorithms

CLC number: TP182    Document code: A

[…] [1] […] [2-3] […] [4] […]

1 […] 3 […] [5]:

1.1 […] [6-7] […]

1.2 […] W […] S […] Bayes […] [8-9] [10-11] […]

1.3 […] [12-16]

2 […]

2.1 […] 1) […] 2) […] 3) […]

2.2 N-gram model

[…] For a word sequence W = w_1 w_2 … w_k, the probability P(W) is

$$P(W) = P(w_1 w_2 \cdots w_k) = P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_1 w_2) \cdots P(w_k \mid w_1 w_2 \cdots w_{k-1}) = \prod_{i=1}^{k} P(w_i \mid w_1 w_2 \cdots w_{i-1}) \tag{1}$$

In Eq. (1) the k-th word w_k is conditioned on all k-1 words before it […]. An N-gram model conditions w_k only on the n-1 words before it […]. For N = 3:

$$P(W) \approx P(w_1)\,P(w_2 \mid w_1) \prod_{i=3}^{k} P(w_i \mid w_{i-2} w_{i-1}) \tag{2}$$

$$P(w_i \mid w_{i-2} w_{i-1}) \approx \frac{\mathrm{count}(w_{i-2} w_{i-1} w_i)}{\mathrm{count}(w_{i-2} w_{i-1})} \tag{3}$$

where count(L) […] L […]. For N = 2 (the 2-gram model):

$$P(W) \approx \prod_{i=1}^{k} P(w_i \mid w_{i-1}) \tag{4}$$

2.3 N-gram […]

1) […] L(w_k, w_1 w_2 … w_{k-1}) […] P(w_k | w_1 w_2 … w_{k-1}) […]
2) […] N-gram […] N […] P(W) […] [12] […] 30 […] 4 […]
3) […] N-gram […]

3 SACWSA

[…] SACWSA (self-adaptive Chinese word segmentation algorithm) […]

3.1 […] 2-gram […]

3.2 […]

1) […] FSA […] Sogou (SOHU) […] 3% […] 1.4% […]
2) […] 2% […]
3) […] t(k) […]

3.3 2-gram […]

s = s_1 s_2 … s_n; s_i = c_{i1} c_{i2} … c_{ij} […] c_{ij} […]
Step 1: […] Hash […]
Step 2: […]
Step 3: […]
[…]: s_i […] w_{i1} w_{i2} … w_{ik} […]

3.4 […]

1) […] 1998 […] 2-gram […] 55000 […] 12000 […] 460000 […] 26 […]: n, t, s, f, m, q, b, r, v, a, z, d, p, c, u, y, e, o, i, l, j, h, k, g, x, w […] (nr, ns, nt, nz) […] 39 […]
2) […] P(W) […] 0 […] P(W) […] Add-delta [17] […]
3) […] s_i = c_{i1} c_{i2} … c_{ij} […] c_1 c_2 … c_n […] {c_1c_2, c_2c_3, c_3c_4, …, c_{n-1}c_n} […] P(c_1c_2) […] P(c_2c_3) […] c_1c_2 […] c_2c_3 […]
4) […]

3.5 2-gram […]

4 […]

[…] CPU: Intel Core2 Duo 1.80 GHz; 1 G […]; Windows Server 2003 […] SOGOU […] 9 […] IT […] 1990 […] 1998 […]

4.1 […]

1) SOGOU: […] (PCMA) […]
2) […]

4.2 […]

[…] ICTCLAS […] 20 […] Sogou […] (P) = […]; (R) = […]

Table 1 […]

              P (Sogou)   R (Sogou)   P ([…])   R ([…])
  SACWSA      0.96        0.956       0.97      0.968
  ICTCLAS     0.96        0.95        0.96      0.952

[…] SACWSA […] Sogou […] ICTCLAS […] ICTCLAS […] 97% […]

4.3 […]

[…] 3 […] 20 […] 3000 to 5000 […] 1) […] 2) […] 2008 […] 3) […] (Poov) = […]; (Roov) = […]

Table 2 […]

  […]    Poov (SACWSA)   Roov (SACWSA)   Poov (ICTCLAS)   Roov (ICTCLAS)
  […]    0.98            0.96            0.44             0.41
  […]    0.98            0.95            0.54             0.50
  […]    0.96            0.97            0.48             0.36

[…] ICTCLAS […] SACWSA […]

4.4 […]

[…] Internet […] C, Java, VB […] ICTCLAS […] 800 k/s to 2 M/s […] 500 k/s to 1 M/s […] Java […] 300 to 700 k/s […] 300 k/s […]

1) N-gram […] P(W) […] SACWSA […] SOGOU […]
2) […]

[Table 3: header "[…] ( […] ) / %"; cell boundaries not recoverable: 202752048520895.5242772448624894.528276.52849028895]

5 […] SACWSA […]

References:

[1] GAO J F, WU A D, LI M. Adaptive Chinese word segmentation[C]//Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics. [S.l.]: ACL 2004, 2004: 462-469.
[2] ZHANG M Y, LU Z D, ZOU C Y. A Chinese word segmentation based on language situation in processing ambiguous words[J]. Information Sciences, 2004, 162(3-4): 275-285.
[3] WANG X J, QIN Y, LIU W. A search-based Chinese word segmentation method[C]//Proceedings of the 16th International World Wide Web Conference, 2007: 1129-1130.
[4] WANG H S, CUI M M. A Chinese word segmentation based on machine learning[C]//Proceedings of the 1st International Workshop on Education Technology and Computer Science. [S.l.]: ETCS 2009, 2009, 2: 610-613.
[5] HONG C M, CHEN C M, CHIU C Y. Automatic extraction of new words based on Google News corpora for supporting lexicon-based Chinese word segmentation systems[J]. Expert Systems with Applications, 2009, 36(2): 3641-3651.
[6] ZENG D, WEI D H, CHAU M, et al. Chinese word segmentation for terrorism-related contents[J]. Lecture Notes in Computer Sc
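
As a rough illustration of how the 2-gram model of Eq. (4) can drive segmentation, the Python sketch below scores candidate splits with add-delta-smoothed 2-gram probabilities, picks the best split by dynamic programming, and then computes precision and recall against a reference segmentation. It is not the SACWSA implementation. The toy corpus, the function names (`train_bigram`, `log_prob`, `segment`, `precision_recall`), the `<s>` start marker, `delta = 0.5`, the maximum word length, the rule that unseen multi-character strings are never proposed as words, and the span-based definition of precision and recall are all assumptions made for illustration.

```python
import math
from collections import Counter

START = "<s>"  # hypothetical sentence-start marker (assumption)

# Toy training data (assumption): sentences that are already segmented.
CORPUS = [
    ["研究", "生命", "起源"],
    ["研究", "生命"],
    ["研究生", "学习"],
    ["生命", "起源"],
]

def train_bigram(corpus):
    """Count unigrams and bigrams over already-segmented sentences."""
    uni, bi = Counter(), Counter()
    for words in corpus:
        prev = START
        uni[prev] += 1
        for w in words:
            uni[w] += 1
            bi[(prev, w)] += 1
            prev = w
    return uni, bi

def log_prob(prev, w, uni, bi, delta=0.5):
    """Add-delta estimate of log P(w | prev); V = number of distinct unigrams."""
    v = len(uni)
    return math.log((bi[(prev, w)] + delta) / (uni[prev] + delta * v))

def segment(text, uni, bi, max_len=4, delta=0.5):
    """Pick the split that maximizes the sum of log P(w_i | w_{i-1})."""
    n = len(text)
    # best[j] maps the last word ending at position j to (log-score, previous word)
    best = [dict() for _ in range(n + 1)]
    best[0][START] = (0.0, None)
    for j in range(1, n + 1):
        for i in range(max(0, j - max_len), j):
            w = text[i:j]
            if len(w) > 1 and w not in uni:
                continue  # unseen multi-character strings are not proposed
            for prev, (score, _) in best[i].items():
                cand = score + log_prob(prev, w, uni, bi, delta)
                if w not in best[j] or cand > best[j][w][0]:
                    best[j][w] = (cand, prev)
    # Backtrack from the best-scoring final word to the start marker.
    word = max(best[n], key=lambda k: best[n][k][0])
    out, j = [], n
    while word != START:
        out.append(word)
        prev = best[j][word][1]
        j -= len(word)
        word = prev
    return out[::-1]

def spans(words):
    """Turn a word list into a set of (start, end) character spans."""
    out, pos = set(), 0
    for w in words:
        out.add((pos, pos + len(w)))
        pos += len(w)
    return out

def precision_recall(pred, gold):
    """P = correct words / predicted words; R = correct words / reference words."""
    p, g = spans(pred), spans(gold)
    if not p or not g:
        return 0.0, 0.0
    correct = len(p & g)
    return correct / len(p), correct / len(g)

if __name__ == "__main__":
    uni, bi = train_bigram(CORPUS)
    result = segment("研究生命起源", uni, bi)
    print(result)  # ['研究', '生命', '起源']
    print(precision_recall(result, ["研究", "生命", "起源"]))  # (1.0, 1.0)
```

On this toy corpus the competing split 研究生 / 命 / 起源 scores far lower, because the smoothed probabilities P(命 | 研究生) and P(起源 | 命) rest on zero observed counts. Single characters are always allowed as fallback words, so every input has at least one valid split.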
