n-gram和数据平滑

整理文档很辛苦,赏杯茶钱您下走!

免费阅读已结束,点击下载阅读编辑剩下 ...

阅读已结束,您可以下载文档离线阅读编辑

资源描述

n-gramchbb@pku.edu.cn(LanguageModeling)sP(s)s1=s2=P(s1)P(s2)P(s)PLPPLOCR1)(=∑∈LssPP(s)P(s)=f(θ1,θ2,…,θn)θis=w1w2…wnP(s)?(Chainrule)P(“JOHNREADABOOK”)=p(“JOHN”)p(“READ|JOHN”)p(“A|JOHNREAD”)p(“BOOK|JOHNREADA”))()()()()(11213121−⋅⋅⋅=llw...w|wpww|wpw|wpwpsP∏=−=liiiw...w|wp111)(“”Johnreada______bookorvegetable?n-1“thelargegreen______.”„“mountain”?“tree”?“Sueswallowedthelargegreen______.”„“pill”?“broccoli”?“Sueswallowed”n-gramp(wi|w1w2…wi-1)n-1p(wi|wi-n+1…wi-1)n-gramn-gramn„„n„„n-gramunigram(n=1)p(wi)2000020000bigram(n=2)p(wi|wi-1)20000200002trigram(n=3)p(wi|wi-2wi-1)20000200003four-gram(n=4)(digramquadrigram)……n-gram:„„tokenization„BOSEOSŠIeat.ÆBOSIeat.EOSŠIsleep.ÆBOSIsleep.EOS„(MLE)c(w1,..,wn)n-gramw1,..,wn)()()(11111n-nn-nMLE,..,wwc,..,wwc,..,w|wwp=BOSJOHNREADMOBYDICKEOSBOSMARYREADADIFFERENTBOOKEOSBOSSHEREADABOOKBYCHEREOS31==)BOS(c)JOHNBOS(c)BOS|JOHN(p11==)JOHN(c)READJOHN(c)JOHN|READ(p32==)READ(c)AREAD(c)READ|A(p21==)A(c)BOOKA(c)A|BOOK(p21==)BOOK(c)EOSBOOK(c)BOOK|EOS(pJOHNREADABOOKP(JOHNREADABOOK)=p(JOHN|BOS)p(READ|JOHN)p(A|READ)p(BOOK|A)p(EOS|BOOK)=0.06bigram(underflow)Ln(P(JOHNREADABOOK))=Ln(p(JOHN|BOS))+Ln(p(READ|JOHN))+Ln(p(A|READ))+Ln(p(BOOK|A))+Ln(p(EOS|BOOK))=Ln(1/3)+Ln(1)+Ln(2/3)+Ln(1/2)+Ln(1/2)=-2.8902EOS(DataSparsness)CHERREADABOOKc(CHERREAD)=0Æp(READ|CHER)0Æp(CHERREADABOOK)=0()MLE0n-gram,n-gram0NLPZipfZipfwfr(wr)fr=k(k)wi50wj150wiwj3TomSawyer„71,370(wordtokens)„8,018(wordtypes)ZipfTomSawyer3993(50%)(wordtypes)10051%(wordtokens)ZipfZipf„k8000-9000„„3„r=100ZipfTomSawyer1-3Zipf0501001502002503003500500100015002000RankFreqZipf19981ZipfZipfZipf„„()ZipfPrincipleofLeasteffort()„„()Zipf„„Balh„150trigram„23%trigramMLE:„„„discounting()„Add-one„Add-delta„Witten-Bell„Good-Turing„Church-Gale„Jelinek-Mercer„Katz„……Add-one(Laplace’slaw)n-gramn-gram:new_count(n-gram)=old_count(n-gram)+1n-gram0Add-oneIwanttoeatChinesefoodlunch…Total(N)I810870130003437want3078606861215to301086030123256eat002019252938Chinese200001201213food1901700001506lunch4000010459…bigramIwanttoeatChinesefoodlunch…TotalI.0023(8/3437).320.0038(13/3437)0001want.00250.650.0049.0066.00491to.000920.0031.26.000920.00371eat00.00210.020.0021.0551Chinese.00940000.56.00471food.0130.01100001lunch.00870000.002201…bigram1stword2ndwordAdd-oneIwanttoeatChinesefoodlunch…Total(N+V)I891087108811411134375053want34178717972831to411186141134872eat11231203532554Chinese3111112121829food2011811113122lunch51111212075bigramIwanttoeatChinesefoodlunch…TotalI.0018(9/5053).22.0002.0028(14/5053).0002.0002.00021want.0014.00035.28.00035.0025.0032.00251to.00082.00021.0023.18.00082.00021.00271eat.00039.00039.0012.00039.0078.0012.0211Chinese.0016.00055.00055.00055.00055.066.00111food.0064.00032.0058.00032.00032.00032.000321lunch.0024.00048.00048.00048.00048.0022.000481bigramp(w1|w2)Add-oneN:n-gram(token)V:n-gram(type)VN…=…1)()(22111Add-onen-gram00n-gramn-gramNLPAdd-onen-gramn-gramn-gram()AP(ChurchandGale,1991)„22,000,000bigrams(token)„273,266(type)(74,674,306,756bigrams(type))„74,671,100,000bigrams„Add-onebigram0.0002950.0008224.2150.0006853.2340.0005482.2430.0004111.2520.0002740.44810.0002950.0000270fadd-onefempiricalfMLEtoohightoolowFreq.fromtrainingdataFreq.fromheld-outdataAdd-onesmoothedfreq.bigram=(74,671,100,000x0.000295)/22,000,000~99.96!!!!Add-delta(Lidstone’slaw)1,1λλ=0.5Add-oneVNλλ++…=…)()(2211V.N.)()(2211++…=…(Held-outEstimation)(Held-outdata)„:Š(trainingset):Š(heldoutdata):n-gramw1...wn:„Ctr(w1...wn)w1...wn„Cho(w1...wn)w1...wn:„r=n-gram„Nr=rn-gram„Tr=rn-gram„T=n-gram(token):∑==})(|{111)(rwwCwwnhorntrnwwCTLLL)(,1)(11ntrrrnhowwCrNTTwwPLL=×=)(1)(11ntrrrnhowwC,rNTTwwPLL=×=rn-gramNrrn-gram„r=510n-gram(types)5:N5=10„10n-gram20:T5=20„2000n-gram(token)0010101200020)5(.withrann-gramPho=×==(DeletedEstimation)…„part0part1„part0part1„part1part0„„(DeletedEstimationortwo-waycrossvalidation)r...ww,cNNNTT...wwpnrrrrndel=++=)()()(11010011Good-Turing:„n-gramn-gramTuringestimate„n-gramrrNNrr*1)1(++=rn-gramr+1n-gramNNNrN*100=Good-Turingrn-gramprr+1n-gramNr+1=0pr=0Nr“”S(.)S(r)Nr(Good-Turingestimate)S(.)?)()1()1(*rSrSrr++=Good-Turing[1]log(r)log(Nr)rNr0NrNrtrq,Nt0,Nr0,Nq0txrrxqNx=0)(5.0qtNZrr−=[1]Church,K.andW.Gale,(1991),AcomparisonoftheenhancedGood-TuringanddeletedestimationmethodsforestimatingprobabilitiesofEnglishbigrams,ComputerSpeechandLanguage,v.5,pp.19-54.Good-Turingr*r(r1)„r*rŠpr=r*/Nn-gram„rr*rr*/r1ŠrpMLE)log()log(rbaNr+=))log((,AaArNbr==Good-Turinglog(Zr)log(r)abb-1r*r1bbbr1r*)r11(rAr)1r(A)1r(NN)1r(r+++=++=+=Good-TuringTuringTuringTuringGood-TuringGood-TuringTuringGood-Turing1.65TuringTuring)1()1()(1212*rrrrTNNNNrVVar++++≈1)1(11≥⋅−=∑≥rpNpNNprunnormrrunnormrrn-gram()„bigramŠjournalofPunsmoothed(of|journal)=0ŠjournalfromPunsmoothed(from|journal)=0ŠjournalneverPunsmoothed(never|journal)=0„,“journalof”.„“of”“from”&“never”„unigramP(of)P(from)P(never)„unigrammodelbigrammodel„bigrammodeltrigrammodel„…n-gramn-gramn-gram(SimpleLinearInterpolation)n-gramPli(wn|wn-2,wn-1)=λ1P(wn)+λ2P(wn|wn-1)

1 / 53
下载文档,编辑使用

©2015-2020 m.777doc.com 三七文档.

备案号:鲁ICP备2024069028号-1 客服联系 QQ:2149211541

×
保存成功