Information Retrieval 7 - Northwestern Polytechnical University Course Materials


Chen Qun
School of Computer, NWPU
11/2016

Information Retrieval: Theory and Practice
IR Components: Document Processing

Today
• Components of IR systems
  – Content Analysis
  – Stemming
• Statistical Properties of Document Collections
  – Statistical Dependence
  – Word Associations

Content Analysis
• Automated transformation of raw text into a form that represents some aspect(s) of its meaning
• Including, but not limited to:
  – Automated Thesaurus Generation
  – Phrase Detection
  – Categorization
  – Clustering
  – Summarization

Techniques for Content Analysis
• Statistical
  – Single Document
  – Full Collection
• Linguistic
  – Syntactic
  – Semantic
• Knowledge-Based (Artificial Intelligence)
• Hybrid (Combinations)

Text Processing
• Standard Steps:
  – Recognize document structure
    • titles, sections, paragraphs, etc.
  – Break into tokens
    • usually space and punctuation delineated
    • special issues with Asian languages
  – Stemming/morphological analysis
  – Store in inverted index

Document Processing Steps
[Figure: document processing steps]

Statistical Properties of Text
• Token occurrences in text are not uniformly distributed
• They are also not normally distributed
• They do exhibit a Zipf distribution

Plotting Word Frequency by Rank
• Main idea: count
  – How many tokens occur 1 time
  – How many tokens occur 2 times
  – How many tokens occur 3 times …
• Now order these according to how often they occur. The position in this ordering is called the rank.

Plotting Word Frequency by Rank
• Say for a text with 100 tokens
• Count
  – How many tokens occur 1 time (50)
  – How many tokens occur 2 times (20) …
  – How many tokens occur 7 times (10) …
  – How many tokens occur 12 times (1)
  – How many tokens occur 14 times (1)
• So things that occur the most often share the highest rank (rank 1).
• Things that occur the fewest times have the lowest rank (rank n).

Many similar distributions…
• Words in a text collection
• Library book checkout patterns
• Incoming Web Page Requests (Nielsen)
• Outgoing Web Page Requests (Cunha & Crovella)
• Document Size on Web (Cunha & Crovella)

Zipf Distribution (linear and log scale)
[Figure: Zipf distribution plotted on linear and log scales]

Zipf Distribution
• The product of the frequency of words (f) and their rank (r) is approximately constant
  – Rank = order of words' frequency of occurrence
• Another way to state this is with an approximately correct rule of thumb:
  – Say the most common term occurs C times
  – The second most common occurs C/2 times
  – The third most common occurs C/3 times
  – …

  f = C × 1/r

  Rank  Freq  Term
     1    37  system
     2    32  knowledg
     3    24  base
     4    20  problem
     5    18  abstract
     6    15  model
     7    15  languag
     8    15  implem
     9    13  reason
    10    13  inform
    11    11  expert
    12    11  analysi
    13    10  rule
    14    10  program
    15    10  oper
    16    10  evalu
    17    10  comput
    18    10  case
    19     9  gener
    20     9  form

The Corresponding Zipf Curve
[Figure: Zipf curve for the rank/frequency table above]

Zoom in on the Knee of the Curve

  Rank  Freq  Term
    43     6  approach
    44     5  work
    45     5  variabl
    46     5  theori
    47     5  specif
    48     5  softwar
    49     5  requir
    50     5  potenti
    51     5  method
    52     5  mean
    53     5  inher
    54     5  data
    55     5  commit
    56     5  applic
    57     4  tool
    58     4  technolog
    59     4  techniqu

Zipf Distribution
• The Important Points:
  – a few elements occur very frequently
  – a medium number of elements have medium frequency
  – many elements occur very infrequently

Resolving Power (van Rijsbergen 79)
The most frequent words are not the most descriptive.

Stemming and Morphological Analysis
• Goal: "normalize" similar words
• Morphology ("form" of words)
  – Inflectional Morphology
    • E.g., inflect verb endings and noun number
    • Never changes grammatical class
      – dog, dogs
  – Derivational Morphology
    • Derives one word from another
    • Often changes grammatical class
      – build, building; health, healthy

Simple "S" stemming
• IF a word ends in "ies", but not "eies" or "aies"
  – THEN "ies" → "y"
• IF a word ends in "es", but not "aes", "ees", or "oes"
  – THEN "es" → "e"
• IF a word ends in "s", but not "us" or "ss"
  – THEN "s" → NULL
Harman, JASIS Jan. 1991

Stemmer Examples

The SMART stemmer
  % tstem ate
  ate
  % tstem apples
  appl
  % tstem formulae
  formul
  % tstem appendices
  appendix
  % tstem implementation
  imple
  % tstem glasses
  glass
  %

The Porter stemmer
  % pstemmer ate
  at
  % pstemmer apples
  appl
  % pstemmer formulae
  formula
  % pstemmer appendices
  appendic
  % pstemmer implementation
  implement
  % pstemmer glasses
  glass
  %

The IAGO! stemmer
  % stem
  ate|2
  eat|2
  apples|1
  apple|1
  formulae|1
  formula|1
  appendices|1
  appendix|1
  implementation|1
  implementation|1
  glasses|1
  glasses|1
  %

Errors Generated by Porter Stemmer (Krovetz 93)

  Too Aggressive         Too Timid
  organization/organ     european/europe
  policy/police          cylinder/cylindrical
  arm/army               search/searcher

Automated Methods
• Stemmers:
  – Very dumb rules work well (for English)
  – Porter Stemmer: Iteratively remove suffixes
  – Improvement: pass results through a lexicon
• Powerful multilingual tools exist for morphological analysis

Chinese Word Segmentation: Why It Is Needed
Each sentence below contains the character sequence of a common word that is not actually a word in that sentence.
• 性感 ("sexy"): 最近小强常常对我说话的真实性感到怀疑 ("Lately Xiaoqiang often doubts the truthfulness of what I say"). The correct split is 真实性/感到, not 性感.
• 白痴 ("idiot"): 李白痴呆地喝着酒,吟着诗 ("Li Bai, in a daze, drank wine and recited poems"). The correct split is 李白/痴呆地, not 白痴.
• 如果 ("if"): 罐头不如果汁营养丰富 ("Canned food is not as nutritious as fruit juice"). The correct split is 不如/果汁, not 如果.

Chinese Word Segmentation Techniques
• Dictionary-based segmentation: match dictionary entries left to right or right to left
• Statistics-based segmentation: association analysis of the surrounding context
• A hybrid approach (dictionary + statistics) is usually used

Chinese Word Segmentation: Where the Difficulty Lies
• Ambiguity resolution
  – 化妆和服装: 化妆/和/服装 ("makeup and clothing") vs. 化妆/和服/装 (和服 = "kimono")
  – 把手: 这个门把手坏了 ("This door handle is broken", 把手 = "handle") vs. 请把手拿开 ("Please take your hand away", 把/手)
  – 乒乓球拍卖完了: 乒乓/球拍/卖完了 ("The ping-pong paddles are sold out") vs. 乒乓球/拍卖/完了 ("The ping-pong ball auction is over")

Chinese Word Segmentation: Where the Difficulty Lies
• New word recognition: personal names, organization names, product names, trademarks, etc.
  – 内塔尼亚/胡说? The string 内塔尼亚胡说 ("Netanyahu says") can be wrongly split into 内塔尼亚 + 胡说 ("talks nonsense").
  – 那英/国人在酒店死了。 The string 那英国人在酒店死了 ("That Briton died in the hotel") can be wrongly split so that it begins with the personal name 那英 (Na Ying).

Assumptions in IR
• Statistical independence of terms
• Dependence approximations

Statistical Independence
Two events x and y are statistically independent if the product of the probabilities of their happening individually equals the probability of their happening together.

  P(x) P(y) = P(x, y)

Statistical Independence and Dependence
• What are examples of things that are statistically independent?
• What are examples of things that are statistically dependent?

Statistical Independence vs. Statistical Dependence
• How likely is a red car to drive by given we've seen a black one?
• How likely is the word "ambulance" to appear, given that we've seen "car accident"?
• Colors of cars driving by are independent (although more frequent colors are more likely)
• Words in text are not independent (although again more frequent words are more likely)

Lexical Associations
• Subjects write the first word that comes to mind
  – doctor/nurse; black/white (Palermo & Jenkins 64)
• Text corpora yield similar associations
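
Example: tokenizing and storing in an inverted index. The "Text Processing" steps (break into tokens, then store in an inverted index) can be sketched roughly as follows. This is a minimal illustration and not the course's implementation: the function names `tokenize` and `build_inverted_index`, the ASCII-only splitting rule, and the two sample documents are all assumptions made for the example, and the stemming step is left out.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split on runs of non-alphanumeric characters.

    Assumption: ASCII letters and digits only; this does not handle the
    "special issues with Asian languages" mentioned in the slides.
    """
    return [t for t in re.split(r"[^0-9A-Za-z]+", text.lower()) if t]


def build_inverted_index(docs):
    """Map each token to the sorted list of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


# Hypothetical two-document collection, for illustration only.
docs = {
    1: "Expert systems, knowledge bases.",
    2: "Knowledge-based reasoning systems",
}
index = build_inverted_index(docs)
```

With these sample documents, `index["knowledge"]` lists both document ids and `index["expert"]` lists only document 1. Without stemming, "bases" and "based" stay separate entries, which is the problem the stemming slides address.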
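
Example: rank/frequency counting and the Zipf rule of thumb. The counting procedure from "Plotting Word Frequency by Rank" and the rule of thumb f = C × 1/r can be sketched as below. The helper names `rank_frequency` and `zipf_estimate` and the sample sentence are assumptions for illustration. Tied terms get consecutive ranks here, as in the slides' rank/frequency tables, rather than sharing a rank.

```python
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, term, freq) triples, most frequent term first.

    Ties receive consecutive ranks in first-seen order.
    """
    counts = Counter(tokens)
    return [
        (rank, term, freq)
        for rank, (term, freq) in enumerate(counts.most_common(), start=1)
    ]


def zipf_estimate(c, rank):
    """Rule-of-thumb frequency at a given rank: f = C / r."""
    return c / rank


# Hypothetical sample text, for illustration only.
tokens = "the cat and the dog and the bird".split()
table = rank_frequency(tokens)

# Compare the rule of thumb with the top of the slides' table, where C = 37.
estimates = [zipf_estimate(37, r) for r in (1, 2, 3)]
```

For the slides' table, where the top term occurs 37 times, the rule of thumb predicts 18.5 occurrences at rank 2, against an observed 32. This shows how approximate the rule is on a small collection.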
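
Example: the simple "S" stemmer. The three rules on the 'Simple "S" stemming' slide can be written as a small rule table. One point is an assumption: the slide does not say what happens when a suffix matches but an exception applies. This sketch leaves such a word unchanged instead of falling through to a later rule. The names `RULES` and `s_stem` are illustrative.

```python
# (suffix, exception suffixes, replacement), tried in slide order.
RULES = [
    ("ies", ("eies", "aies"), "y"),
    ("es", ("aes", "ees", "oes"), "e"),
    ("s", ("us", "ss"), ""),
]


def s_stem(word):
    """Apply the first rule whose suffix matches the word.

    Assumption: if that rule's exception list also matches, the word is
    returned unchanged (no fall-through to later rules).
    """
    for suffix, exceptions, replacement in RULES:
        if word.endswith(suffix):
            if word.endswith(exceptions):
                return word
            return word[: len(word) - len(suffix)] + replacement
    return word
```

Under this reading, `s_stem("ponies")` gives "pony", `s_stem("dogs")` gives "dog", and `s_stem("glass")` is left alone because of the "ss" exception.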
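
Example: dictionary-based segmentation by maximum matching. The left-to-right and right-to-left dictionary matching mentioned in the Chinese word segmentation slides is commonly done as greedy "maximum matching". The sketch below assumes that technique, which the slides do not name, and uses a tiny hypothetical lexicon. The function names and the `max_len` parameter are illustrative. Run on the slides' example 化妆和服装, it shows how the two directions can disagree.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Greedy left-to-right segmentation: take the longest lexicon entry
    starting at the current position, or a single character if none matches."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


def backward_max_match(text, lexicon, max_len=4):
    """Greedy right-to-left segmentation: take the longest lexicon entry
    ending at the current position, or a single character if none matches."""
    tokens, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                j -= size
                break
    tokens.reverse()
    return tokens


# Hypothetical toy lexicon covering the slide's ambiguity example.
LEXICON = {"化妆", "和服", "和", "服装", "装"}
forward = forward_max_match("化妆和服装", LEXICON)
backward = backward_max_match("化妆和服装", LEXICON)
```

With this lexicon, the forward pass yields 化妆/和服/装 and the backward pass yields 化妆/和/服装. A disagreement between the two directions is one simple signal that a span is ambiguous and needs statistical or contextual disambiguation.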
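
Example: checking statistical independence of two terms. The definition P(x) P(y) = P(x, y) can be tested on a collection by estimating each probability as a document fraction. This is a minimal sketch: the function name `cooccurrence_probs` and the four toy documents are invented for illustration and are not data from the course.

```python
def cooccurrence_probs(docs, x, y):
    """Estimate P(x), P(y), and P(x, y) as fractions of documents
    containing term x, term y, and both terms."""
    n = len(docs)
    p_x = sum(x in d for d in docs) / n
    p_y = sum(y in d for d in docs) / n
    p_xy = sum(x in d and y in d for d in docs) / n
    return p_x, p_y, p_xy


# Hypothetical documents, each represented as a set of terms.
docs = [
    {"car", "accident", "ambulance"},
    {"car", "accident", "ambulance"},
    {"car", "dealer"},
    {"weather", "report"},
]
p_x, p_y, p_xy = cooccurrence_probs(docs, "accident", "ambulance")
independent = abs(p_x * p_y - p_xy) < 1e-9
```

In this toy collection "accident" and "ambulance" each appear in half of the documents and always together, so P(x) P(y) = 0.25 while P(x, y) = 0.5. The terms are dependent, matching the slides' point that words in text are not independent.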
