stanford大学-大数据挖掘-RelationExtraction-18

整理文档很辛苦,赏杯茶钱您下走!

免费阅读已结束,点击下载阅读编辑剩下 ...

阅读已结束,您可以下载文档离线阅读编辑

资源描述

CS345DataMiningMiningtheWebforStructuredDataOurviewofthewebsofar…WebpagesasatomicunitsGreatforsomeapplicationse.g.,ConventionalwebsearchButnotalwaystherightmodelGoingbeyondwebpagesQuestionansweringWhatistheheightofMtEverest?WhokilledAbrahamLincoln?RelationExtractionFindallcompany,CEOpairsVirtualDatabasesAnswerdatabase-likequeriesoverwebdataE.g.,FindallsoftwareengineeringjobsinFortune500companiesQuestionAnsweringE.g.,WhokilledAbrahamLincoln?NaïvealgorithmFindallwebpagescontainingtheterms“killed”and“AbrahamLincoln”incloseproximityExtractk-gramsfromasmallwindowaroundthetermsFindthemostcommonlyoccuringk-gramsQuestionAnsweringNaïvealgorithmworksfairlywell!SomeimprovementsUsesentencestructuree.g.,restricttonounphrasesonlyRewritequestionsbeforematching“WhatistheheightofMtEverest”becomes“TheheightofMtEverestisblank”ThenumberofpagesanalyzedismoreimportantthanthesophisticationoftheNLPForsimplequestionsReference:DumaisetalRelationExtractionFindpairs(title,author)WheretitleisthenameofabookE.g.,(Foundation,IsaacAsimov)Findpairs(company,hq)E.g.,(Microsoft,Redmond)Findpairs(abbreviation,expansion)(ADA,AmericanDentalAssociation)Canalsohavetupleswith2componentsRelationExtractionAssumptions:NosinglesourcecontainsallthetuplesEachtupleappearsonmanywebpagesComponentsoftupleappear“close”togetherFoundation,byIsaacAsimovIsaacAsimov’smasterpiece,theemFoundation/emtrilogyTherearerepeatedpatternsinthewaytuplesarerepresentedonwebpagesNaïveapproachStudyafewwebsitesandcomeupwithasetofpatternse.g.,regularexpressionsletter=[A-Za-z.]title=letter{5,40}author=letter{10,30}b(title)/bby(author)ProblemswithnaïveapproachApatternthatworksononewebpagemightproducenonsensewhenappliedtoanotherSopatternsneedtobepage-specific,oratleastsite-specificImpossibleforahumantoexhaustivelyenumeratepatternsforeveryrelevantwebsiteWillresultinlowcoverageBetterapproach(Brin)ExploitdualitybetweenpatternsandtuplesFindtuplesthatmatchasetofpatternsFindpatternsthatmatchalotoftuplesDIPRE(DualIterativePatternRelationExtraction)PatternsTuplesMatchGenerateDIPREAlgorithm1.RÃSampleTuplese.g.,asmallsetoftitle,authorpairs2.OÃFindOccurrences(R)OccurrencesoftuplesonwebpagesKeepsomesurroundingcontext3.PÃGenPatterns(O)LookforpatternsinthewaytuplesoccurMakesurepatternsarenottoogeneral!4.RÃMatchingTuples(P)5.ReturnorgobacktoStep2Occurrencese.g.,TitlesandauthorsRestricttocaseswhereauthorandtitleappearincloseproximityonwebpagelibFoundation/bbyIsaacAsimov(1951)url=order=[title,author](or[author,title])denoteas0or1prefix=“lib”(limittoe.g.,10characters)middle=“/bby”suffix=“(1951)”occurrence=(’Foundation’,’IsaacAsimov’,url,order,prefix,middle,suffix)PatternslibFoundation/bbyIsaacAsimov(1951)pbNightfall/bbyIsaacAsimov(1941)order=[title,author](say0)sharedprefix=bsharedmiddle=/bbysharedsuffix=(19pattern=(order,sharedprefix,sharedmiddle,sharedsuffix)URLPrefixPatternsmaybespecifictoawebsiteOrevenpartsofitAddurlprefixcomponenttopattern:libFoundation/bbyIsaacAsimov(1951):pbNightfall/bbyIsaacAsimov(1941)sharedurlprefix==(urlprefix,order,prefix,middle,suffix)GeneratingPatterns1.Groupoccurencesbyorderandmiddle2.LetO=setofoccurenceswiththesameorderandmiddlepattern.order=O.orderpattern.middle=O.middlepattern.urlprefix=longestcommonprefixofallurlsinOpattern.prefix=longestcommonprefixofoccurrencesinOpattern.suffix=longestcommonsuffixofoccurrencesinOExample:libFoundation/bbyIsaacAsimov(1951):pbNightfall/bbyIsaacAsimov(1941)order=[title,author]middle=“/bby”urlprefix=prefix=“b”suffix=“(19”Example:Foundation,byIsaacAsimov,hasbeenhailed…:Nightfall,byIsaacAsimov,tellsthetaleof…order=[title,author]middle=“,by”urlprefix=prefix=“”suffix=“,”PatternSpecificityWewanttoavoidgeneratingpatternsthataretoogeneralOneapproach:Forpatternp,definespecificity=|urlprefix||middle||prefix||suffix|Supposen(p)=numberofoccurencesthatmatchthepatternpDiscardpatternswheren(p)nminDiscardpatternspwherespecificity(p)n(p)thresholdPatternGenerationAlgorithm1.Groupoccurencesbyorderandmiddle2.LetO=asetofoccurenceswiththesameorderandmiddle3.p=GeneratePattern(O)4.Ifpmeetsspecificityrequirements,addptosetofpatterns5.Otherwise,trytosplitOintomultiplesubgroupsbyextendingtheurlprefixbyonecharacterIfalloccurencesinOarefromthesameURL,wecannotextendtheurlprefix,sowediscardOExtendingtheURLprefixSupposeOcontainsoccurencesfromurlsoftheform://=

1 / 35
下载文档,编辑使用

©2015-2020 m.777doc.com 三七文档.

备案号:鲁ICP备2024069028号-1 客服联系 QQ:2149211541

×
保存成功