Sa Shixuan International Big Data Analysis and Research Center (萨师煊国际大数据分析与研究中心)

Large-Scale Distributed Information Retrieval on the Web

Weiyi Meng (孟卫一)
Department of Computer Science
State University of New York at Binghamton
July 9, 2012
Sa Shixuan International Big Data Analysis and Research Center, Summer Research Camp Seminar

About SUNY-Binghamton
- Founded in 1946 after WWII.
- Located in Binghamton, a city in the Southern Tier of New York State.
- About 15,000 students (3,000 grad students).
- IBM was founded in Binghamton.
- One of the 4 University Centers of the SUNY system: SUNY at Stony Brook, SUNY at Buffalo, SUNY at Albany.
- For more information, see

Information retrieval (IR) is a computer science discipline for finding unstructured data (usually text documents) that satisfy an information need from within large collections that are stored on computers.
In this seminar, we are going to extend this definition to include both unstructured and structured data.

What is Distributed Information Retrieval (DIR)?
- It is a special branch of information retrieval where the data of the IR system are stored in multiple distributed locations/collections.
- In the Web environment, DIR deals with data that are distributed across many websites or web servers.
- Related terms for DIR: metasearch engine, federated search, web DB integration system.

The Scale: How Large?
- It can be as large as the number of data sources on the Web.
- A 2007 survey (Madhavan et al. 2007) indicates there were about 50 million searchable Web data sources in 2007.
  - 25 million for un- or less structured data (web pages, weibo, ...)
  - 25 million for structured data (web databases)

Where do Web data reside?
Iceberg structure:
- A small fraction is on the Surface Web, with mostly static web pages that are crawlable by following hyperlinks.
  - Publicly indexable portion: 40-60 billion pages
- Most are in the Deep Web, with both structured data and less structured text documents hidden behind numerous search interfaces.
  - About 1 trillion pages/records

Two paradigms to provide integrated access to Web data
- Crawling-based: Gather Web data from various Web servers and/or search engines and build a search index for the gathered data.
  - Surface Web crawling
  - Deep Web crawling
- Metasearching-based (DIR-based): Integrate existing search engines into federated systems.
  - Metasearching text documents
  - Metasearching structured data by domain

Advantages of each approach
Crawling-based:
- Complete control on crawled data:
  - Can add metadata
  - Can link data from different sources in advance
  - Can create an archive gradually
- Complete control on retrieving techniques and ranking functions
- Fast response time
Metasearching-based:
- Capabilities of search engines can be leveraged
- Natural clustering of the data by individual search engines can be utilized
- Three-level query evaluation process (SE selection, SE retrieval, result merging) can lead to better effectiveness
- More likely to obtain fresher results

Disadvantages of each approach
Crawling-based:
- Deep Web crawling difficult
- Often incomplete
- Many sites not crawlable
- Lose semantics/structure of the data
- Cannot leverage search engines' capabilities
- Crawling delay leads to less up-to-date results
- Copyright and privacy issues
Metasearching-based:
- Performance depends on the quality of used search engines
- May cause search engines to crash
- Access could be blocked by search engines
- No direct control of the data
- Slower response time

Conclusions?
- Both technologies (crawling-based and metasearching-based) have unique values and they should co-exist.
- They actually complement each other!
- Question: Is there an effective way to combine both technologies into a single platform?
- Our seminar will focus on the metasearching (DIR)-based approach.

Two types of metasearching systems
Because structured and unstructured data have very different characteristics, they are often handled separately with different technologies.
- Metasearching systems for text documents (metasearch engines or DIR systems).
- Metasearching systems for structured data, each for a given domain (Web database integration systems).
We will first introduce large-scale metasearch engines and then introduce large-scale Web database integration systems.
Due to limited time, we will focus on challenges and remaining challenges, not on current solutions.

Large-Scale Metasearch Engines (MSE)

[Figure: A simple MSE architecture. Components shown: user, user interface, query dispatcher, result merger, search engine 1, search engine 2, ..., search engine n, each over its own text source 1, 2, ..., n; arrows labeled "query" and "result".]

What is a large-scale MSE?
A large-scale metasearch engine needs to satisfy ALL of the following requirements:
- It is a metasearch engine.
- It is connected to a large number of (thousands or more) component search engines.
- The component search engines are special-purpose search engines:
  - Covering a specific domain: news, sports, medicine, ...
  - Covering a specific organization: RenDa, IBM, ACM, ...
Why the third requirement?
- To retain the advantages on freshness and searching the deep Web.

Technical challenges with large-scale MSE
Scalable and accurate search engine selection
- Most search engines are useless for a given user query.
  - Best 10 results, 10,000 search engines: at least 9990 useless.
- Using useless search engines is bad:
  - Unnecessary network traffic
  - Waste resources of local search engines
  - Incur higher cost at the metasearch engine
  - Lead to poor effectiveness
- How to identify the most appropriate search engines for any given query accurately and in a timely manner?
  - How to summarize a search engine content (representative)?
  - How to collect the representative?
  - How to use the representatives to perform selection?

Technical challenges (cont.)
Automatic search engine inclusion into metasearch engine
- Automatic connection to search engines (automatic connecti
