Search Engine Analysis: Google (Chinese-English Bilingual)

The Anatomy of a Large-Scale Hypertextual Web Search Engine

Sergey Brin and Lawrence Page
{sergey, page}@cs.stanford.edu
Computer Science Department, Stanford University, Stanford, CA 94305

Abstract

In this paper, we present Google, a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext. Google is designed to crawl and index the Web efficiently and produce much more satisfying search results than existing systems. The prototype with a full text and hyperlink database of at least 24 million pages is available at:

Keywords: World Wide Web, Search Engines, Information Retrieval, PageRank, Google

1. Introduction

(Note: There are two versions of this paper -- a longer full version and a shorter printed version. The full version is available on the web and the conference CD-ROM.)

The web creates new challenges for information retrieval. The amount of information on the web is growing rapidly, as well as the number of new users inexperienced in the art of web research. People are likely to surf the web using its link graph, often starting with high quality human maintained indices such as Yahoo! or with search engines. Human maintained lists cover popular topics effectively but are subjective, expensive to build and maintain, slow to improve, and cannot cover all esoteric topics. Automated search engines that rely on keyword matching usually return too many low quality matches. To make matters worse, some advertisers attempt to gain people's attention by taking measures meant to mislead automated search engines. We have built a large-scale search engine which addresses many of the problems of existing systems. It makes especially heavy use of the additional structure present in hypertext to provide much higher quality search results. We chose our system name, Google, because it is a common spelling of googol, or 10^100 and fits well with our goal of building very large-scale search engines.

1.1 Web Search Engines -- Scaling Up: 1994 - 2000

Search engine technology has had to scale dramatically to keep up with the growth of the web. In 1994, one of the first web search engines, the World Wide Web Worm () [McBryan 94] had an index of 110,000 web pages and web accessible documents. As of November, 1997, the top search engines claim to index from 2 million (WebCrawler) to 100 million web documents (from Search Engine Watch). It is foreseeable that by the year 2000, a comprehensive index of the Web will contain over a billion documents. At the same time, the number of queries search engines handle has grown incredibly too. In March and April 1994, the World Wide Web Worm received an average of about 1500 queries per day. In November 1997, Altavista claimed it handled roughly 20 million queries per day. With the increasing number of users on the web, and automated systems which query search engines, it is likely that top search engines will handle hundreds of millions of queries per day by the year 2000. The goal of our system is to address many of the problems, both in quality and scalability, introduced by scaling search engine technology to such extraordinary numbers.

1.2. Google: Scaling with the Web

Creating a search engine which scales even to today's web presents many challenges. Fast crawling technology is needed to gather the web documents and keep them up to date. Storage space must be used efficiently to store indices and, optionally, the documents themselves. The indexing system must process hundreds of gigabytes of data efficiently. Queries must be handled quickly, at a rate of hundreds to thousands per second.

These tasks are becoming increasingly difficult as the Web grows. However, hardware performance and cost have improved dramatically to partially offset the difficulty. There are, however, several notable exceptions to this progress such as disk seek time and operating system robustness. In designing Google, we have considered both the rate of growth of the Web and technological changes. Google is designed to scale well to extremely large data sets. It makes efficient use of storage space to store the index. Its data structures are optimized for fast and efficient access (see section 4.2). Further, we expect that the cost to index and store text or HTML will eventually decline relative to the amount that will be available (see Appendix B). This will result in favorable scaling properties for centralized systems like Google.

1.3 Design Goals

1.3.1 Improved Search Quality

Our main goal is to improve the quality of web search engines. In 1994, some people believed that a complete search index would make it possible to find anything easily. According to Best of the Web 1994 -- Navigators, The best navigation service should make it easy to find almos
