GoogleClusterComputingFacultyTrainingWorkshopModuleIII:NutchThispresentation©MichaelCafarellaRedistributedundertheCreativeCommonsAttribution3.0license.Meta-detailsBuilttoencouragepublicsearchworkOpen-source,w/pluggablemodulesCheaptorun,bothmachines&adminsGoal:Searchmorepages,withbetterquality,thananyotherenginePrettygoodrankingHasdone~200Mpages,morepossibleHadoopisaspinoffOutlineNutchdesignLinkdatabase,fetcher,indexer,etc…HadoopsupportDistributedfilesystem,jobcontrolWebDBFetcher2ofNFetcher1ofNFetcher0ofNFetchlist2ofNFetchlist1ofNFetchlist0ofNUpdate2ofNUpdate1ofNUpdate0ofNContent0ofNContent0ofNContent0ofNIndexer2ofNIndexer1ofNIndexer0ofNSearcher2ofNSearcher1ofNSearcher0ofNWebServer2ofMWebServer1ofMWebServer0ofMIndex2ofNIndex1ofNIndex0ofNInjectMovingPartsAcquisitioncycleWebDBFetcherIndexgenerationIndexingLinkanalysis(maybe)ServingresultsWebDBContainsinfoonallpages,linksURL,lastdownload,#failures,linkscore,contenthash,refcountingSourcehash,targetURLMustalwaysbeconsistentDesignedtominimizediskseeks19msseektimex200mnewpages/mo=~44daysofdiskseeks!Single-diskWebDBwashugeheadacheFetcherFetcherisverystupid.Nota“crawler”Pre-MapRed:divide“to-fetchlist”intokpieces,oneforeachfetchermachineURLsforonedomaingotosamelist,otherwiserandom“Politeness”w/ointer-fetcherprotocolsCanobserverobots.txtsimilarlyBetterDNS,robotscachingEasyparallelismTwooutputs:pages,WebDBedits2.Sortedits(externally,ifnecessary)WebDB/FetcherUpdatesURL::NeverContentHash:NoneURL::NeverContentHash:NoneURL::4/07/05ContentHash:MD5_toewkekqmekkalekaaURL::3/22/05ContentHash:MD5_sdflkjweroiwelksdEdit:DOWNLOAD_CONTENTURL::MD5_balboglerropewolefbagEdit:DOWNLOAD_CONTENTURL::MD5_toewkekqmekkalekaaEdit:NEW_LINKURL::NoneWebDBFetcheredits1.Writedownfetcheredits3.Readstreamsinparallel,emittingnewdatabase4.RepeatforothertablesURL::Today!ContentHash:MD5_balboglerropewolefbagURL::Today!ContentHash:MD5_toewkekqmekkalekaaIndexingIteratethroughallkpagesetsinparallel,constructinginvertedindexCreatesa“searchabledocument”of:URLtextContenttextIncominganchortextOthercontenttypesmighthaveadifferentdocumentfieldsEg,emailhassender/receiverAnysearchablefieldend-userwillwantUsesLucenetextindexerLinkanalysisApage’srelevancedependsonbothintrinsicandextrinsicfactorsIntrinsic:pagetitle,URL,textExtrinsic:anchortext,linkgraphPageRankismostfamousofmanyOthersinclude:HITSOPICSimpleincominglinkcountLinkanalysisissexy,butimportancegenerallyoverstatedLinkanalysis(2)NutchperformsanalysisinWebDBEmitascoreforeachknownpageAtindextime,incorporatescoreintoinvertedindexExtremelytime-consumingInourcase,disk-consuming,too(becausewewanttouselow-memorymachines)Fastandeasy:0.5*log(#incominglinks)“britney”QueryProcessingDocs0-1MDocs1-2MDocs2-3MDocs3-4MDocs4-5MDs2.3M,2.9M1.2M,4.4M,29,…AdministeringNutchAdmincostsarecriticalIt’sahasslewhenyouhave25machinesGooglehas100k,probablymoreFilesWebDBcontent,workingfilesFetchlists,fetchedpagesLinkanalysisoutputs,workingfilesInvertedindicesJobsEmitfetchlists,fetch,updateWebDBRunlinkanalysisBuildinvertedindicesAdministeringNutch(2)Adminsoundsboring,butit’snot!ReallyIswearLarge-filemaintenanceGoogleFileSystem(Ghemawat,Gobioff,Leung)NutchDistributedFileSystemJobControlMap/Reduce(DeanandGhemawat)Pig(YahooResearch)DataStorage(BigTable)NutchDistributedFileSystemSimilar,butnotidentical,toGFSRequirementsarefairlystrangeExtremelylargefilesMostfilesreadonce,fromstarttoendLowadmincostsperGBEquallystrangedesignWrite-once,withdeleteSinglefilecanexistacrossmanymachinesWhollyautomaticfailurerecoveryNDFS(2)DatadividedintoblocksBlockscanbecopied,replicatedDatanodesholdandserveblocksNamenodeholdsmetainfoFilenameblocklistBlockdatanode-locationDatanodesreportintonamenodeeveryfewsecondsNDFSFileReadNamenodeDatanode0Datanode1Datanode2Datanode3Datanode4Datanode51.Clientasksdatanodeforfilenameinfo2.Namenoderespondswithblocklist,andlocation(s)foreachblock3.Clientfetcheseachblock,insequence,fromadatanode“crawl.txt”(block-33/datanodes1,4)(block-95/datanodes0,2)(block-65/datanodes1,4,5)NDFSReplicationNamenodeDatanode0(33,95)Datanode1(46,95)Datanode2(33,104)Datanode3(21,33,46)Datanode4(90)Datanode5(21,90,104)1.Alwayskeepatleastkcopiesofeachblk2.Imaginedatanode4dies;blk90lost3.Namenodelosesheartbeat,decrementsblk90’sreferencecount.Asksdatanode5toreplicateblk90todatanode04.ChoosingreplicationtargetistrickyMap/ReduceMap/ReduceisprogrammingmodelfromLisp(andotherplaces)EasytodistributeacrossnodesNiceretry/failuresemanticsmap(key,val)isrunoneachiteminsetemitskey/valpairsreduce(key,vals)isrunforeachuniquekeyemittedbymap()emitsfinaloutputManypro