Dissertation for the Master's Degree in Engineering

DISTRIBUTED WEB CRAWLER TECHNOLOGY BASED ON HADOOP

Zheng Bowen
Harbin Institute of Technology
June 2011

Classified Index (Chinese Library Classification): TP391.2
U.D.C.: 681.37
School Code: 10213
Security Level: Public

Candidate: Zheng Bowen
Supervisor: Prof. Zhao Tiejun
Academic Degree Applied for: Master of Engineering
Speciality: Computer Science and Technology
Affiliation: School of Computer Science and Technology
Date of Defence: June 2011
Degree-Conferring Institution: Harbin Institute of Technology

Abstract

We live in an age of information explosion. With the rapid growth of the Internet industry, the volume of information increases exponentially every year, and the demand for access to information anytime and anywhere grows with it. This demand has driven the development of cloud computing. Against this background, large organizations such as Google, IBM, Apache, and Amazon have invested heavily in cloud computing. Among their offerings, the Hadoop platform developed by Apache is a very user-friendly open-source cloud computing framework. The distributed crawler system developed in this thesis is designed and implemented on this framework.

The purpose of this thesis is to design and implement a Hadoop-based distributed crawler system for large-scale data collection. The crawler targets mainstream news websites in 27 languages. It performs whole-site collection, that is, it fetches all the information on the websites corresponding to the seed URLs for the 27 languages. In addition, the information for each of the 27 languages is stored separately to facilitate later cross-language processing.

The research part of this work covers an introduction to cloud computing, an introduction to the Hadoop distributed platform, the principles of web crawlers, and a survey of the current state of distributed crawlers. First, the definition, principles, and architecture of cloud computing are surveyed. Then the distributed file system (HDFS) and the distributed computing model (Map/Reduce) of the Hadoop platform are studied in depth. Next, the principles of crawler systems are described in order to understand the process required to develop a crawler. Finally, the current state of development of distributed crawler systems is surveyed.

This research provides the technical foundation for the thesis. On this basis, the thesis proposes a design for a Hadoop-based distributed web crawler system, including the basic workflow of the crawler, the framework design, the division into functional modules, and the Map/Reduce design of each module. Building on this outline design, the thesis gives the detailed design of the system and implements the entire system, including the data storage structure, the overall data structures of the crawler, and each functional module. Finally, the thesis is summarized in detail.

The significance of this thesis lies in the implementation of a Hadoop-based distributed crawler system. The system adopts the Map/Reduce computing framework, which is consistent with the distributed framework of the overall project. It addresses the low efficiency and poor scalability of single-machine crawlers, increases the speed of information collection, and expands its scale. It also provides data for the indexing module and the information processing module of a distributed cross-language information access and retrieval platform.

Keywords: distributed crawler; Hadoop; HDFS; Map/Reduce

Contents

Abstract (in Chinese)
Abstract (in English)
Chapter 1 Introduction
1.1 Source of the Project
1.2 Research Background and Significance
1.3 Main Work and Content of This Thesis
1.4 Organization and Structure of This Thesis
Chapter 2 Research on Related Technologies
2.1 Cloud Computing Background