北风网 Hadoop in Action

Resource description

MEAP Edition
Manning Early Access Program
Copyright 2010 Manning Publications
For more information on this and other Manning titles go to

– A Distributed Programming Framework

Part 1 of this book introduces the basics for understanding and using Hadoop. We describe the hardware components that make up a Hadoop cluster, as well as the installation and configuration to create a working system. We cover the MapReduce framework at a high level and get your first MapReduce program up and running.

1 Introducing Hadoop

This chapter covers
■ The basics of writing a scalable, distributed data-intensive program
■ Understanding Hadoop and MapReduce
■ Writing and running a basic MapReduce program

Today, we’re surrounded by data. People upload videos, take pictures on their cell phones, text friends, update their Facebook status, leave comments around the web, click on ads, and so forth. Machines, too, are generating and keeping more and more data. You may even be reading this book as digital data on your computer screen, and certainly your purchase of this book is recorded as data with some retailer.¹

¹ Of course, you’re reading a legitimate copy of this, right?

The exponential growth of data first presented challenges to cutting-edge businesses such as Google, Yahoo, Amazon, and Microsoft. They needed to go through terabytes and petabytes of data to figure out which websites were popular, what books were in demand, and what kinds of ads appealed to people. Existing tools were becoming inadequate to process such large data sets. Google was the first to publicize MapReduce, a system they had used to scale their data processing needs.

This system aroused a lot of interest because many other businesses were facing similar scaling challenges, and it wasn’t feasible for everyone to reinvent their own proprietary tool. Doug Cutting saw an opportunity and led the charge to develop an open source version of this MapReduce system called Hadoop. Soon after, Yahoo and others rallied around to support this effort. Today, Hadoop is a core part of the computing infrastructure for many web companies, such as Yahoo, Facebook, LinkedIn, and Twitter. Many more traditional businesses, such as media and telecom, are beginning to adopt this system too. Our case studies in chapter 12 will describe how companies including New York Times, China Mobile, and IBM are using Hadoop.

Hadoop, and large-scale distributed data processing in general, is rapidly becoming an important skill set for many programmers. An effective programmer, today, must have knowledge of relational databases, networking, and security, all of which were considered optional skills a couple decades ago. Similarly, basic understanding of distributed data processing will soon become an essential part of every programmer’s toolbox. Leading universities, such as Stanford and CMU, have already started introducing Hadoop into their computer science curriculum. This book will help you, the practicing programmer, get up to speed on Hadoop quickly and start using it to process your data sets.

This chapter introduces Hadoop more formally, positioning it in terms of distributed systems and data processing systems. It gives an overview of the MapReduce programming model. A simple word counting example with existing tools highlights the challenges around processing data at large scale. You’ll implement that example using Hadoop to gain a deeper appreciation of Hadoop’s simplicity. We’ll also discuss the history of Hadoop and some perspectives on the MapReduce paradigm. But let me first briefly explain why I wrote this book and why it’s useful to you.

1.1 Why “Hadoop in Action”?

Speaking from experience, I first found Hadoop to be tantalizing in its possibilities, yet frustrating to progress beyond coding the basic examples. The documentation at the official Hadoop site is fairly comprehensive, but it isn’t always easy to find straightforward answers to straightforward questions.

The purpose of writing the book is to address this problem. I won’t focus on the nitty-gritty details. Instead I will provide the information that will allow you to quickly create useful code, along with more advanced topics most often encountered in practice.

1.2 What is Hadoop?

Formally speaking, Hadoop is an open source framework for writing and running distributed applications that process large amounts of data. Distributed computing is a wide and varied field, but the key distinctions of Hadoop are that it is
■ Accessible: Hadoop runs on large clusters of commodity machines or on cloud computing services such as Amazon’s Elastic Compute Cloud (EC2).
■ Robust: Because it is intended to run on commodity hardware, Hadoop is architected with the assumption of frequent hardware malfunctions. It can gracefully handle most such failures.
■ Scalable: Hadoop scales linearly to handle larger data by adding more nodes to the cluster.
■ Simple: Hadoop allows users to quickly write efficient parallel code.

Figure 1.1 A Hadoop cluster has many parallel machines that store and process large data sets. Client computers send jobs into this computer cloud and obtain results.

Hadoop’s accessibility and simplicity give it an edge over writing and running large distributed programs. Even college students can quickly and cheaply create
