MEAP Edition
Manning Early Access Program
Copyright 2010 Manning Publications
For more information on this and other Manning titles go to

A Distributed Programming Framework

Part 1 of this book introduces the basics for understanding and using Hadoop. We describe the hardware components that make up a Hadoop cluster, as well as the installation and configuration to create a working system. We cover the MapReduce framework at a high level and get your first MapReduce program up and running.

1 Introducing Hadoop

This chapter covers
■ The basics of writing a scalable, distributed data-intensive program
■ Understanding Hadoop and MapReduce
■ Writing and running a basic MapReduce program

Today, we’re surrounded by data. People upload videos, take pictures on their cell phones, text friends, update their Facebook status, leave comments around the web, click on ads, and so forth. Machines, too, are generating and keeping more and more data. You may even be reading this book as digital data on your computer screen, and certainly your purchase of this book is recorded as data with some retailer.[1]

[1] Of course, you’re reading a legitimate copy of this, right?

The exponential growth of data first presented challenges to cutting-edge businesses such as Google, Yahoo, Amazon, and Microsoft. They needed to go through terabytes and petabytes of data to figure out which websites were popular, what books were in demand, and what kinds of ads appealed to people. Existing tools were becoming inadequate to process such large data sets. Google was the first to publicize MapReduce, a system they had used to scale their data processing needs.

This system aroused a lot of interest because many other businesses were facing similar scaling challenges, and it wasn’t feasible for everyone to reinvent their own proprietary tool. Doug Cutting saw an opportunity and led the charge to develop an open source version of this MapReduce system called Hadoop. Soon after, Yahoo and others rallied around to support this effort. Today, Hadoop is a core part of the computing infrastructure for many web companies, such as Yahoo, Facebook, LinkedIn, and Twitter. Many more traditional businesses, such as media and telecom, are beginning to adopt this system too. Our case studies in chapter 12 will describe how companies including the New York Times, China
Mobile, and IBM are using Hadoop.

Hadoop, and large-scale distributed data processing in general, is rapidly becoming an important skill set for many programmers. An effective programmer today must have knowledge of relational databases, networking, and security, all of which were considered optional skills a couple of decades ago. Similarly, a basic understanding of distributed data processing will soon become an essential part of every programmer’s toolbox. Leading universities, such as Stanford and CMU, have already started introducing Hadoop into their computer science curriculum. This book will help you, the practicing programmer, get up to speed on Hadoop quickly and start using it to process your data sets.

This chapter introduces Hadoop more formally, positioning it in terms of distributed systems and data processing systems. It gives an overview of the MapReduce programming model. A simple word counting example with existing tools highlights the challenges around processing data at large scale. You’ll implement that example using Hadoop to gain a deeper appreciation of Hadoop’s simplicity. We’ll also discuss the history of Hadoop and some perspectives on the MapReduce paradigm. But let me first briefly explain why I wrote this book and why it’s useful to you.

1.1 Why “Hadoop in Action”?

Speaking from experience, I first found Hadoop to be tantalizing in its possibilities, yet frustrating to progress beyond coding the basic examples. The documentation at the official Hadoop site is fairly comprehensive, but it isn’t always easy to find straightforward answers to straightforward questions.

The purpose of writing the book is to address this problem. I won’t focus on the nitty-gritty details. Instead I will provide the information that will allow you to quickly create useful code, along with more advanced topics most often encountered in practice.

1.2 What is Hadoop?

Formally speaking, Hadoop is an open source framework for writing and running distributed applications that process large amounts of data. Distributed computing is a wide and varied field, but the key distinctions of Hadoop are that it is
■ Accessible: Hadoop runs on large clusters of commodity machines or on cloud computing services such as Amazon’s Elastic Compute Cloud (EC2).
■ Robust: Because it is intended to run on commodity hardware, Hadoop is architected with the assumption of frequent hardware malfunctions. It can gracefully handle most such failures.
■ Scalable: Hadoop scales linearly to handle larger data by adding more nodes to the cluster.
■ Simple: Hadoop allows users to quickly write efficient parallel code.

Figure 1.1 A Hadoop cluster has many parallel machines that store and process large data sets. Client computers send jobs into this computer cloud and obtain results.

Hadoop’s accessibility and simplicity give it an edge over writing and running large distributed programs. Even college students can quickly and cheaply create