Hadoop, Pig, and Twitter Presentation

Hadoop and Pig @ Twitter
Kevin Weil -- @kevinweil
Analytics Lead, Twitter
Friday, July 23, 2010

Agenda
‣ Hadoop Overview
‣ Pig: Rapid Learning Over Big Data
‣ Data-Driven Products
‣ Hadoop/Pig and Analytics

My Background
‣ Mathematics and Physics at Harvard, Physics at Stanford
‣ Tropos Networks (city-wide wireless): mesh routing algorithms, GBs of data
‣ Cooliris (web media): Hadoop and Pig for analytics, TBs of data
‣ Twitter: Hadoop, Pig, HBase, Cassandra, machine learning, visualization, social graph analysis, soon to be PBs of data

Data is Getting Big
‣ NYSE: 1 TB/day
‣ Facebook: 20+ TB compressed/day
‣ CERN/LHC: 40 TB/day (15 PB/year)
‣ And growth is accelerating
‣ Need multiple machines, horizontal scalability

Hadoop
‣ Distributed file system (hard to store a PB)
‣ Fault-tolerant, handles replication, node failure, etc
‣ MapReduce-based parallel computation (even harder to process a PB)
‣ Generic key-value based computation interface allows for wide applicability

Hadoop
‣ Open source: top-level Apache project
‣ Scalable: Y! has a 4000-node cluster
‣ Powerful: sorted a TB of random integers in 62 seconds
‣ Easy Packaging: Cloudera RPMs, DEBs

MapReduce Workflow
‣ Challenge: how many tweets per user, given tweets table?
‣ Input: key=row, value=tweet info
‣ Map: output key=user_id, value=1
‣ Shuffle: sort by user_id
‣ Reduce: for each user_id, sum
‣ Output: user_id, tweet count
‣ With 2x machines, runs 2x faster
[Diagram: Inputs → Map (×7) → Shuffle/Sort → Reduce (×3) → Outputs]

But...
‣ Analysis typically in Java
‣ Single-input, two-stage data flow is rigid
‣ Projections, filters: custom code
‣ Joins are lengthy, error-prone
‣ Hard to manage n-stage jobs
‣ Exploration requires compilation!

Enter Pig
‣ High level language
‣ Transformations on sets of records
‣ Process data one step at a time
‣ Easier than SQL?
‣ Top-level Apache project

Why Pig?
‣ Because I bet you can read the following script.

A Real Pig Script

Now, just for fun...
‣ The same calculation in vanilla MapReduce

No, seriously.

Pig Democratizes Large-scale Data Analysis
‣ The Pig version is:
‣ 5% of the code
‣ 5% of the development time
‣ Within 25% of the execution time
‣ Readable, reusable

One Thing I've Learned
‣ It's easy to answer questions
‣ It's hard to ask the right questions
‣ Value the system that promotes innovation and iteration

MySQL, MySQL, MySQL
‣ We all start there.
‣ But MySQL is not built for analysis.
‣ select count(*) from users? Maybe.
‣ select count(*) from tweets? Uh...
‣ Imagine joining them.
‣ And grouping.
‣ Then sorting.

Non-Pig Hadoop at Twitter
‣ Data Sink via Scribe
‣ Distributed Grep
‣ A few performance-critical, simple jobs
‣ People Search

People Search?
‣ First real product built with Hadoop
‣ "Find People"
‣ Old version: offline process on a single node
‣ New version: complex graph calculations, hit internal network services, custom indexing
‣ Faster, more reliable, more observable

People Search
‣ Import user data into HBase
‣ Periodic MapRed
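
As a rough illustration of the MapReduce Workflow slide (tweets per user), the map, shuffle, and reduce steps can be simulated in plain Python on a single machine. This is only a sketch and not Hadoop code: the function names, the dict-with-`user_id` row layout, and the sample rows are assumptions made for illustration and do not come from the talk.

```python
from collections import defaultdict
from itertools import chain


def map_tweet(row_key, tweet):
    """Map step: emit (user_id, 1) for one tweet row.

    `tweet` is assumed to be a dict with a "user_id" field (hypothetical layout).
    """
    yield (tweet["user_id"], 1)


def shuffle(pairs):
    """Shuffle step: group values by key and sort the groups by user_id."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_counts(user_id, ones):
    """Reduce step: sum the 1s emitted for a single user_id."""
    return (user_id, sum(ones))


def tweets_per_user(tweets):
    """Run map, shuffle, and reduce over (row_key, tweet) pairs."""
    mapped = chain.from_iterable(map_tweet(k, t) for k, t in tweets)
    return [reduce_counts(user_id, ones) for user_id, ones in shuffle(mapped)]


if __name__ == "__main__":
    # Made-up sample rows: key = row number, value = tweet info.
    sample = [
        (0, {"user_id": 7, "text": "hello"}),
        (1, {"user_id": 3, "text": "pig"}),
        (2, {"user_id": 7, "text": "hadoop"}),
    ]
    for user_id, count in tweets_per_user(sample):
        print(user_id, count)
```

In a real cluster the map and reduce functions run in parallel on many machines and the framework performs the shuffle; here everything runs in one process so the data flow is easy to follow.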
