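Hadoop Application Sharing
Data Warehouse Log Platform
许玉勤 (Xu Yuqin), 郭鹏 (Guo Peng)

Agenda
• Hadoop overview
• Hadoop in the data warehouse (DW)
• Things to watch for when writing MapReduce (MR) programs
• Introduction to the Hadoop subprojects
• Mature commercial systems

Hadoop overview

HDFS
• Automatic data backup and disaster recovery
• Support for storing large files
• Write once, read many times

MapReduce
• Provides an easy but general model for programmers to use cluster resources
• Hides network communication (i.e. RPCs)
• Hides storage details; file chunks are automatically distributed and replicated
• Provides transparent fault tolerance: failed tasks are automatically rescheduled on live nodes
• High throughput and automatic load balancing, e.g. scheduling tasks on nodes that already have the data

Hadoop is not…
• An operating system
• A programming language
• Meant for online processing
MapReduce is a programming paradigm!

Limitations of Hadoop
• HDFS and the JobTracker are both single points of failure
• HDFS does not support modifying files
• MapReduce is not suited to real-time computation

RDBMS

Grid Computing
• Message Passing Interface (MPI)
• Access a shared filesystem, hosted by a SAN
• Coordinating the processes in a large-scale distributed computation is a challenge

Volunteer Computing
• SETI@home: Search for Extra-Terrestrial Intelligence
• A SETI@home work unit is about 0.35 MB of radio telescope data, and takes hours or days to analyze on a typical home computer.

Hadoop in the data warehouse (DW)

Website access log analysis
• Upload the access logs collected by Apache to HDFS.
• Run MapReduce to compute the analysis results.
• Download the analysis results from HDFS and import them into Oracle.

Recommendation engine
• Upload the data to be analyzed to HDFS.
• Run MapReduce to compute the analysis results.
• Download the analysis results from HDFS and import them into Oracle.

Future applications
• MapReduce
• Other Hadoop subprojects

Things to watch for when writing MapReduce (MR) programs

Multi-language programming -- Streaming
• MapReduce programs can be written in any language.
• Each Map or Reduce program you write is a standalone program that can be run directly.
• A Map or Reduce program exchanges its input and output through the program's standard input and standard output.

Python version

Character encoding
• org.apache.hadoop.io.Text uses UTF-8 encoding by default
• Use Text.getBytes()

Passing global variables
• org.apache.hadoop.mapred.JobConf
• config.set(key, value);
• String value = job.get(key);

MR initialization and teardown

Partitioning
• Specifies which keys go to the same reduce for computation
• org.apache.hadoop.mapred.Partitioner

Sorting
• The MapReduce framework sorts the records by key before they reach the reducers
• For any particular key, values are not sorted

Secondary sort
1900 35°C
1900 34°C
1900 34°C
...
1901 36°C
1901 35°C

Secondary sort (continued)
• Make the key a composite of the natural key and the natural value.
• The key comparator should order by the composite key, that is, the natural key and natural value.
• The partitioner and grouping comparator for the composite key should consider only the natural key for partitioning and grouping.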
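
The three secondary-sort rules can be tried out without a cluster. The sketch below is a plain-Python simulation of the shuffle, not Hadoop code: the helper names (partition, sort_key, shuffle), the modulo partitioning, and the choice of descending temperature order are all assumptions made for illustration, with sample records taken from the secondary sort slide.

```python
from itertools import groupby

# (year, temperature) pairs: the composite key is (natural key, natural value).
records = [(1900, 35), (1900, 34), (1901, 36), (1900, 34), (1901, 35)]

def partition(composite_key, num_reducers):
    """Partition on the natural key (year) only, so a year never spans reducers."""
    year, _temp = composite_key
    return year % num_reducers

def sort_key(composite_key):
    """Key comparator: year ascending, then temperature descending."""
    year, temp = composite_key
    return (year, -temp)

def shuffle(records, num_reducers=2):
    """Simulate the shuffle: partition, sort by composite key, group by year."""
    partitions = [[] for _ in range(num_reducers)]
    for key in records:
        partitions[partition(key, num_reducers)].append(key)
    result = {}
    for part in partitions:
        part.sort(key=sort_key)
        # Grouping comparator: only the natural key (year) decides the group.
        for year, group in groupby(part, key=lambda k: k[0]):
            result[year] = [temp for _year, temp in group]
    return result

grouped = shuffle(records)
# Values arrive already sorted, so the first value per year is the maximum.
max_temps = {year: temps[0] for year, temps in grouped.items()}
```

Because the values reach each reduce call already ordered, the reducer can read the maximum from the first value instead of scanning or buffering the whole group.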
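
To make the Streaming slide concrete, here is a minimal sketch of a mapper and reducer that talk only through standard input and output, applied to the access-log scenario from the DW section. It assumes Apache common log format, where the request path is the seventh whitespace-separated field; the function names and the map/reduce command-line argument are hypothetical, not part of Hadoop.

```python
import sys
from itertools import groupby

def mapper(lines):
    """Emit 'path<TAB>1' for each access-log line (path assumed to be field 7)."""
    for line in lines:
        fields = line.split()
        if len(fields) > 6:
            yield f"{fields[6]}\t1"

def reducer(lines):
    """Sum the counts per key; the input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(count) for _key, count in group)}"

# Run as a standalone program: "map" or "reduce" selects the step.
if __name__ == "__main__" and sys.argv[1:2] in (["map"], ["reduce"]):
    step = mapper if sys.argv[1] == "map" else reducer
    for out in step(sys.stdin):
        print(out)
```

Since each step is an ordinary program, it can be checked locally before submitting the job, for example by piping a log file through the map step, then sort, then the reduce step.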
Counter
• Counters are a useful channel for gathering statistics about the job
• reporter.incrCounter("counterName", 1);

Introduction to the Hadoop subprojects

Hive
Hive is a data warehouse infrastructure built on top of Hadoop. It provides a set of tools for data extraction, transformation, and loading (ETL), and a mechanism for storing, querying, and analyzing large-scale data stored in Hadoop. Hive defines a simple SQL-like query language called QL, which lets users familiar with SQL query the data. The language also lets developers familiar with MapReduce plug in custom mappers and reducers for complex analysis that the built-in mappers and reducers cannot handle.

Hive vs. Pig

HBase
• Convenient base classes for backing Hadoop MapReduce jobs with HBase tables
• Query predicate push down via server side scan and get filters
• Optimizations for real time queries
• A high performance Thrift gateway
• A REST-ful Web service gateway that supports XML, Protobuf, and binary data encoding options

ZooKeeper
ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.

Mature commercial systems

Amazon Web Services

Google App Engine
• dynamic web serving, with full support for common web technologies
• persistent storage with queries, sorting and transactions
• automatic scaling and load balancing
• APIs for authenticating users and sending email using Google Accounts
• a fully featured local development environment that simulates Google App Engine on your computer
• task queues for performing work outside of the scope of a web request
• scheduled tasks for triggering events at specified times and regular intervals

Q&A

Thanks.
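
As a footnote to the Counter slide: a Streaming program has no reporter object, but Hadoop Streaming picks up counter updates written to standard error in the form reporter:counter:<group>,<counter>,<amount>. The sketch below wraps that convention; the helper names and the LogStats/badLines counter are hypothetical examples.

```python
import sys

def counter_line(group, counter, amount=1):
    """Format a Hadoop Streaming counter update (names must not contain commas)."""
    return f"reporter:counter:{group},{counter},{amount}"

def incr_counter(group, counter, amount=1, stream=sys.stderr):
    """Write the counter update to stderr, where Streaming looks for it."""
    stream.write(counter_line(group, counter, amount) + "\n")
```

A mapper could call incr_counter("LogStats", "badLines") each time it skips a malformed record, so the total shows up with the job's other counters.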