T6-S2-P1-韩小勇

Deep Dive – Amazon Elastic MapReduce
韩小勇

Why Amazon EMR?
• Easy to Use: Launch a cluster in minutes
• Low Cost: Pay an hourly rate
• Elastic: Easily add or remove capacity
• Reliable: Spend less time monitoring
• Secure: Managed firewalls
• Flexible: You control the cluster

Easy to deploy
• AWS Console
• Command Line
• Or use the EMR API with your favorite SDK

Easy to monitor and debug
• Monitor
• Debug
• Integrated with Amazon CloudWatch
• Monitor Cluster, Node and IO

Choose your instance types
Try different configurations to find your optimal architecture
• CPU: c3 family, cc1.4xlarge, cc2.8xlarge
• Memory: m2 family, r3 family
• Disk/IO: d2 family, i2 family
• General: m1 family, m3 family
Workloads: Batch process; Machine learning; Spark and interactive; Large HDFS

Resizable clusters
Easy to add and remove compute capacity on your cluster. Match compute demands with cluster sizing.

Easy to use Spot Instances
• Spot for task nodes: Up to 90% off EC2 on-demand pricing
• On-demand for core nodes: Standard EC2 pricing for on-demand capacity
• Meet SLA at predictable cost
• Exceed SLA at lower cost

Amazon EMR Integration with Amazon Kinesis
• Read Data Directly into Hive, Pig, Streaming and Cascading from Kinesis Streams
• No Intermediate Data Persistence Required
• Simple way to introduce real time sources into Batch Oriented Systems
• Multi-Application Support & Automatic Checkpointing

The Hadoop ecosystem can run in Amazon EMR
Use bootstrap actions to install applications…

• Amazon S3
  – Designed for 99.999999999% durability
  – Separate compute and storage
• Resize and shut down Amazon EMR clusters with no data loss
• Point multiple Amazon EMR clusters at same data in Amazon S3

EMRFS makes it easier to leverage Amazon S3
• Better performance and error handling options
• Transparent to applications – just read/write to “s3://”
• Consistent view
  – For consistent list and read-after-write for new puts
• Support for Amazon S3 server-side and client-side encryption
• Faster listing using EMRFS metadata

EMRFS client-side encryption
[Diagram: Amazon S3 (client-side encrypted objects); Amazon S3 encryption clients; EMRFS enabled for Amazon S3 client-side encryption; Key vendor (AWS KMS or your custom key vendor)]

Fast listing of Amazon S3 objects using EMRFS metadata
[Diagram: Amazon S3; EMRFS metadata in Amazon DynamoDB]
• List and read-after-write consistency
• Faster list operations

Number of objects    Without Consistent Views    With Consistent Views
1,000,000            147.72                      29.70
100,000              12.70                       3.69

*Tested using a single node cluster with a m3.xlarge instance.

HDFS is still there if you need it
• Iterative workloads
  – If you’re processing the same dataset more than once
  – Consider using Spark & RDDs for this too
• Disk I/O intensive workloads
• Persist data on Amazon S3 and use S3DistCp to copy to/from HDFS for processing

Amazon EMR – Design Patterns

EMR example #1: Batch Processing
• GB of logs pushed to S3 hourly
• Daily EMR cluster using Hive to process data
• Input and output stored in S3
• 250 Amazon EMR jobs per day, processing 30 TB of data

Long-running Cluster
• Data pushed to S3
• Daily EMR cluster ETL data into database
• 24/7 EMR cluster running HBase holds last 2 years of data
• Front-end service uses HBase cluster to power dashboard with high concurrency

EMR example #3: Interactive query
• TBs of logs sent daily
• Logs stored in Amazon S3
• Hive Metastore on Amazon EMR
• Interactive query using Presto on Multi-petabyte warehouse

Streaming data processing
• TBs of logs sent daily
• Logs stored in Amazon Kinesis
• Amazon Kinesis Client Library, AWS Lambda, Amazon EMR, Amazon EC2

Optimizations for storage

File formats
• Row Oriented
  – Text Files
  – Sequence Files
    • Writable object
  – Avro Data Files
    • Described by schema
• Columnar Format
  – Optimized Row Columnar (ORC)
  – Parquet
[Figure: Logical Table; Row oriented; Column oriented]

Choosing the right file format
• Processing and query tools
  – Hive, Impala and Presto
• Evolution of schema
  – Avro for Schema and Presto for Storage
• File format “splittability”
  – Avoid JSON/XML Files. Use them as records
• Compression – Block or File

File sizes
• Avoid small files
  – Avoid anything smaller than 100 MB
• Each mapper processes a single File
• Fewer files, matching closely to block size
  – Fewer calls to Amazon S3
  – Fewer network/HDFS requests

Dealing with Small Files
• Reduce HDFS Block Size, e.g. 1 MB (default is 128 MB)
  – --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop --args "-m,dfs.block.size=1048576"
• Better: use S3DistCp to combine smaller files together
  – S3DistCp takes a pattern and target path to combine smaller input files to larger ones
  – Supply a target size and compression codec

Compression
• Always Compress Data Files On Amazon S3
  – Reduces network traffic between Amazon S3 and Amazon EMR
  – Speeds Up Your Job
• Compress Mappers and Reducer Output
Amazon EMR compresses inter-node traffic with LZO with Hadoop 1, and Snappy with Hadoop 2

Choosing the right Compression
• Time sensitive, faster compressions are a better choice
• Large amount of data, use space efficient compressions
• Combined Workload, use gzip

Algorithm         Splittable?    Compression ratio    Compress + Decompress speed
Gzip (DEFLATE)    No             High                 Medium
bzip2             Yes            Very high            Slow
LZO               Yes            Low                  Fast
Snappy            No             Low                  Very fast

Cost saving tips for Amazon EMR
• Use S3 as your persistent data store; query it using Presto, Hive, Spark, etc.
• Only pay for compute when you need it
• Use Amazon EC2 Spot instances to save 80%
• Use Amazon EC2 Reserved instances for steady workloads
• Use CloudWatch alerts to notify you if a cluster is underutilized, then shut it down. E.g. 0 mappers running for N hours

DEMO: Reading Twitter Stream and show Top 10 #topics every minute. Using EMR Spark and Scala. Show the feature of “eas
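
To make the small-files advice from the deck concrete, the following is a minimal Python sketch of why combining files toward a target size cuts the number of mappers. It is not S3DistCp's actual algorithm: the function name group_small_files, the greedy packing strategy, and the input of 1,000 files of 2 MB each are all assumptions made for illustration.

```python
from typing import List

MB = 1024 * 1024


def group_small_files(sizes: List[int], target_size: int) -> List[List[int]]:
    """Greedily pack file sizes (in bytes) into groups of at most target_size.

    Hypothetical helper for illustration only; a file larger than
    target_size simply ends up in a group of its own.
    """
    groups: List[List[int]] = []
    current: List[int] = []
    current_total = 0
    for size in sizes:
        # Close the current group if adding this file would overshoot the target.
        if current and current_total + size > target_size:
            groups.append(current)
            current, current_total = [], 0
        current.append(size)
        current_total += size
    if current:
        groups.append(current)
    return groups


# Hypothetical input: 1,000 log files of 2 MB each, combined toward 128 MB.
sizes = [2 * MB] * 1000
groups = group_small_files(sizes, target_size=128 * MB)
print(len(sizes), "input files ->", len(groups), "combined files")
```

Since each mapper processes a single file, the combined layout in this sketch would need one mapper per group rather than one per small input file.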

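The compression guidance in the deck (faster codecs for time-sensitive jobs, space-efficient codecs for large data, gzip for combined workloads) can be sketched as a small lookup over the deck's comparison table. The table values below are copied from the slides; the function pick_codec, its priority names ("speed", "size", "combined"), and the numeric rankings are assumptions introduced for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Codec:
    name: str
    splittable: bool
    ratio: str  # compression ratio label from the deck's table
    speed: str  # compress + decompress speed label from the deck's table


# Values as listed in the deck's comparison table.
CODECS: List[Codec] = [
    Codec("gzip", False, "High", "Medium"),
    Codec("bzip2", True, "Very high", "Slow"),
    Codec("LZO", True, "Low", "Fast"),
    Codec("Snappy", False, "Low", "Very fast"),
]

# Assumed orderings of the table's qualitative labels.
SPEED_RANK = {"Slow": 0, "Medium": 1, "Fast": 2, "Very fast": 3}
RATIO_RANK = {"Low": 0, "High": 1, "Very high": 2}


def pick_codec(priority: str, need_splittable: bool = False) -> Codec:
    """Pick a codec for a hypothetical priority: 'speed', 'size' or 'combined'."""
    candidates = [c for c in CODECS if c.splittable or not need_splittable]
    if priority == "speed":
        return max(candidates, key=lambda c: SPEED_RANK[c.speed])
    if priority == "size":
        return max(candidates, key=lambda c: RATIO_RANK[c.ratio])
    if priority == "combined":
        # The deck recommends gzip here; it is not splittable per the table.
        for codec in candidates:
            if codec.name == "gzip":
                return codec
        raise ValueError("gzip is not splittable; relax need_splittable")
    raise ValueError(f"unknown priority: {priority}")


print(pick_codec("speed").name)
print(pick_codec("speed", need_splittable=True).name)
print(pick_codec("size").name)
print(pick_codec("combined").name)
```

Keeping the table as data rather than hard-coding the answers makes it easy to adjust if measurements on a real workload disagree with the qualitative labels.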