解構⼤大數據架構⼤大數據系統的伺服器與網路資源規劃“Howtoeatanelephant–onebyteatatime”CPLi李俊邦EnterpriseTechnologistEnterpriseSolutions&Alliances,GreaterChinaDell2議程1. 不同的伺服器⾓角⾊色1. Manager2. NameNodes3. EdgeNodes4. DataNodes2. HadoopCluster設計3. Etu+Dell4. Futures/Roadmap5. Questions?3ServerRoles-Manager• 系統安裝圖形介⾯面/主控台• ⼤大多安裝在EdgeNode• 常⾒見版本– ClouderaManager– ApacheAmbari4ServerRoles–NameNodes• 存放HDFS的metadata• JobManagerforYARNdata-processingframework• Primary– Heartbeatsfromdatanodes– 10thheartbeatisablockreportfromwhichitgeneratesmetadata• Standby– Checksineveryhourtomirrormetadata/blockmap– Notahot-spare–requiresmanualfail-over• HighAvailability(HA)canbeaddedinsomedistributions– ResultsinadedicatedHAnodethatactsasawitnesstotheNameNodecluster5ServerRoles-EdgeNodes• 資料進出Hadoop叢集的主要端⼝口• 可擴展• Hadoop叢集裡唯⼀一的多網段節點PowerEdge R730 – Name NodePowerEdge R730 – Standby Name NodePowerEdge R730 – Edge Node(s)PowerEdge R730 – HA NodeCorporate NetworkData NetworkCorporateData NetworkData NetworkData NetworkData NetworkPowerEdge R730XD – Data NodesData Network6ServerRoles-DataNode• HDFS的主要存放處• 執⾏行YARN資源管理所指定的資料處理• 主要屬性– 記憶體› 標配64GB› 更多服務(Impala/Spark)需要更多記憶體– 很多的本地硬碟(JBOD/Non-RAIDmode)› SFF(2.5”)forperformance-basedworkloads› LFF(3.5”)forcapacity-centricworkloads– CPUs–legacyrecommendationof1:1core:spindleratio› SSDs,fasterHDD(10K+),andin-memoryworkloadsmakethislessofanissue› 10and12corearethebestpracticedefaultHadoopClusterDesign8HadoopClusterDesign–HardwareConsiderations9HadoopClusterDeployment–InstallationBestPractices• Usepre-built,assembled&cabledracksfromvendor• ⾃自動佈署⼯工具(ex:OpenCrowbar)• Purchasenodesinstandardsizegroupsforeasycapacitygrowthandordering,notinsinglenodeincrements– Commonincrementsare½orfullrackforeasydeploymentandsizing• Foreachtypeofhardware,purchasesparecomponentstokeeponsiteforeasy,rapidrepair10CoreHadoopUseCases歸檔⾼高硬碟/CPU⽐比記憶體使⽤用低法規需求⻑⾧長期歸檔資料處理⾼高硬碟/CPU⽐比記憶體使⽤用中等DWoffloadETLoffloadEDH質量分析ITLog分析分析⾼高核⼼心數記憶體使⽤用⾼高市場分析詐欺預防網路分析11CommonHadoopUseCasetoEcosystemToolMapping12HadoopUseCasetoRatioMapping歸檔1:2:1資料處理1:4:1分析2:8:1CPU(Cores):Memory(GB):Disk(數量)–DataNode13NodeConsiderationsDellPowerEdgeR730DellPowerEdgeR730DellPowerEdgeR730DellPowerEdgeR730xd14NodeConsiderations15HDFSCapacity• HDFSprotectsinformationthroughreplicationofthedatabetweennodes,thedefaultReplicationFactoris3,butisconfigurable.• HDFSRawCapacity=NumberofComputeNodesxNumberofDrivesxCapacityofDrives• HDFSUsableCapacity=HDFSRawCapacity/ReplicationFactor16BigDataNetworkingBestPractices• TraditionalEthernetisusedsinceit’saffordableandalreadyprevalent.• 1GbEnetworkingwasusedinitiallyinearlydraftsofthesolutionbutwiththereductionincostit’smuchmoreefficienttogowith10GbE.• Multipleportsareteamedbothforredundancyandthroughput.LACPorsoftwarebondingarethemostcommonmethods.• IPv4ismostwidelyused.IPv6haslimitedsupportattheOSandHadooplevel.17AttributesofaGoodSwitchforBigData• Non-blockingbackplane• Deepper-portpacketbuffers(sharedbuffersdonotworkwell).Duringsort/shufflephasesofmap/reduceoperationsnetworktrafficissochaoticthatitcansaturateanyandallsharedbuffers,impactingmultiplehost’snetworkperformance.• Goodchoices:– 1GbE› S55› S60– 10GbE› S4810› S5000– 40GbE› Z9000› Z9500› S600018DellHadoopSolutionLogicalDiagram19Scale-outAggregationLayer20DellPointsofIntegration• VLT/VRRPisaveryaffordablewaytoteamswitchesbothattheToRandtheaggregationtiers.ThismakestheDellNetworkingForce10switchesagreatchoice.• ActiveFabricManager– SpeedsupthecreationandadministrationoftherequiredVLT/VRRPconfigurationontheswitches.– Helpswithcapacity-planningascustomerscale21BigDataNetworkingFutures• 40GbEonboardLOMswillbegintobeusedforhigh-volumeclusters.Rightnowthecost:benefitratioisn’tthereyet.• AsHPCandBigDataconverge,we’llstarttoseetheuseofIBfornode-to-nodeconnectivity.• In-memory(Spark/Impala)workloadsarereducingthebottlenecksthatusedtoexistatthediskandnowmovetotheprocessorandnetwork.Expectcustomerstobelookingtoincreasecorecountsandnetworkspeedtoovercomethis.@Dell_EnterpriseEnterpriseSolutionsEtu+Dell=completeHadoop/BigDatasolutionproviderBestofbreedClouderapartners-EtuAnalyticsoftwaresolutionsforBigDataDellProfessionalServicesforBigDataDellPowerEdge13GserversDellNetworkingsolutionsInstallationandconfigurationserviceCompleteend-to-endimplementationDiscoverPlanImplementInvestigate232.Store1.Integrate4.Act3.AnalyzeSolutionarchitectureAnalyticaloutputToadDataPointDesktop–integrate,cleanseDellBoomiCloud–integrate,correlateToadIntelligenceCentralDataaggregationandvirtualizationDellSTATISTICACustomerdataOrderdataEventsStockmarketdataAdvancedAnalyticsMarketingcampaignsDellStatisticaBigDataDesktop–crawl,saveSocialMedia24Futures• SpeedImprovementsinMap/Reduce• Morein-memoryworkloads– PossiblemovetoSparktoreplaceMap/Reduce• VirtualizedHadoop– VMWareBigDataExtensions– OpenstackSahara– MicrosoftHDInsights(Hortonworks)25DellIn-MemoryApplianceforClouderaEnterpriseConfigurationsataglanceMid-SizeConfiguration16NodeClusterPowerEegeR720-4InfrastructureNodeswithProSupportPowerEdgeR720XD-12DataNodesw