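Big Data Platform Technical Exchange
Wu Minda – Senior Technical Consultant
© 2011 IBM Corporation | Information Management

What Is Big Data
Big data technology is the ability to quickly extract valuable information from huge volumes of data of many different types.
• Variety: managing complex, multi-faceted relational and non-relational data (Are you overlooking unstructured data when you make decisions?)
• Velocity: streaming data, or large volumes of data in motion (Do you want to deliver better results through real-time operations?)
• Volume: data volumes ranging from TB scale to ZB scale (Have you collected all of your data, and are you using it?)
• Veracity: one in three leaders does not trust the information they receive when making business decisions

Big Data Reference Architecture
Going beyond the traditional data warehouse concept
[Diagram: three tiers (stream computing, Internet scale, traditional data warehouse) fed by traditional/relational and non-traditional/non-relational data sources. Labels: In-Motion Analytics; Data Analytics, Data Operations & Model Building; Internet Scale; Database & Warehouse; At-Rest Data Analytics; Results; Ultra Low Latency Results; InfoSphere BigInsights]

IBM Big Data Platform and Application Framework
Cloud | Mobile | Security
• Analytic Applications: BI/Reporting, Exploration/Visualization, Functional App, Industry App, Predictive Analytics, Content Analytics
• IBM Big Data Platform
  – Systems Management
  – Applications & Development
  – Visualization & Discovery: collect, extract, and explore data visually
  – Accelerators: speed up application development and shorten the time to analytic value
  – Hadoop System: analyze PB-scale structured and unstructured data at low cost
  – Stream Computing: analyze streaming data and gain insight from big data
  – Data Warehouse: in-warehouse analytics on operational or historical data
  – Contextual Discovery: context-aware analysis based on indexing and federation
  – Information Integration & Governance: data governance (data quality, lifecycle, …)

Agenda
• IBM Hadoop platform: BigInsights
• IBM stream computing: Streams
• IBM data warehouse platform: PureData
• Data analysis on the big data platform: Data Explorer
• Summary of IBM's big data advantages

The Forrester Wave Report on Big Data

BigInsights Enterprise Edition
[Diagram: BigInsights Enterprise Edition components; legend: Open Source / IBM]
• Connectivity and integration: Streams, Netezza, Text processing engine and library, JDBC, Flume
• Infrastructure: Jaql, Hive, Pig, HBase, MapReduce, HDFS, ZooKeeper, Indexing, Lucene, Adaptive MapReduce, Oozie, Text compression, Enhanced security, Flexible scheduler
• Optional IBM products; analytics and discovery; applications: DB2, BigSheets, Web Crawler, Distrib file copy, DB export, Boardreader, DB import, Ad hoc query, Machine learning, Data processing, ...
• Administration and development tools
  – Administration console
    • Monitor cluster health, jobs, etc.
    • Add/remove nodes
    • Start/stop services
    • Inspect job status
    • Inspect workflow status
    • Deploy applications
    • Launch apps/jobs
    • Work with distrib file system
    • Work with spreadsheet interface
    • Support REST-based API
    • ...
  – R
  – Eclipse development tools
    • Text analytics
    • MapReduce programming
    • Jaql, Hive, Pig development
    • BigSheets plug-in development
    • Oozie workflow generation
  – Integrated installer
• Other components shown: IBM Cognos BI, BigSQL, Accelerator for machine data analysis, Accelerator for social data analysis, Guardium, DataStage, Data Explorer, Sqoop, HCatalog, GPFS-FPO

BigInsights Advantages

Enterprise-grade infrastructure
• High Performance & Availability: GPFS-FPO
  – At least 2X faster than open source Hadoop
  – 17x throughput speedup for document index lookups
  – Fault resistance for real-time data
  – POSIX
• Adaptive MapReduce
• SQL Interface (BigSQL)
• Integrated Install & Mgt Consoles
• Security: LDAP+
• High-speed LZO compression
• Development tooling
  – environment,
  – testing, and
  – optimization
• Warehouse RDBMS & Streams integration

Enterprise-grade analytics
• SystemT
  – Text analytics
  – Blazing fast; uses unstructured data and does not require structuring (MapReduce)
  – Customized annotators
• BigSheets
  – Insight engine for analytics on massive amounts of data in BigInsights
  – Puts the power of Map/Reduce within reach of the business professional through a familiar spreadsheet-like environment
  – Built-in visualizations
• SystemML
  – Machine learning (Watson)
  – ML algorithms implemented directly on MapReduce
  – Deep statistical/mining capabilities embedded into the BigInsights platform
• BigIndex
  – Distributed indexing and search
  – Parallel indexing and search

GPFS-FPO vs. HDFS: Metric-by-Metric Comparison

Metric                              | BigInsights GPFS-FPO                             | Open source HDFS or other solutions
Robustness                          | No single point of failure; 99.99%               | NameNode is a single point of failure
Data consistency                    | High                                             | Data may be lost
Scalability                         | Thousands of nodes; 4000+ measured in testing    | Thousands of nodes
POSIX compatibility                 | Fully compatible                                 | Limited
Data management capabilities        | Security, backup, snapshots, caching, replication | Limited
Traditional application performance | Good; balances read and write performance        | Poor random read/write performance
Security                            | ACLs, capacity limits, secure authentication     | Not supported

IBM Adaptive MapReduce
Provides powerful enterprise-grade management for running distributed applications and big data analytics on a scalable shared grid. It accelerates dozens of parallel applications, delivering results sooner and making better use of all available resources.
[Chart labels: TeraSort, Throughput, SWIM; callouts: 10 times fewer CPU cores, 6 times faster, 60 times faster]
• Berkeley SWIM is a workload benchmark developed at the University of California at Berkeley. It measures the core scheduling efficiency of MapReduce workloads (Hadoop World 2011).
• Multi-tenant resource management
• 10x less hardware for the fastest TeraSort score

BigSQL: Native SQL Support for Hadoop
• Native SQL support in BigInsights
  – ANSI SQL 92+
  – Standard syntax support (joins, data types, …)
• True JDBC/ODBC
  – Prepared statements
  – Cancel support
  – Database metadata API support
  – Secure socket connections (SSL)
• Optimization
  – Leveraging MapReduce parallelism or…
  – Direct access for low-latency queries
• Multiple data sources
  – HBase (including secondary indexes)
  – CSV, delimited files, sequence files
  – JSON
  – Hive tables
[Diagram labels: Application, JDBC/ODBC Driver, JDBC/ODBC Server, BigSQL Engine, SQL, BigInsights Data Sources (Hive tables, HBase tables, CSV files)]

Using Reporting Tools
• Cognos BI Server can push computation down to BigInsights
• Faster response times
• None of Hive's limitations
[Diagram labels: Cognos BI Server (Explore & Analyze, Report & Act); SQL Interface via JDBC; InfoSphere BigInsights: Application (Map-Reduce), Storage (HBase, HDFS)]

Existing Tools Still Work: SQuirreL SQL
• Using existing SQL tooling against Big Data
• Support for “standard” authentication!! (not supported for Hive, but supported by BigSQL!)

Existing Tools Still Work: Eclipse
• Using existing SQL tooling against Big Data
• Same setup as for existing SQL sources!!
• Support for “standard” authentication!!

Integrated Web-Based Installation
• Seamless installation in single-node or cluster mode
• Installs both the open source components and the IBM components
• Verification checks confirm that the system is running correctly

Web-Based Management Console
• Job and workflow management
• System health monitoring
• Cluster and file system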
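
The BigSQL slide lists standard join syntax and prepared statements among its features. As a minimal sketch of what those two features mean for a client, the Python example below runs a parameterized join. It uses the standard-library sqlite3 module purely as a stand-in database and does not connect to BigSQL; the table names, sample data, and the helper functions `build_demo_db` and `top_pages` are all hypothetical.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory stand-in database (hypothetical tables and data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE pages (id INTEGER PRIMARY KEY, title TEXT)")
    conn.execute("CREATE TABLE visits (page_id INTEGER, visitor TEXT)")
    conn.executemany(
        "INSERT INTO pages VALUES (?, ?)",
        [(1, "home"), (2, "docs"), (3, "blog")],
    )
    conn.executemany(
        "INSERT INTO visits VALUES (?, ?)",
        [(1, "a"), (1, "b"), (1, "c"), (2, "a"), (2, "b"), (3, "a")],
    )
    return conn


def top_pages(conn, min_hits):
    """Run a parameterized join; '?' is sqlite3's placeholder syntax.

    The SQL text stays fixed and only the bound value changes, which is
    the idea behind a prepared statement.
    """
    sql = (
        "SELECT p.title, COUNT(*) AS hits "
        "FROM visits v JOIN pages p ON p.id = v.page_id "
        "GROUP BY p.title HAVING COUNT(*) >= ? "
        "ORDER BY hits DESC, p.title"
    )
    return conn.execute(sql, (min_hits,)).fetchall()


if __name__ == "__main__":
    demo = build_demo_db()
    for title, hits in top_pages(demo, 2):
        print(title, hits)
```

Against a real BigSQL server the connection would go through the JDBC/ODBC driver named on the slide rather than sqlite3, and the placeholder syntax would follow that driver's conventions.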