Cloud Computing and Big Data - Xiaofeng Meng


Cloud Computing and Big Data
Xiaofeng Meng
Renmin University of China
Data Mining Teaching Workshop, Beijing, August 9, 2012

Outline
  1. Introduction to Big Data
  2. Cloud Computing and Big Data
  3. Challenging Problems
  4. Our Work
  5. Conclusion


1. Introduction to Big Data

Big Data is so hot!
  - Google Trends of Big Data
  - Big Data Across the Federal Government (USA, March 2012)

What is Big Data?

DB (Database) vs. BD (Big Data)
  - "Small data", Very Large Database (VLDB): MB scale, structured data. Data is the object; the goal is to solve its storage and management problems. This is Data Engineering.
  - Big Data, Extremely Large Database (XLDB): PB scale, unstructured data. Data is the resource; the goal is to solve problems across many domains. This is Data Thinking.

The digitization of society and the socialization of the digital
  - Digitization of society: data footprints (dataprint)
      - In the digital age, people of all kinds leave ever richer data footprints, intentionally or not.
      - Data footprints carry social meaning and embody social structure.
  - Socialization of the digital: data footprints and their structure are themselves a part of social structure and social processes, and they continually shape new social orders and relationships.

Data Thinking: computational social science
  - All social explanation, monitoring, prediction, and planning depends on collecting, organizing, and analyzing data footprints.
  - Computational social science method: the process and activity of collecting, organizing, and analyzing data footprints, driven by specific social needs and guided by specific social theories, in order to explain, monitor, predict, and plan.

What Can Big Data Do?
  - Prediction

What Can Big Data Do?
  - Wall Street sells stocks based on public sentiment.
  - Hedge funds analyze companies' product sales from customer reviews on shopping sites.
  - Banks infer the employment rate from the number of postings on job sites.
  - Investment firms collect and analyze statements from listed companies, looking for early signs of bankruptcy.
  - The US Centers for Disease Control and Prevention analyzes the worldwide spread of influenza and other epidemics from web searches.
  - US President Obama's campaign team analyzes voters' preferences among presidential candidates in real time from voters' microblog posts.

What Can Big Data Do?
  - Fraud detection
  - Healthcare
  - Transportation
  - Telecommunications
  - Life sciences
  - Financial transactions
  - ...

Yin Ruins at Anyang (1300 BC, 3,300 years ago)
  - Oracle bone pit, more than 17,000 pieces

Big Data Application

  Application           Users   Accuracy         Reliability      Data volume   Response
  Scientific computing  Few     Very high        Low to medium    Tera          Slow
  Stock trading         Many    High             Very high        Giga          Fast
  Web data              Many    Medium to high   Medium           Peta          Fast
  Microblog data        Many    Medium to high   Medium           100 Peta      Fast
  ...


2. Cloud Computing and Big Data

Cloud Computing and Big Data
  - Cloud computing is like a highway that can support a variety of transportation.
  - Big data can be seen as one vehicle on that highway.
  - Cloud computing is the infrastructure, while big data is its service object.

Big Data Analysis Pipeline
  Acquisition -> Extraction & Cleaning -> Integration -> Analysis -> Interpretation
  - Collaboration with cloud computing can greatly promote these processes.
  From:

Acquisition
  - Multiple data sources and huge volumes
  - Much of this data is of no interest
  - Data reduction is important

Extraction & Cleaning
  - Various data types: structured and unstructured
  - Extraction is often highly application-dependent
  - Missing and erroneous information should be cleaned

Integration
  - The right metadata is needed
  - There has to be some translation of data as it flows from one model (platform) to the other, e.g. transferring data from Hadoop to DB2

Analysis
  - Fundamentally different from traditional statistical analysis on small samples
  - Real-time analysis
  - Lack of coordination between database systems

Big Data Analytics Tools in Use (bar chart; values in the order listed on the slide)

  Oracle Exadata                              21%
  Microsoft SQL PDW                           18%
  IBM DB2 Smart Analytics System              12%
  Hadoop/MapReduce                            11%
  IBM Netezza                                 10%
  HP Vertica                                   9%
  Teradata EDW                                 9%
  EMC Greenplum                                8%
  Sybase IQ                                    4%
  Infobright                                   2%
  Kognitio WX2                                 1%
  ParAccel Analytic Database                   1%
  Other                                        3%
  We aren't using big data analytics tools    35%
  Don't know                                  11%

Processing models
  - Batch processing: MapReduce
  - Stream processing: Storm (Twitter), S4 (Yahoo!)

Interpretation
  - Big data is of limited value if users cannot understand the analysis
  - The provenance of the result data
  - Data visualization


3. Challenging Problems

Data, Data and Data!

Difficult to get the data
  - Data is all around you!
  - Data types are varied
  - Most data is held by companies
  - It is difficult for researchers to get the data

No Size Fits All
  - Web data
  - Science data
  - Financial data
  - Moving object data
  - ...

Scale
  - We must store everything because we don't know which part of the data is valuable.
  - Finding a needle in a haystack

"Data is widely available; what is scarce is the ability to extract wisdom from it."
  Hal Varian, Google's chief economist
  - "Fishing in the ocean" vs. "fishing in a pond"

Timeliness
  - Many situations need the result of an analysis immediately
  - Real-time processing can be a challenge with big data, especially in dynamic data environments like financial trading and social media
  - Develop partial results in advance and then do incremental computation
  - New index structures are required
  From:

Parallelism
  - Parallelism across nodes in a cluster
  - Parallelism within a single node
  - Cloud computing
  - New hardware: SSD, PCM, ...

[Figure: evolution of the memory and storage hierarchy (1980, 2008, 2013+). Components shown: CPU, RAM, flash SSD, SCM, disk, tape. Tiers shown: logic, memory, active storage, archival. Memory-like devices are fast and synchronous; storage-like devices are slow and asynchronous.]

Privacy
  - Managing privacy is both a technical and a sociological problem
  - New data sources bring new problems: LBS, microblogs, ...
  - Share private data while limiting disclosure and ensuring sufficient data utility in the shared data
  - Differential privacy is a very important step, but it reduces information content too far to be useful in most practical cases
  From:


4. Our Work

Big data management framework
  Characteristics of big data:
  - Multi-source and heterogeneous: considerable heterogeneity across sources
  - Widely distributed: spread across many regions
  - Dynamic growth: grows fast and is updated fast
  - Data before schema: the data exists first, and the schema comes later
  How can massive data be managed efficiently?
  Web Data Management (2000-now) 2010200
