Big Data
Weiping Chen

Topics
• What is Big Data?
• Why ‘Big Data’ is a big deal?
• NoSQL vs SQL
• How to Deal with Big Data?
• What’s Hadoop/MapReduce?
• RDBMS vs Hadoop/MapReduce
• Big data players/Software Tools/Platforms
• Examples

What Is Big Data?
• Capturing and managing lots of information
• Working with many new types of data: structured/unstructured
• Exploiting these masses of information and new data types with new styles of applications
• Bigger than terabytes: volume, variety, velocity, variability

Why ‘Big Data’ Is a Big Deal
Big data differs from traditional information in mind-bending ways:
Not knowing why but only what.
The challenge with leadership is that it’s very driven by gut instinct in most cases.
Air travelers can now figure out which flights are likeliest to be on time, thanks to data scientists who tracked a decade of flight history correlated with weather patterns.
Publishers use data from text analysis and social networks to give readers personalized news.
Healthcare is one of the biggest opportunities. If we had electronic records of Americans going back generations, we'd know more about genetic propensities, correlations among symptoms, and how to individualize treatments.
Google map searches correlate to “Open retail store etc.”

What This Means for You
Big Data can help a company do many things:
• Profile customers
• Determine pricing strategies
• Identify competitive advantages
• Better target advertising
• Inform internal research and product development
• Strengthen customer service

Main Steps in Adopting an Analytical System
• What Will We Analyze?
• Do We Buy or Build?
• Are We Ready to Invest?
• Do We Understand the Impact?

Challenges
• Information growth
• Processing power
• Physical storage: disk capacity has increased dramatically, but reading from disk runs at about 100 MB/s (the bottleneck), and data seek time is slower than data transfer
• Data issues
• Costs

Recent IT Trends
• Commodity hardware
• Distributed file systems
• Open source operating systems, databases, and other infrastructure
• Significantly cheaper storage
• Service-oriented architecture

Big Data Chain
• Collect Data
• Ingest/Clean Data (Originally ETL. Existing schema)
• Human exploration/Infrastructure/Data mining
• Store/Archive
• Share (decision making, other systems)
• Measure/feedback

ACID
• ACID (Atomicity, Consistency, Isolation, Durability)
• (A) when you do something to change a database, the change should work or fail as a whole
• (C) the database should remain consistent (this is a pretty broad topic)
• (I) if other things are going on at the same time, they shouldn't be able to see things mid-update
• (D) if the system blows up (hardware or software), the database needs to be able to pick itself back up; and if it says it finished applying an update, it needs to be certain

MapReduce
• Dividing and conquering
• Highly fault tolerant: nodes are expected to fail
• Every data block (by default) is replicated on 3 nodes (replication is also rack aware)
• Difficult to implement

RDBMS
• Fixed-schema, row-oriented databases with ACID properties and a sophisticated SQL query engine.
• The emphasis is on strong consistency, referential integrity, abstraction from the physical layer, and complex queries through the SQL language.
• Easily create secondary indexes, perform complex inner and outer joins, count, sum, sort, group, and page your data across a number of tables, rows, and columns.

RDBMS vs MapReduce
RDBMS                          MapReduce
mostly structured data         unstructured data
data has internal structure    none (handled during processing)
needs normalized data          non-normalized data

Notes:
1. Relational databases are starting to incorporate some of the ideas from MapReduce (such as Aster Data’s and Greenplum’s databases).
2. In the other direction, higher-level query languages built on MapReduce (such as Pig and Hive) make MapReduce systems more approachable for traditional database programmers.

Architectures

How Does MapReduce Work
• HDFS (Hadoop Distributed File System): data is stored on local disk and processing is done locally on the computer with the data
• Can work with raw data stored in a file system or database
• Two steps: Map and Reduce

Map
• MapReduce uses key/value pairs. (Traditionally using rows and columns)
Example:
last name / chen
withdraw amount / 20
transaction date / 06-23-2013

Reduce
• All the intermediate values for a given output key are combined together into a list.
• The reduce() function then combines the intermediate values into one or more final values for the same key.

Hadoop
• Hadoop is designed to abstract away much of the complexity of distributed processing
• Different from GRID computing
• Widely used:
Social media (e.g., Facebook, Twitter)
Life sciences
Financial services
Retail
Government

Hadoop Architecture
• Application layer/end user access layer
a. JobTracker (workload management layer)
b. Distributed parallel file systems/data layer

Hadoop Implementation
• Hadoop is designed to run jobs that last minutes or hours on trusted, dedicated hardware running in a single data center with very high aggregate bandwidth interconnects

Design of HDFS
• Namenodes (The Master): manage metadata/file trees
• Datanodes (Workers): store/retrieve data blocks
• Datanodes do not use RAID disks. HDFS round-robins HDFS blocks between all disks. RAID is limited by the slowest disk in the array.

Limitations of HDFS
• Low-latency data access
• Lots of small files
• Multiple writers, arbitrary file modifications

HDFS
• Block size 64 MB/128 MB (a normal disk block is 512 bytes). Large blocks minimize ‘seek’ time. The unit of storage is a fixed-size block rather than a file, which makes storage and replication easy.
% hadoop fsck / -files -blocks
% hadoop fs -help (regular file system operations)
% hadoop fs -copyFromLocal input/docs/quangle.txt hdfs://localhost/user/tom/quangle.txt
% hadoop fs -mkdir books
% hadoop fs -ls

Data Flows, Formats and Types
• MapReduce model in detail, and, in particular, how data in various formats, from simple text to structured binary objects, can be used with this model map
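
The Map and Reduce steps described earlier can be illustrated with a minimal single-process sketch in plain Python. This is not the Hadoop API; it only mimics the data flow (map to key/value pairs, group values by key into a list, reduce each list to a final value). The field names `last_name`, `withdraw_amount`, and `transaction_date`, the function names, and every record other than the "chen / 20 / 06-23-2013" one are assumptions made up for illustration.

```python
from collections import defaultdict

# Hypothetical input records; only the first one mirrors the slide's example.
records = [
    {"last_name": "chen", "withdraw_amount": 20, "transaction_date": "06-23-2013"},
    {"last_name": "smith", "withdraw_amount": 50, "transaction_date": "06-23-2013"},
    {"last_name": "chen", "withdraw_amount": 35, "transaction_date": "06-24-2013"},
]


def map_record(record):
    """Map step: emit one (key, value) pair per input record."""
    yield record["last_name"], record["withdraw_amount"]


def shuffle(pairs):
    """Combine all intermediate values for a given key into a list."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_values(key, values):
    """Reduce step: combine the list of values into one final value per key."""
    return key, sum(values)


def run_job(input_records):
    """Run map, shuffle, and reduce in sequence inside one process."""
    pairs = (pair for record in input_records for pair in map_record(record))
    grouped = shuffle(pairs)
    return dict(reduce_values(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Total withdrawn per last name.
    print(run_job(records))
```

In a real cluster the map calls would run on the nodes that hold the data blocks and the grouping would happen across the network; here everything runs in one process so that only the key/value flow is visible.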