10-1 ©2013 Raj Jain ~jain/cse570-13/ Washington University in St. Louis

Big Data Fundamentals
Raj Jain
Washington University in Saint Louis
Saint Louis, MO 63130
Jain@cse.wustl.edu
These slides and audio/video recordings of this class lecture are at: ~jain/cse570-13/.

Overview
1. Why Big Data?
2. Terminology
3. Key Technologies: Google File System, MapReduce, Hadoop
4. Hadoop and other database tools
5. Types of Databases
Ref: J. Hurwitz, et al., “Big Data for Dummies,” Wiley, 2013, ISBN: 978-1-118-50422-2

Big Data
Data is measured by 3 V's:
  Volume: TB
  Velocity: TB/sec. Speed of creation or change
  Variety: Type (text, audio, video, images, geospatial, ...)
Increasing processing power, storage capacity, and networking have caused data to grow in all 3 dimensions.
Volume, Location, Velocity, Churn, Variety, Veracity (accuracy, correctness, applicability)
Examples: social network data, sensor networks, Internet search, genomics, astronomy, …

Why Big Data Now?
1. Low-cost storage to store data that was discarded earlier
2. Powerful multi-core processors
3. Low latency possible by distributed computing: compute clusters and grids connected via high-speed networks
4. Virtualization
   Partition, aggregate, isolate resources in any size and dynamically change it
   Minimize latency for any scale
5. Affordable storage and computing with minimal manpower via clouds
   Possible because of advances in networking

Why Big Data Now? (Cont)
6. Better understanding of task distribution (MapReduce) and computing architecture (Hadoop)
7. Advanced analytical techniques (machine learning)
8. Managed Big Data platforms: Cloud service providers such as Amazon Web Services provide Elastic MapReduce, Simple Storage Service (S3), and HBase (a column-oriented database). Google's BigQuery and Prediction API.
9. Open-source software: OpenStack, PostgreSQL
10. March 12, 2012: Obama announced $200M for Big Data research. Distributed via NSF, NIH, DOE, DoD, DARPA, and USGS (Geological Survey)

Big Data Applications
Monitor premature infants to alert when intervention is needed
Predict machine failures in manufacturing
Prevent traffic jams, save fuel, reduce pollution

ACID Requirements
Atomicity: All or nothing. If anything fails, the entire transaction fails. Example: payment and ticketing.
Consistency: If there is an error in the input, the output will not be written to the database. The database goes from one valid state to another valid state. Valid = does not violate any defined rules.
Isolation: Multiple parallel transactions will not interfere with each other.
Durability: After the output is written to the database, it stays there forever, even after power loss, crashes, or errors.
Relational databases provide ACID, while non-relational databases aim for BASE (Basically Available, Soft state, and Eventual consistency)

Terminology
Structured Data: Data that has a pre-set format, e.g., address books, product catalogs, banking transactions
Unstructured Data: Data that has no pre-set format, e.g., movies, audio, text files, web pages, computer programs, social media
Semi-Structured Data: Unstructured data that can be put into a structure by available format descriptions
80% of data is unstructured.
Batch vs. Streaming Data
Real-Time Data: Streaming data that needs to be analyzed as it comes in. E.g., intrusion detection. Aka “Data in Motion”
Data at Rest: Non-real-time. E.g., sales analysis.
Metadata: Definitions, mappings, scheme
Ref: Michael Minelli, Big Data, Big Analytics: Emerging Business Intelligence and Analytic Trends for Today's Businesses, Wiley, 2013, ISBN: 111814760X

Relational Databases and SQL
Relational Database: Stores data in tables. A “Schema” defines the tables, the fields in the tables, and the relationships between the two. Data is stored one column per attribute.
SQL (Structured Query Language): Most commonly used language for creating, retrieving, updating, and deleting (CRUD) data in a relational database
Example: To find the gender of customers who bought XYZ:
  SELECT c.CustomerID, c.Gender, o.ProductID
  FROM CustomerTable c, OrderTable o
  WHERE c.CustomerID = o.CustomerID AND o.ProductID = 'XYZ'
Order Table: OrderNumber, CustomerID, ProductID, Quantity, UnitPrice
Customer Table: CustomerID, CustomerName, CustomerAddress, Gender, IncomeRange

Non-relational Databases
NoSQL: Not Only SQL. Any database that uses non-SQL interfaces, e.g., Python, Ruby, C, etc., for retrieval.
Typically store data in key-value pairs.
Not limited to rows or columns.
Data structure and query are specific to the data type
High-performan