Haploop&MongoDB

reow
3 ℃
2020-04-18

整理文档很辛苦，赏杯茶钱您下走！

还剩 ... 页未读，继续阅读 >>

免费阅读已结束，点击下载阅读编辑剩下 ... 页

阅读已结束，您可以下载文档离线阅读编辑

资源描述

HadoopandMongoDBIntroductionofOLTP,HadoopandMongoDBPresenter:YuboAgenda•OLTPandTraditionalRDBMS•BigDataChallengesforRDBMS–Datastorage,Retrieveandprocesschallenge–DataAnalysischallenge•MongoDB•Hadoop–HadoopDistributedFileSystem–MapReduceOLTPWhatisOLTPOnlinetransactionprocessing,orOLTP,isaclassofinformationsystemsthatfacilitateandmanagetransaction-orientedapplications,typicallyfordataentryandretrievaltransactionprocessing.Itinvolvesgatheringinputinformation,processingtheinformationandupdatingexistinginformationtoreflectthegatheredandprocessedinformation.AndgenerallytheOLTPbuildonthetraditionalRDBMSandgainagreatsuccessinthepastdecade.SoinmanycaseswhenwesayOLTP,werefertothetraditionalRDBMSandtheapplicationbuildonit.UseCases:•OnlineBanking•CRMsystem•OAsystem•SaleForceAdvantagesofRDBMS•TherearemanymatureRDBMSproducts,likeOracle,SQLServer,MySQLetc.•Havematurealgorithmonstoring,retrievingdataefficientlyonlittleandintermediatedatavolume•HavebuildinACIDpropertiestoensurethereliabilityandaccurancyofthebusinesstraction•FlexibleIndexmechanismtoimprovethedataretrievalChallengesforTraditionalRDBMSDataVolumeChallengeInrecentyearstherehasbeenanexplosionofdata,variednewersetsofsources,includingGlobalPositioningSystems(GPS),automatedtrackersandmonitoringsystems,aregeneratingalotofdata.TheselargervolumesofdatasetscangrowtohundredofTBforwhichRDBMShardtostoreandprocessSemi-structuredChallengeInparalleltothefastdatagrowth,dataisalsobecomingincreasinglysemi-structuredandsparse.ThismeansthetraditionaldatamanagementtechniquesaroundupfrontschemadefinitionandrelationalreferencesisalsobeingquestionedThequesttosolvetheproblems(store,retrieveandprocessthoselargeandsemi-structureddataefficiently)ledtotheemergenceofaclassofnewertypesofdatabaseproductswhichcalledNoSQLdatabase.isoneofthatkindofDatabase.DataAnalysis/ComputingChallengeTheexponentialgrowthofdataalsopresentchallengeforthedataanalysis.LikeforGoogle,Yahoo,Amazon,theyneedtogothroughterabytesandevenpetabytesofdatatofigureoutwhichwebsites/product/campaignwerepopular,whatkindsofadsappealedtopeople.GooglewasthefirsttopublicizeMapReduce–asystemtheyhadusedtoscaletheirdataprocessingneeds.DougCuttingsawanopportunityandledthechargetodevelopanopensourceversionofthisMapReducesystemcalledSoonafter,Yahooandothersralliedaroundtosupportthiseffort.Today,Hadoopisacorepartofthecomputinginfrastructureformanywebcompanies,suchasYahoo,Facebook,LinkedIn,andTwitter.ChallengesforTraditionalRDBMSNow,let’sdoasimpleintroductionforANDWhatisHadoopTheApacheHadoopsoftwarelibraryisaframeworkthatallowsforthedistributedprocessingoflargedatasetsacrossclustersofcomputersusingsimpleprogrammingmodels.Itisdesignedtoscaleupfromsingleserverstothousandsofmachines,eachofferinglocalcomputationandstorage.Theprojectincludesthesemodules:•HadoopCommon:ThecommonutilitiesthatsupporttheotherHadoopmodules.•HadoopDistributedFileSystem:Adistributedfilesystemthatprovideshigh-throughputaccesstoapplicationdata.•HadoopYARN:Aframeworkforjobschedulingandclusterresourcemanagement.•HadoopMapReduce:AYARN-basedsystemforparallelprocessingoflargedatasets.Requirement:GooglewanttoclassifytheSearchKeyWordsandfrequentResult:KeyWordsCountHadoop4213423C#543345T-SQL64354…………..……………..Source:AlltheweblogfilesChallenges:1.Morethan100billionlogfiles,TBdata,howtostorethedatawhichneedtoanalyzed?2.HowtoAnalyze/computebaseonsolargedata?StorageSolutions:•ScaleUp?SuperComputer?Soexpensive!NoteasytoscaleanymoreIOisstillabottleneckforlaterprocessStorageSolutions:•Let’sScaleOutClusterhavemaybethousandsofcommonPCSplitandDistributethedataamongtheclusterHowtohandledatanodecorruptissue?SplitfileintodifferentblockItmeansthedistributionisbaseonfileblockinsteadoffileReplicationautomaticallyWhereisthemetadata?That’sHadoopDistributedFileSystemAdvantage:•AutomaticallyDistributethefileblockamongthecluster•Replicationensurethedatawillnotmissifonedatanodedown•Userknownothingaboutthedistributionandreplication•EasytoscaleouttoaddmoremachineintotheclusterDisadvantage:•Costmorestoragetoensurereplication•SinglepointoffailurelimitationCalculationChallenge:•Movedatatocentralizedlocationandthencalculate?IOandnetworkwillbethebottleneckTheclientnodewhichdothecalculationwillbethebottleneckCalculationSolution:•Solet’sdistributethecalculationinsteadofdataEverydatanodeneedanagenttodothecalculationlocallyShouldhaveamasternodetoschedulethecalculationdistributionandcollecttheintermediateresultShouldeasytoscaleoutandprogrammingThatisMapReduce!!WhatisMapReduce:MapReduceisaprogrammingmodelforprocessinglargedatasetswithaparallel,distributedalgorithmonacluster.InspiredbythemapandreduceprimitivespresentinfunctionallanguagesMap:foreacheveryiteminalisttodosomeoperationandoutputanotherlistReduce:dosomekindofaggregationonalistandoutputanotherlistKeyValue1C#2SQLServer3Hadoop4SQLServer5Hadoop6HadoopKeyValueC#1SQLServer1Hadoop1SQLServer1Hadoop1Hadoop1MapKeyValueC#1SQLServer2Hadoop3ReduceCalculatetheSearchKeywordsfrequencyusingMapReduceKeyValueLog20120909.txtFilecontentKeyValueC#4321SQLServer54523H