Haploop&MongoDB

整理文档很辛苦,赏杯茶钱您下走!

免费阅读已结束,点击下载阅读编辑剩下 ...

阅读已结束,您可以下载文档离线阅读编辑

资源描述

HadoopandMongoDBIntroductionofOLTP,HadoopandMongoDBPresenter:YuboAgenda•OLTPandTraditionalRDBMS•BigDataChallengesforRDBMS–Datastorage,Retrieveandprocesschallenge–DataAnalysischallenge•MongoDB•Hadoop–HadoopDistributedFileSystem–MapReduceOLTPWhatisOLTPOnlinetransactionprocessing,orOLTP,isaclassofinformationsystemsthatfacilitateandmanagetransaction-orientedapplications,typicallyfordataentryandretrievaltransactionprocessing.Itinvolvesgatheringinputinformation,processingtheinformationandupdatingexistinginformationtoreflectthegatheredandprocessedinformation.AndgenerallytheOLTPbuildonthetraditionalRDBMSandgainagreatsuccessinthepastdecade.SoinmanycaseswhenwesayOLTP,werefertothetraditionalRDBMSandtheapplicationbuildonit.UseCases:•OnlineBanking•CRMsystem•OAsystem•SaleForceAdvantagesofRDBMS•TherearemanymatureRDBMSproducts,likeOracle,SQLServer,MySQLetc.•Havematurealgorithmonstoring,retrievingdataefficientlyonlittleandintermediatedatavolume•HavebuildinACIDpropertiestoensurethereliabilityandaccurancyofthebusinesstraction•FlexibleIndexmechanismtoimprovethedataretrievalChallengesforTraditionalRDBMSDataVolumeChallengeInrecentyearstherehasbeenanexplosionofdata,variednewersetsofsources,includingGlobalPositioningSystems(GPS),automatedtrackersandmonitoringsystems,aregeneratingalotofdata.TheselargervolumesofdatasetscangrowtohundredofTBforwhichRDBMShardtostoreandprocessSemi-structuredChallengeInparalleltothefastdatagrowth,dataisalsobecomingincreasinglysemi-structuredandsparse.ThismeansthetraditionaldatamanagementtechniquesaroundupfrontschemadefinitionandrelationalreferencesisalsobeingquestionedThequesttosolvetheproblems(store,retrieveandprocessthoselargeandsemi-structureddataefficiently)ledtotheemergenceofaclassofnewertypesofdatabaseproductswhichcalledNoSQLdatabase.isoneofthatkindofDatabase.DataAnalysis/ComputingChallengeTheexponentialgrowthofdataalsopresentchallengeforthedataanalysis.LikeforGoogle,Yahoo,Amazon,theyneedtogothroughterabytesandevenpetabytesofdatatofigureoutwhichwebsites/product/campaignwerepopular,whatkindsofadsappealedtopeople.GooglewasthefirsttopublicizeMapReduce–asystemtheyhadusedtoscaletheirdataprocessingneeds.DougCuttingsawanopportunityandledthechargetodevelopanopensourceversionofthisMapReducesystemcalledSoonafter,Yahooandothersralliedaroundtosupportthiseffort.Today,Hadoopisacorepartofthecomputinginfrastructureformanywebcompanies,suchasYahoo,Facebook,LinkedIn,andTwitter.ChallengesforTraditionalRDBMSNow,let’sdoasimpleintroductionforANDWhatisHadoopTheApacheHadoopsoftwarelibraryisaframeworkthatallowsforthedistributedprocessingoflargedatasetsacrossclustersofcomputersusingsimpleprogrammingmodels.Itisdesignedtoscaleupfromsingleserverstothousandsofmachines,eachofferinglocalcomputationandstorage.Theprojectincludesthesemodules:•HadoopCommon:ThecommonutilitiesthatsupporttheotherHadoopmodules.•HadoopDistributedFileSystem:Adistributedfilesystemthatprovideshigh-throughputaccesstoapplicationdata.•HadoopYARN:Aframeworkforjobschedulingandclusterresourcemanagement.•HadoopMapReduce:AYARN-basedsystemforparallelprocessingoflargedatasets.Requirement:GooglewanttoclassifytheSearchKeyWordsandfrequentResult:KeyWordsCountHadoop4213423C#543345T-SQL64354…………..……………..Source:AlltheweblogfilesChallenges:1.Morethan100billionlogfiles,TBdata,howtostorethedatawhichneedtoanalyzed?2.HowtoAnalyze/computebaseonsolargedata?StorageSolutions:•ScaleUp?SuperComputer?Soexpensive!NoteasytoscaleanymoreIOisstillabottleneckforlaterprocessStorageSolutions:•Let’sScaleOutClusterhavemaybethousandsofcommonPCSplitandDistributethedataamongtheclusterHowtohandledatanodecorruptissue?SplitfileintodifferentblockItmeansthedistributionisbaseonfileblockinsteadoffileReplicationautomaticallyWhereisthemetadata?That’sHadoopDistributedFileSystemAdvantage:•AutomaticallyDistributethefileblockamongthecluster•Replicationensurethedatawillnotmissifonedatanodedown•Userknownothingaboutthedistributionandreplication•EasytoscaleouttoaddmoremachineintotheclusterDisadvantage:•Costmorestoragetoensurereplication•SinglepointoffailurelimitationCalculationChallenge:•Movedatatocentralizedlocationandthencalculate?IOandnetworkwillbethebottleneckTheclientnodewhichdothecalculationwillbethebottleneckCalculationSolution:•Solet’sdistributethecalculationinsteadofdataEverydatanodeneedanagenttodothecalculationlocallyShouldhaveamasternodetoschedulethecalculationdistributionandcollecttheintermediateresultShouldeasytoscaleoutandprogrammingThatisMapReduce!!WhatisMapReduce:MapReduceisaprogrammingmodelforprocessinglargedatasetswithaparallel,distributedalgorithmonacluster.InspiredbythemapandreduceprimitivespresentinfunctionallanguagesMap:foreacheveryiteminalisttodosomeoperationandoutputanotherlistReduce:dosomekindofaggregationonalistandoutputanotherlistKeyValue1C#2SQLServer3Hadoop4SQLServer5Hadoop6HadoopKeyValueC#1SQLServer1Hadoop1SQLServer1Hadoop1Hadoop1MapKeyValueC#1SQLServer2Hadoop3ReduceCalculatetheSearchKeywordsfrequencyusingMapReduceKeyValueLog20120909.txtFilecontentKeyValueC#4321SQLServer54523H

1 / 38
下载文档,编辑使用

©2015-2020 m.777doc.com 三七文档.

备案号:鲁ICP备2024069028号-1 客服联系 QQ:2149211541

×
保存成功