Delay Scheduling Based Replication Scheme for Hadoop Distributed File System (IJITCS-V7-N4-8)


I.J. Information Technology and Computer Science, 2015, 04, 73-78
Published Online March 2015 in MECS
DOI: 10.5815/ijitcs.2015.04.08
Copyright © 2015 MECS

Delay Scheduling Based Replication Scheme for Hadoop Distributed File System

S. Suresh
Department of Computer Applications, National Institute of Technology, Tiruchirappalli-620015, India
Email: sureshtvmalai85@gmail.com

N. P. Gopalan
Department of Computer Applications, National Institute of Technology, Tiruchirappalli-620015, India
Email: npgopalan@nitt.edu

Abstract: The data generated and processed by modern computing systems is growing rapidly. MapReduce is an important programming model for large-scale data-intensive applications. Hadoop is a popular open source implementation of MapReduce and the Google File System (GFS). The scalability and fault tolerance of Hadoop make it a standard for Big Data processing. Hadoop uses the Hadoop Distributed File System (HDFS) for storing data. Data reliability and fault tolerance are achieved through replication in HDFS. In this paper, a new technique called Delay Scheduling Based Replication Algorithm (DSBRA) is proposed to identify and replicate (dereplicate) the popular (unpopular) files/blocks in HDFS based on the information collected from the scheduler. Experimental results show that the proposed method achieves 13% and 7% improvements in response time and locality, respectively, over existing algorithms.

Index Terms: Dynamic Replication, HDFS, Delay Scheduling, Hadoop MapReduce

I. INTRODUCTION

As data grows rapidly, the complexity of processing it becomes a challenge. Applications need to process very large amounts of data of different types in a short time to achieve a better user experience. To provide abstracted data services to application programs, several solutions have been proposed, ranging from traditional databases to current Big Data management systems. The performance of an application depends mainly on these backend data management systems. To enable distributed processing with high availability, fault tolerance and load balancing, replication is the evergreen solution. On the other hand, maintaining consistency among the replicas in distributed environments is a time-consuming process, which in turn affects availability and performance. Most of the data generated and processed by current Big Data applications follows the 'write once and read many' pattern, which eliminates the complexity of maintaining consistency among replicas.

Recently emerging distributed file systems such as the Google File System (GFS) [1] and the Hadoop Distributed File System (HDFS) [2] use replication mechanisms to enable fault-tolerant, high-performance parallel processing. Blindly replicating all files/blocks in many places increases availability and fault tolerance, but it also increases the memory requirement proportionally. Finding hotspots and replicating only them may yield better performance with less demand on memory. Determining the optimal number of replicas is challenging and has long been an active research problem, as it must account for application load, data size, quality of service, and so on.

Current distributed computing environments such as grid computing and cloud computing are designed to process petabytes of data in a massively parallel style. As processing speed increases rapidly with the advent of multicore processors, the underlying file systems determine the performance of computing environments. To support stream-like data access, modern file systems (Bigtable [3], Cassandra [4]) use a very simple data model supporting a limited number of operations. Hadoop [5] is an emerging open source platform for parallel processing of large-scale data-intensive applications, supported by HDFS.

In this paper, a new technique called Delay Scheduling Based Replication Algorithm (DSBRA) is proposed to identify and replicate the popular files/blocks (hotspots) in HDFS using the information collected from the Delay Scheduling technique. The performance of the proposed algorithm is evaluated through exhaustive experiments. It is observed that it excels in terms of response time, locality and fairness.

The paper is organized as follows: Section 2 gives background on Hadoop and HDFS. Section 3 is dedicated to related work. Section 4 elaborates the proposed replication algorithm. Section 5 describes the simulation environment and discusses the simulation results. Section 6 concludes the paper and highlights future research directions.

II. HADOOP AND HDFS BACKGROUND

Hadoop is a popular parallel processing framework for cloud environments. It is an open source implementation of MapReduce [6] and GFS [1]. Due to its simplicity and scalability, it has become a de facto standard for data-intensive applications. Hadoop provides an abstracted, distributed, fault-tolerant environment for Big Data processing. The jobs submitted to the system are divided into small tasks and executed in parallel on a cluster of commodity hardware machines. Hadoop adopts the master-slave architecture. Users need to write only two functions, map and reduce, for their applications. All other operations, such as synchronization, parallelization and failure handling, are handled by the framework.

Hadoop contains two major components: (i) MapReduce, a runtime environment for parallel processing, and (ii) HDFS, a distributed file system for storing input and output files. MapReduce has two major components: JobTracker and TaskTracker. JobTracker is the master component that keeps track of all
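
The division of labour described in this section, where the user supplies only map and reduce and the framework does the rest, can be illustrated with a minimal single-process sketch in Python. This is not Hadoop's API: the names map_fn, reduce_fn and run_job are hypothetical, and run_job merely imitates, in memory and without any parallelism or fault handling, the grouping step that the framework performs between the two phases.

```python
from collections import defaultdict
from typing import Iterable, Iterator


def map_fn(_offset: int, line: str) -> Iterator[tuple[str, int]]:
    """User-written map: emit (word, 1) for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reduce_fn(word: str, counts: Iterable[int]) -> tuple[str, int]:
    """User-written reduce: sum all counts emitted for one word."""
    return word, sum(counts)


def run_job(lines: list[str]) -> dict[str, int]:
    """Hypothetical stand-in for the framework (assumption, not Hadoop code).

    It calls map_fn on every record, groups the intermediate pairs by key
    (the "shuffle"), and then calls reduce_fn once per key.
    """
    groups: dict[str, list[int]] = defaultdict(list)
    for offset, line in enumerate(lines):
        for key, value in map_fn(offset, line):
            groups[key].append(value)
    return dict(reduce_fn(key, values) for key, values in groups.items())


if __name__ == "__main__":
    print(run_job(["hadoop stores data in hdfs", "hdfs replicates data"]))
```

In a real cluster the map calls would run as separate tasks on the machines that hold the input blocks, which is why the placement and number of block replicas matter for locality.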
