Delay Scheduling Based Replication Scheme for Hadoop Distributed File System (IJITCS-V7-N4-8)




I.J. Information Technology and Computer Science, 2015, 04, 73-78
Published Online March 2015 in MECS
DOI: 10.5815/ijitcs.2015.04.08
Copyright © 2015 MECS

Delay Scheduling Based Replication Scheme for Hadoop Distributed File System

S. Suresh
Department of Computer Applications, National Institute of Technology, Tiruchirappalli-620015, India
Email: sureshtvmalai85@gmail.com

N. P. Gopalan
Department of Computer Applications, National Institute of Technology, Tiruchirappalli-620015, India
Email: npgopalan@nitt.edu

Abstract: The data generated and processed by modern computing systems burgeon rapidly. MapReduce is an important programming model for large-scale data-intensive applications. Hadoop is a popular open source implementation of MapReduce and Google File System (GFS). The scalability and fault tolerance of Hadoop make it a standard for Big Data processing. Hadoop uses the Hadoop Distributed File System (HDFS) for storing data. Data reliability and fault tolerance are achieved through replication in HDFS. In this paper, a new technique called Delay Scheduling Based Replication Algorithm (DSBRA) is proposed to identify and replicate (dereplicate) the popular (unpopular) files/blocks in HDFS based on the information collected from the scheduler. Experimental results show that the proposed method achieves 13% and 7% improvements in response time and locality, respectively, over existing algorithms.

Index Terms: Dynamic Replication, HDFS, Delay Scheduling, Hadoop MapReduce

I. INTRODUCTION

As data grows rapidly, the complexity of processing becomes a challenge. Applications need to process very large amounts of data of different types in a short time to achieve a better user experience. To provide abstracted data services to application programs, several solutions have been proposed, ranging from traditional databases to current Big Data management systems. The performance of an application depends mainly on these backend data management systems. To enable distributed processing with high availability, fault tolerance and load balancing, replication is the evergreen solution. On the other hand, maintaining consistency among the replicas in distributed environments is a time-consuming process, which in turn affects availability and performance. Most of the data generated and processed by current Big Data applications follows the 'write once and read many' pattern, which eliminates the complexity of maintaining consistency among replicas. Recently emerging distributed file systems such as Google File System (GFS) [1] and Hadoop Distributed File System (HDFS) [2] use replication mechanisms to enable fault-tolerant, high-performance parallel processing.

Blindly replicating all files/blocks at many places increases availability and fault tolerance, but increases the memory requirement proportionally. Finding hot spots and replicating them may yield better performance with less demand on memory. Determining the optimal number of replicas is a challenging problem and has long been an active research topic, as it addresses application load, data size, quality of service, etc.

Current distributed computing environments such as grid computing and cloud computing are designed to process petabytes of data in a massively parallel style. As processing speed increases rapidly with the advent of multicore processors, the underlying file systems determine the performance of computing environments. To support stream-like data access, modern file systems (Bigtable [3], Cassandra [4]) use very simple data models supporting a limited number of operations. Some of the popular distributed file systems and Hadoop [5] is an emerging open source platform for parallel data processing for large-scale data-intensive applications supported by HDFS.

In this paper, a new technique called Delay Scheduling Based Replication Algorithm (DSBRA) is proposed to identify and replicate the popular files/blocks (hot spots) in HDFS using the information collected from the Delay Scheduling technique. The performance of the proposed algorithm is evaluated by exhaustive experiments. It is observed that it excels in terms of response time, locality and fairness.

The paper is organized as follows: Section 2 gives background on Hadoop and HDFS. Section 3 is dedicated to related work. Section 4 elaborates the proposed replication algorithm. Section 5 describes the simulation environment and discusses the simulation results. Section 6 concludes the paper and highlights future research directions.

II. HADOOP AND HDFS BACKGROUND

Hadoop is a popular parallel processing framework for cloud environments. It is an open source implementation of MapReduce [6] and GFS [1]. Due to its simplicity and scalability, it has become a de facto standard for data-intensive applications. Hadoop provides an abstracted, distributed, fault-tolerant environment for Big Data processing. The jobs submitted to the system are divided into small tasks and executed in parallel on a cluster of commodity hardware machines. Hadoop adopts the master-slave architecture. Users need to write only two functions, map and reduce, for their applications. All other operations, such as synchronization, parallelization and failure handling, are handled by the framework. Hadoop contains two major components: (i) MapReduce, a runtime environment for parallel processing, and (ii) HDFS, a distributed file system for storing input and output files. MapReduce has two major components: Jobtracker and Tasktracker. Jobtracker is the master component to keep track of all