I.J. Information Technology and Computer Science, 2015, 04, 73-78
Published Online March 2015 in MECS
DOI: 10.5815/ijitcs.2015.04.08
Copyright © 2015 MECS

Delay Scheduling Based Replication Scheme for Hadoop Distributed File System

S. Suresh
Department of Computer Applications, National Institute of Technology, Tiruchirappalli-620015, India
Email: sureshtvmalai85@gmail.com

N.P. Gopalan
Department of Computer Applications, National Institute of Technology, Tiruchirappalli-620015, India
Email: npgopalan@nitt.edu

Abstract: The data generated and processed by modern computing systems burgeon rapidly. MapReduce is an important programming model for large-scale data-intensive applications. Hadoop is a popular open-source implementation of MapReduce and the Google File System (GFS). The scalability and fault-tolerance features of Hadoop make it a standard for Big Data processing. Hadoop uses the Hadoop Distributed File System (HDFS) for storing data. Data reliability and fault tolerance are achieved through replication in HDFS. In this paper, a new technique called Delay Scheduling Based Replication Algorithm (DSBRA) is proposed to identify and replicate (dereplicate) the popular (unpopular) files/blocks in HDFS based on the information collected from the scheduler. Experimental results show that the proposed method achieves 13% and 7% improvements in response time and locality, respectively, over existing algorithms.

Index Terms: Dynamic Replication, HDFS, Delay Scheduling, Hadoop MapReduce

I. INTRODUCTION

As data grows rapidly, the complexity of processing it becomes a challenge. Applications need to process very large amounts of data of different types in a short time to achieve a better user experience. To provide abstracted data services to application programs, several solutions have been proposed, ranging from traditional databases to current Big Data management systems. The performance of an application depends mainly on these backend data management systems. To enable distributed processing with high availability, fault tolerance and load balancing, replication is the evergreen solution. On the other hand, maintaining consistency among the replicas in distributed environments is a time-consuming process
which in turn affects availability and performance. Most of the data generated and processed by current Big Data applications follows the ‘write once and read many’ pattern, which eliminates the complexity of maintaining consistency among replicas. Recently emerging distributed file systems such as the Google File System (GFS) [1] and the Hadoop Distributed File System (HDFS) [2] use replication mechanisms to enable fault-tolerant, high-performance parallel processing.

Blindly replicating all files/blocks at many places increases availability and fault tolerance, but it also increases the memory requirement proportionally. Finding hotspots and replicating only them may yield better performance with less demand on memory. Determining the optimal number of replicas is a challenging problem and has long been an active research topic, as it must account for application load, data size, quality of service, etc.

Current distributed computing environments such as grid computing and cloud computing are designed to process petabytes of data in a massively parallel style. As processing speed increases rapidly with the advent of multicore processors, the underlying file systems determine the performance of computing environments. To support stream-like data access, modern file systems (Bigtable [3], Cassandra [4]) use a very simple data model supporting a limited number of operations. Hadoop [5] is an emerging open-source platform for parallel data processing in large-scale data-intensive applications, supported by HDFS, one of the popular distributed file systems.

In this paper, a new technique called Delay Scheduling Based Replication Algorithm (DSBRA) is proposed to identify and replicate the popular files/blocks (hotspots) in HDFS using the information collected from the Delay Scheduling technique. The performance of the proposed algorithm is evaluated by exhaustive experiments. It is observed that it excels in terms of response time, locality and fairness.

The paper is organized as follows: Section 2 gives background on Hadoop and HDFS. Section 3 is dedicated to related work. Section 4 elaborates the proposed replication algorithm. Section 5 describes the simulation environment and discusses the simulation results. Section 6 concludes the paper and highlights future research directions.

II. HADOOP AND HDFS BACKGROUND

Hadoop is a popular parallel processing framework for cloud environments. It is an open-source implementation of MapReduce [6] and GFS [1]. Due to its simplicity and scalability, it has become a de facto standard for data-intensive applications. Hadoop provides an abstracted, distributed, fault-tolerant environment for Big Data processing. The jobs submitted to the system are divided into small tasks and executed in parallel on a cluster of commodity hardware machines. Hadoop adopts the master-slave architecture. Users need to write only two functions, map and reduce, for their applications. All other operations such as synchronization, parallelization and failure handling are handled by the framework.

Hadoop contains two major components: (i) MapReduce, a runtime environment for parallel processing, and (ii) HDFS, a distributed file system for storing input and output files. MapReduce has two major components: JobTracker and TaskTracker. JobTracker is the master component to keep track of all