Hybrid Parallel Programming on GPU Clusters

Abstract—Nowadays, NVIDIA's CUDA is a general-purpose, scalable parallel programming model for writing highly parallel applications. It provides several key abstractions: a hierarchy of thread blocks, shared memory, and barrier synchronization. This model has proven quite successful at programming multithreaded many-core GPUs and scales transparently to hundreds of cores; scientists throughout industry and academia are already using CUDA to achieve dramatic speedups on production and research codes. In this paper, we propose a hybrid parallel programming approach combining CUDA and MPI, which partitions loop iterations according to the number of C1060 GPU nodes in a GPU cluster consisting of one C1060 and one S1070. Loop iterations assigned to one MPI process are processed in parallel by CUDA, run by the processor cores in the same computational node.

Keywords: CUDA, GPU, MPI, OpenMP, hybrid, parallel programming

I. INTRODUCTION

Nowadays, NVIDIA's CUDA [1, 16] is a general-purpose, scalable parallel programming model for writing highly parallel applications. It provides several key abstractions: a hierarchy of thread blocks, shared memory, and barrier synchronization. This model has proven quite successful at programming multithreaded many-core GPUs and scales transparently to hundreds of cores; scientists throughout industry and academia are already using CUDA [1, 16] to achieve dramatic speedups on production and research codes. NVIDIA builds its CUDA chips with hundreds of cores, and in this work we use the parallel computing hardware that NVIDIA provides. This paper proposes a solution that not only simplifies the use of hardware acceleration in conventional general-purpose applications, but also keeps the application code portable.

In this paper, we propose a parallel programming approach using hybrid CUDA, OpenMP and MPI [3] programming, which partitions loop iterations according to the performance weighting of the multi-core [4] nodes in a cluster. Because iterations assigned to one MPI process are processed in parallel by OpenMP threads run by the processor cores in the same computational node, the number of loop iterations allocated to one computational node at each scheduling step depends on the number of processor cores in that node. We propose a general approach that uses performance functions to estimate performance weights for each node. To verify the proposed approach, a heterogeneous cluster and a homogeneous cluster were built. In our implementation, the master node also participates in computation, whereas in previous schemes only slave nodes do computation work. Empirical results show that in both heterogeneous and homogeneous cluster environments, the proposed approach improves performance over all previous schemes.

The rest of this paper is organized as follows. In Section 2, we introduce several typical and well-known self-scheduling schemes, and a famous benchmark used to analyze computer system performance. In Section 3, we define our model and describe our approach. Our system configuration is then specified in Section 4, and experimental results for three types of application program are presented. Concluding remarks and future work are given in Section 5.

II. BACKGROUND REVIEW

A. History of GPU and CUDA

In the past, we had to use more than one computer, with multiple CPUs, for parallel computing. As the history of chips shows, early workloads did not require much computation; then, gradually, the demands of games and graphics, and eventually of 3D, led to the appearance of 3D accelerator cards. Display processing gradually moved to separate chips, which eventually grew into CPU-like processors in their own right: the GPU.

We know that GPU computing can give us the answers we want, but why do we choose to use the GPU? Consider the current comparison between CPUs and GPUs. First, a CPU today has at most eight cores, whereas a GPU has grown to 260 cores. From the core count alone, the GPU clearly suits highly parallel programs: although each GPU core runs at a relatively low frequency, we believe that for a large amount of parallel computation it need not be weaker than a single-issue CPU. Next, comparing the GPU's access to its on-board memory with the CPU's access to main memory, we find that GPU memory access is about ten times faster, a gap of roughly 90 GB/s. This is quite an alarming difference, and it means that computations which must access large amounts of data can benefit greatly from the GPU.

The CPU uses advanced flow control, such as branch prediction and delayed branching, together with a large cache to reduce memory access latency; the GPU's cache and flow control are comparatively small and simple. The GPU's method is instead to use a large number of compute threads to cover up memory latency. Suppose a GPU memory access takes 5 seconds: if 100 threads access memory simultaneously, the total time is still 5 seconds. Suppose instead a CPU memory access takes 0.1 seconds: if 100 threads access memory one after another, the time is 10 seconds. Therefore, the GPU's parallel processing can be used to hide latency even when its individual memory accesses are slower than the CPU's. The GPU is designed such that more transistors are devoted to data processing rather than data caching and flow control, as schematically illustrated by Figure 1.

Therefore, we exploit the GPU's advantage in arithmetic logic, using the many cores NVIDIA makes available to help us with large amounts of computation: we write programs for those many cores, and NVIDIA Corporation provides the API for carrying out large numbers of parallel operations. Must we use the form of GPU computing provided by NVIDIA Corporation? Not really. We can use NVIDIA's CUDA, ATI's CTM, or Apple-initiated OpenCL (Open Computing Language), which is the dev