An efficient parallel algorithm for O(N^2) direct

整理文档很辛苦,赏杯茶钱您下走!

免费阅读已结束,点击下载阅读编辑剩下 ...

阅读已结束,您可以下载文档离线阅读编辑

资源描述

arXiv:astro-ph/0108412v127Aug2001AnefficientparallelalgorithmforO(N2)directsummationmethodanditsvariationsondistributed-memoryparallelmachinesJunichiroMakinoDepartmentofAstronomy,SchoolofScience,UniversityofTokyo,7-3-1Hongo,Bunkyo-ku,Tokyo113-0033,Japan.AbstractWepresentanovel,highlyefficientalgorithmtoparallelizeO(N2)directsumma-tionmethodforN-bodyproblemswithindividualtimestepsondistributed-memoryparallelmachinessuchasBeowulfclusters.Previouslyknownalgorithms,inwhichallprocessorshavecompletecopiesoftheN-bodysystem,hastheseriousproblemthatthecommunication-computationratioincreasesasweincreasethenumberofprocessors,sincethecommunicationcostisindependentofthenumberofproces-sors.Inthenewalgorithm,pprocessorsareorganizedasa√p×√ptwo-dimensionalarray.EachprocessorhasN/√pparticles,butthedataaredistributedinsuchawaythatcompletesystemispresentedifwelookatanyroworcolumnconsistingof√pprocessors.Inthisalgorithm,thecommunicationcostscalesasN/√p,whilethecalculationcostscalesasN2/p.Thus,wecanuseamuchlargernumberofprocessorswithoutlosingefficiencycomparedtowhatwaspracticalwithpreviouslyknownalgorithms.PACS:02.60.Cb;95.10.Ce;98.10.+zKeywords:Celestialmechanics,stellardynamics;Methods:numerical1IntroductionInthispaperwepresentanovelalgorithmtoparallelizethedirectsum-mationmethodforastrophysicalN-bodyproblems,eitherwithandwith-outtheindividualtimestepalgorithm.TheproposedalgorithmworksalsowiththeAhmad-Cohenneighborscheme(AhmadandCohen1973),orwithGRAPEspecial-purposecomputersforN-bodyproblems(Sugimotoetal.PreprintsubmittedtoElsevierPreprint1February20081990,MakinoandTaiji1998).Ouralgorithmisdesignedtoofferbetterscal-ingofthecommunication-computationratioondistributed-memorymulti-computerssuchasBeowulfPCclusters(Sterlingetal.1999)comparedtotraditionalalgorithms.Thispaperwillbeorganizedasfollows.Insection2wedescribethetraditionalalgorithmstoparallelizedirectsummationmethodondistributed-memoryparallelcomputers,andthescalingofcommunicationtimeandcomputationaltimeasfunctionsofthenumberofparticlesNandnumberofprocessorp.ItwillbeshownthatforpreviouslyknownalgorithmsthecalculationtimescalesasO(N2/p),whilecommunicationtimeisO(N+logp).Thus,evenwithinfinitenumberofprocessorsthetotaltimepertimestepisstillO(N),andwecannotusemorethanO(N)processorswithoutlosingefficiency.O(N)soundslarge,butthecoefficientisrathersmall.Thus,itwasnotpracticaltousemorethan10processorsforsystemswithafewthousandparticles,ontypicalBeowulfclusters.Insection3wedescribethebasicideaofournewalgorithm.ItwillbeshownthatinthisalgorithmthecommunicationtimeisO(N/√p+logp).Thus,wecanuseO(N2)processorswithoutlosingefficiency.Thisimpliesalargegaininspeedforrelativelysmallnumberofparticlessuchasafewthousands.Wealsobrieflydiscusstherelationbetweenournewalgorithmandthehyper-systolicalgorithm(Lippertetal.1998).Inshort,thoughtheideasbehindthetwoalgorithmsareverydifferent,theactualcommunicationpatternsarequitesimilar,andthereforetheperformanceisalsosimilarforthetwoalgorithms.OuralgorithmshowsabetterscalingandalsoismucheasiertoextendtoindividualtimestepandAhmad-Cohenschemes.Insection4wediscussthecombinationofourproposedalgorithmandindivid-ualtimestepalgorithmandtheAhmad-Cohenscheme.Insection5,wepresentexamplesofestimatedperformance.Insection6wediscussthecombinationofouralgorithmwithGRAPEhardwares.Insection7wesumup.2TraditionalapproachesTheparallelizationofthedirectmethodhasbeenregardedsimpleandstraight-forward[see,forexample,(Foxetal.1994)].However,itisonlysoifNpandifweusesimpleshared-timestepmethod.Inthissection,wefirstdiscussthecommunication-calculationratioofpreviouslyknownalgorithmsforthesharedtimestepmethod,andthenthoseforindividualtimestepalgorithmwithandwithouttheAhmad-Cohenscheme.22.1SharedtimestepMostofthetextbooksandpapersdiscusstheringalgorithm.SupposewecalculatetheforceonNparticlesusingpprocessors.Weconnecttheprocessorsinaonedimensionalring,anddistributeNparticlessothateachprocessorhasN/pparticles(figure1).Hereandhereafter,weassumethatNisintegermultipleofp,tosimplifythediscussion.TheringalgorithmcalculatestheforcesonNparticlesinthefollowingsteps.(1)EachprocessorcalculatestheinteractionsbetweenN/pparticleswithinit.CalculationcostofthisstepisCf(N/p)2/2,whereCfisthetimetocalculateinteractionbetweenonepairofparticles.(2)Eachprocessorsendsallofitsparticlestothesamedirection.Herewecallthatdirection“right”.Thusallprocessorssendsitsparticlestotheirrightneighbors.ThecommunicationcostisCcN/p+Cs,whereCcisthetimetosendoneparticletotheneighboringprocessorandCsisthestartuptimeforcommunication.(3)Eachprocessoraccumulatestheforcefromparticlestheyreceivedtoitsownparticles.CalculationcostisCf(N/p)2.Ifforcefromallparticlesisaccumulated,gotostep5.(4)Eachprocessorthensendstheparticlesitreceivedintheprevioussteptoitsrightneighbor,andgoesbacktopreviousstep.(5)Forcecalculationcompleted.ThetimeforactualcalculationisgivenbyTf,ring=CfN2/p,(1)andthecommunicationtimeTc,ring=CcN+Csp.(2)ThetotaltimeperonetimestepofthisalgorithmisTring=Tf,ring+Tc,ring=CfN2/p+CcN+Csp.(3)Here,weneglectsmallcorrectionfactorsoforderO(1/p).For

1 / 20
下载文档,编辑使用

©2015-2020 m.777doc.com 三七文档.

备案号:鲁ICP备2024069028号-1 客服联系 QQ:2149211541

×
保存成功