AData–ParallelImplementationofO(N)HierarchicalN–bodyMethodsYuHuandS.LennartJohnsson*DivisionofAppliedSciences,HarvardUniversityCambridge,Massachusetts02138*DepartmentofComputerSciences,UniversityofHoustonHouston,Texas77204-3475Email:hu@das.harvard.edu,johnsson@cs.uh.eduAbstractTheO(N)hierarchicalN–bodyalgorithmsandMassivelyParallelProcessorsallowparticlesystemsof100millionparticlesormoretobesimulatedinacceptabletime.Wepresentadata–parallelimplementationofAnderson’smethodanddemonstratebothefficiencyandscalabilityoftheimplementationontheConnectionMachineCM–5/5Esystems.Thecommunicationtimeforlargeparticlesystemsamountstoabout10–25%,andtheoverallefficiencyisabout35%.Theevaluationofthepotentialfieldofasystemof100millionparticlestakes3minutesand15minutesona256nodeCM–5E,givingexpectedfourandsevendigitsofaccuracy,respectively.Thespeedofthecodescaleslinearlywiththenumberofprocessorsandnumberofparticles.Keywords:N–bodysimulation,multipolealgorithms,hierarchicalN–bodymethods,data–parallelpro-gramming,massivelyparallelprocessors.1IntroductionTheproblemofcomputingtheforce(orthepotential)exertedononeanotherbyasystemofelectricalcharges(ormassesinteractinggravitationally)hasbeenwidelystudiedandhasapplicationsinareassuchascelestialmechanics,plasmaphysicsandmoleculardynamics.AlgorithmsthatcomputetheforcesforasystemofNparticlesinO(N)operationshavebeendevised[13,14,12,6,36,1].Theconstantofproportionalityisintherange1,000–10,000.Earlierhierarchicalalgorithms,suchasthoseproposedbyAppel[2]andbyBarnesandHut[4],werebelievedtohaveanarithmeticcomplexityofO(NlogN)anddidnothavearigorouserrorbound,althoughAppel’sversionwaslaterprovedtobeofO(N)[10].ThetwomethodswerelaterextendedtobeofO(N)withanalyticalerrorboundsandbycombiningwiththeideaofmultipoleexpansions[11,35].ParallelimplementationoftheO(NlogN)orO(N)hierarchicalN–bodymethodshavebeenofgreatinterestsasMassivelyParallelProcessors(MPPs)offertheprimarystorageandcomputepowerforsimulationofsystemswithseveralhundredmillionparticlesbyusingthesefastalgorithms.Table1givesasummaryofsequentialandparallelimplementationsofhierarchicalN–bodymethods.Incomparingperformance1AuthorMethod&errorProg.PerformanceModelNEffi.Cycles/PSystemparticleadaptiveO(NlogN)methodsSalmon[27]BH,quadrupoleMPNcubeWarren–Salmon[33]BH,quadrupoleMP8.78M26%180K512IntelDeltaWarren–Salmon[34]BH, 1=10 3MP8.78M28%266K512IntelDeltaWarren–Salmon[35]BH, 1=10 2MP2M111K256CM–5ELiu–Bhatt[24]BH,quadrupoleMP10M30%97K256CM–5Singhetal.[29]BHSMDASH,KSR–1nonadaptiveO(N)methodsLeathrum–Board[23]GR,p=8100K65%250K1RS/6000–360GR,p=8SM1M20%32KSR-1Elliott–Board[9]GR,FFT,p=8100K73%200K1RS/6000–360GR,FFT,p=8SM1M14%32KSR–1Schmidt–Lee[28]GR,p=840K39%312K1CrayYMP8/864GR,p=1640K22%1034K1CrayYMP8/864Zhao–Johnsson[37]Zhao,p=3DP16K12%560K8KCM–2Hu–JohnssonAnderson,D=5DP100M27%37K256CM–5E(thiswork)Anderson,D=14DP100M35%183K256CM–5EadaptiveO(N)methodsSinghetal.[30]GR,2–D,adapSMDASH,KSR–1Nylandetal.[26]GR,3–D,adapDPTable1:AsummaryofsequentialandparallelimplementationsofhierarchicalN–bodymethods.Allperformancenumbersareforuniformparticledistributions.Methodsusedareforthreedimensions,unlessotherwisestated. 1istheerrorboundperpartialaccelerationrelativetothemeanaccelerationofthesystem.Emptyentriesimplyunavailabledata.MP,SM,andDPareshortformessage–passing,shared–memory,anddata–parallel,respectively.resultsfromimplementationsofdifferentN–bodymethods,oftenwithdifferentparameters,ondifferentplatformsrunningatdifferentclockspeed,weproposetheuseofefficiencyoffloating–pointoperationsandcycles–per–particleasthestandardmeasure.Efficiencyaloneisinsufficientincomparingdifferalgorithmsthatrequiredifferentnumberofoperations.Cycles–per–particleincorporatesmachinesize,clockrate,arithmeticcomplexitiesofdifferentmethods,butitdoesnotdistinguishnodalarchitecture,e.g.,superscalararchitecturescanperformmultipleoperationspercycle.BarnesandHut’sO(NlogN)methodhasbeenimplementedusingthemessagepassingprogrammingparadigmbySalmonandWarren[27,33,34]ontheIntelTouchstoneDeltaandbyLiuandBhatt[24]ontheCM–5.Bothgroupsusedassemblylanguagefortimecriticalkernels.SalmonandWarrenachievedefficienciesintherange24–28%,whileLiuandBhattachieved30%efficiency.Recently,WarrenandSalmon[35]extendedtheircodetoincorporatemultipoleandlocalexpansionsandmadeitportabletoavarietyofparallelmachines.FornonadaptiveO(N)methods,GreengardandGropp[12]implementedGreengard–Rokhlin’smethodin2–Donashared–memorymachine(theEncoreMultimax320),butdataisnotsufficientlycompleteforinclusioninTable1.ZhaoandJohnsson[37]developedadata–parallelimplementationontheCM–2ofZhao’smethod,andachievedanefficiencyof12%forexpansionsinCartesiancoordinates,whichyieldsmorecostlymultipoleexpansioncalculationsthanpolarcoordinates.LeathrumandBoard[23,5]and2ElliottandBoard[9]achievedefficienciesintherange14–20%inimplementingFastFourierTransformacceleratedGreengard–Rokhlin’smethod[15]ontheKSR–1.SchmidtandLee[28]vectorizedthismethodfortheCrayY–MPandachievedanefficiencyof39%onasingleprocessor.Forcomparison,wehavealsoincludedtheresultsreportedinthispaper.Littleprogresshasbee