TobepublishedinAmericanMathematicalSociety’sProc.intheDIMACSSeriesonDiscreteMathematicsandTheoreticalComputerSc.,Apr.1995.NewAnticipatoryLoadBalancingStrategiesforParallelA*Algorithms NiharR.MahapatraandShantanuDuttfmahapatra,duttg@ee.umn.eduDepartmentofElectricalEngineering,UniversityofMinnesota,Minneapolis,MN55455AbstractInthispaper,wedeveloploadbalancingstrategiesforscalablehigh-performanceparallelA*algorithmssuitablefordistributed-memoryma-chines.InparallelA*search,ine cienciessuchasprocessorstarvationandsearchofnon-essentialspaces(searchspacesnotexploredbythesequentialalgorithm)growwiththenumberofprocessorsPused,thusrestrictingitsscalability.Toalleviatethise ect,weproposeanovelpar-allelstartupphaseandane cientdynamicloadbalancingstrategycalledthequalityequalizing(QE)strategy.Ournewparallelstartupschemeexecutesoptimallyin (logP)timeand,inaddition,achievesgoodini-tialloadbalance.TheQEstrategyemploysnear-neighborquantitativeandqualitativeloadbalancingschemestoachieveloadbalance.Theseschemesutilizeanticipatorymechanismstodetectandcorrectloadim-balancebeforeitsactualoccurrence;suchmechanismsareparticularlyusefulatlowerworkdensities(theratiooftheproblemsizetoP)andforlowergranularityapplications.TheQEstrategypossessescertainuniqueloadbalancingpropertiesthatenableittosigni cantlyreducestarvationandnon-essentialwork,andthatmakeitsperformancero-bustacrossapplicationswithdi erentcostdistributionsforsearch-spacenodes.Consequently,weobtainahighlyscalableparallelA*algorithmwithanalmost-linearspeedup.ThestartupandloadbalancingschemeswereemployedinparallelA*algorithmstosolvetheTravelingSalesmanProblemonannCUBE2hypercubemulticomputer.TheQEstrategyyieldsaveragespeedupimprovementsofabout20-185%and15-120%atlowandintermediateworkdensities,respectively,overthreewell-knownloadbalancingmethods|theround-robin(RR),therandomcommu-nication(RC)andtheneighborhoodaveraging(NA)strategies.Theaveragespeedupobservedon1024processorsisabout985,representingaveryhighe ciencyof0:96.Wealsotestedthee ectofincludingananticipatoryqualitativeloadbalancingschemeintheQEstrategyandfoundthatitreducestheaverageexecutiontimeby3:32%and8:77%on ThisresearchwasfundedinpartbyaGrant-in-AidfromtheUniversityofMinnesotaandinpartbyNSFgrantMIP-9210049.SandiaNationalLabsprovidedaccesstotheir1024-processornCUBE2parallelcomputer.1256and512processors,respectively,atlowerworkdensities.Finally,wepresentanalyticalandempiricalresultsonthescalabilityofparallelA*algorithmsintermsoftheisoe ciencymetric.Ouranalyticalre-sultsinclude(1)a (P:logP)lowerboundontheisoe ciencyfunctionofanyparallelA*algorithm,and(2)ageneralexpressionfortheupperboundontheisoe ciencyfunctionofourparallelA*algorithmusingtheQEstrategyonanytopology|forthehypercubeand2-Dmeshar-chitecturestheupperboundsontheisoe ciencyfunctionarefoundtobe (P:log2P)and (P:pP),respectively.Experimentalresultsvalidateouranalysis,andalsoshowthatparallelA*searchusingtheQEloadbalancingstrategyhasbetterscalabilitythanwhenusingtheRR,RCorNAstrategies.1IntroductionTheA*algorithm[21]isawell-known,generalizedbranch-and-boundsearchprocedure,widelyusedinthesolutionofmanycomputationallydemandingcombinatorialoptimizationproblems(COPs)[4,23].Itsoperation,asde-tailedlater,canbeviewedessentiallyasabest- rstsearchofastatespacegraph.Parallelizationofbranch-and-boundmethodsprovidesane ectivemeanstomeetthecomputationalneedsofmanypracticalsearchproblems[3,8].Theaimofourworkistodevelopscalablehigh-performanceparallelA*algorithmsforsolvingCOPsondistributed-memorymachines.However,parallelizationofA*introducesanumberofine ciencies.(1)First,thetimerequiredinitiallytosplitthewholesearchspaceamongallPprocessors,i.e.,thestartupphasetime,canbeasigni cantfractionofthetotalexecutiontimeatlowworkdensities(theratiooftheproblemsizetoP).Thereforethestartupphaseneedstobeexecutede ciently.Also,itisdesirabletohaveagoodinitialloadbalancetoreduceidlingatthebeginningofparallelA*.(2)InsearchalgorithmssuchasA*,theamountofworkcorrespondingtodi er-entsearchsubspacesisverydi culttoestimateandcanvarywidely.Hencesomeformofdynamic,quantitativeloadbalancingiscrucialtoreducingtheidlingthatwouldotherwiseoccur.(3)Finally,processorsperformingbest- rstsearchoftheirlocalsubspacesinparallelA*maysearchspacesthatasequentialA*algorithmwillnotexplore.Thiscanleadtosubstantial\non-essentialwork.Toaddressthisproblem,itisimperativetoperformdynamicqualitativeloadbalancingsothatatalltimesdi erentprocessorssearchspacesthatarecomparablypromising.Inadditiontotheaboveine ciencies,duplicatedworkamongprocessorscanoccurwhenthesearchspaceisagraph.1Thisproblemcanbetackledbyusinge cientduplicatepruningtechniques[9,17,18].However,sincethefocusofthispaperisonloadbalancingstrategies,wewillrestrictour1Weusetheterm\graphmainlytodenotegraphsthatarenottrees,butsometimesweuseitmoregenerallytomeantreesaswell|thiswillbeclearfromthecontext.2attentiontotreesearchspacessothatperformancecomparisonofparallelA*algorithmsemployingdi erentloadbalancingmethodsre ectstheef-fectivenessofthesealgorithmsinachievingloadbalanc