Hybrid MPI+OpenMP Approach to Improve the Scalability of a Phase-Field-Crystal Code
Reuben D. Budiardja (reubendb@utk.edu)
ECSS Symposium, March 19th, 2013

Project Background
- Project team (University of Michigan): Katsuyo Thornton (P.I.), Victor Chan
- Phase-field-crystal (PFC) formulation to study the dynamics of various metal systems
- Original in-house code written in C++
- Has been run on 2D and 3D systems
- Solves multiple Helmholtz equations, a reduction, then an explicit time step

Solving the Helmholtz Equations
∇²φ + k²φ = 0
- Originally used GMRES with an Algebraic Multigrid (AMG) preconditioner from HYPRE
- In 3D, the discretization matrix is large and may become indefinite, making it difficult to solve and requiring many iterations
- Poor weak-scaling results
- Prohibitively long run times in the indefinite-matrix case
- Memory requirements increase with the number of iterations

Goal
- Scalable to solve larger problems
  – Weak scaling: maintain the time-to-solution with an increasing number of processes and a fixed problem size per process
- Decrease the time-to-solution to 1 sec/time step
  – Strong scaling: decrease the time-to-solution with an increasing number of processes and a fixed problem size
  – Exploit other parallelism (with OpenMP?)
  – Investigate a better preconditioner
  – Use a different method (library?) to solve the equations

Complex Iterative Jacobi Solver
- Hadley, G. R., "A complex Jacobi iterative method for the indefinite Helmholtz equation," J. Comp. Phys. 203 (2005) 358-370
- Replaced HYPRE
- A modification of the standard Jacobi method: H^(n+1) is obtained from H^n using the complex step Δl_i and the difference operator δ_i², where δ_i² is computed with centered differences (see the sketch below)
- Easily parallelized, with low memory requirements
- Convergence rate depends on resolution but is roughly constant from problem to problem, so a larger problem (with similar resolution) should not increase the iteration count
- A draft version was quickly implemented by the project team (Victor Chan) and tested
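The slide names only the ingredients of the update (H^n, Δl_i, δ_i²) and defers the exact formula to Hadley (2005), so the following is just a minimal, hedged C++ sketch of a Jacobi-style sweep for a centered-difference Helmholtz operator. The Grid3D container, jacobi_sweep, and the complex parameter dl are invented for illustration; they are not the PFC code's data structures and dl's value is not the published prescription. f is the source term (identically zero for the homogeneous equation shown on the earlier slide).

  // Minimal sketch of a Jacobi-style sweep with a complex step, not the project's actual solver.
  #include <complex>
  #include <cstddef>
  #include <vector>

  using cplx = std::complex<double>;

  struct Grid3D {
    int n1, n2, n3;        // interior cells in each direction
    double h;              // uniform grid spacing
    std::vector<cplx> v;   // values including one ghost layer on each side

    Grid3D(int n1_, int n2_, int n3_, double h_)
      : n1(n1_), n2(n2_), n3(n3_), h(h_),
        v((std::size_t)(n1_ + 2) * (n2_ + 2) * (n3_ + 2)) {}

    cplx&       operator()(int i, int j, int k)       { return v[index(i, j, k)]; }
    const cplx& operator()(int i, int j, int k) const { return v[index(i, j, k)]; }

  private:
    std::size_t index(int i, int j, int k) const {
      return ((std::size_t)i * (n2 + 2) + j) * (n3 + 2) + k;
    }
  };

  // One Jacobi-style sweep for (∇² + k²)φ = f with centered second differences.
  // "dl" stands in for the complex iteration parameter; its value and the precise
  // update rule should be taken from Hadley (2005), not from this sketch.
  void jacobi_sweep(Grid3D& phi, const Grid3D& f, double k2, cplx dl) {
    Grid3D next = phi;                       // Jacobi reads only old values
    const double inv_h2 = 1.0 / (phi.h * phi.h);
    for (int i = 1; i <= phi.n1; ++i)
      for (int j = 1; j <= phi.n2; ++j)
        for (int k = 1; k <= phi.n3; ++k) {
          cplx lap = (phi(i+1,j,k) + phi(i-1,j,k)
                    + phi(i,j+1,k) + phi(i,j-1,k)
                    + phi(i,j,k+1) + phi(i,j,k-1)
                    - 6.0 * phi(i,j,k)) * inv_h2;     // centered-difference Laplacian
          cplx residual = f(i,j,k) - (lap + k2 * phi(i,j,k));
          next(i,j,k) = phi(i,j,k) + dl * residual;   // complex-step Jacobi update
        }
    phi = next;   // ghost values are refreshed separately by the MPI ghost-cell exchange
  }

The sweep touches each cell once with a fixed stencil and keeps only two copies of the field, which is the "easily parallelized, low memory" property claimed on the slide; the outer convergence loop and the ghost-cell exchange are left to the caller.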
Profiling the Code with CrayPAT
- Measure before you optimize
- Can use sampling or tracing
- Using CrayPAT is simple: load the module, re-compile, build the instrumented code, re-run
- CrayPAT can trace only a specified group, e.g. mpi, io, heap, fftw, ...

  module load perftools
  make clean
  make
  pat_build -g mpi pfc_jacobi.exe
  aprun -n 48 pfc_jacobi.exe+pat
  pat_report -o profile.txt output_data.xf

That Should Have Worked!

CrayPAT Workaround
- Use the API for "fine-grain" instrumentation
- Add PAT_region_{begin/end} calls to most subroutines
- After narrowing down to a couple of major subroutines, split the labels into "computation" and "communication"

  #include <pat_api.h>
  ...
  void Complex_Jacobi(…) {
    ...
    int PAT_ID, ierr;

    PAT_ID = 41;
    ierr = PAT_region_begin(PAT_ID, "communication");
    MPI_Internal_Communicate(…);
    MPI_Boundary_Communicate(…);
    ierr = PAT_region_end(PAT_ID);

    PAT_ID = 42;
    ierr = PAT_region_begin(PAT_ID, "computation");
    for (int i = 1; i < size.L1 + 1; i++) {
      for (int j = 1; j < size.L2 + 1; j++) {
        for (int k = 1; k < size.L3 + 1; k++) {
          residual(i,j,k) = (1.0/D) * (...);
        }
      }
    }
    ierr = PAT_region_end(PAT_ID);

- Finding: the communication subroutine eventually dominates at a certain MPI size

Cell Update and MPI Communication
- Step n: compute differences and update cell values
- Step n: communicate the updated values to the neighboring ghost cells (using MPI_Sendrecv())
- Step n+1: compute differences and update cell values
(Figure: cell updates and ghost-cell exchanges across iterations r, r+1, r+2)
- There is a fixed communication cost in every iteration. Can we hide it?

Hiding Communication Cost
- Step n: post MPI_Irecv() for the ghost cells; compute differences and update cell values on the surface cells
- Step n: send the surface-cell values with MPI_Isend(); compute differences and update cell values on the inner cells
- Step n+1: compute differences and update cell values on the surface cells; post MPI_Irecv() for the ghost cells
(Figure: overlapped communication and computation across iterations r, r+1, r+2; a code sketch of this pattern follows below)
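To make the pattern above concrete, here is a minimal, hedged MPI/C++ sketch of one overlapped iteration, assuming a 1-D domain decomposition along the slowest axis with one ghost plane on each side. The names (overlapped_iteration, update_surface, update_interior) and the buffer layout are invented for this sketch; the slides only name MPI_Irecv(), MPI_Isend(), and the surface/inner-cell split. The receives are posted here after the surface update so the ghost buffers are not overwritten while they are still being read; the slides show the Irecv being posted at the start of the step.

  // Sketch of hiding the halo-exchange cost with non-blocking MPI (not the project's code).
  // field holds plane_size*(nplanes+2) values: plane 0 and plane nplanes+1 are ghost planes,
  // planes 1 and nplanes are the "surface" planes, planes 2..nplanes-1 are the inner cells.
  #include <cstddef>
  #include <vector>
  #include <mpi.h>

  // Hypothetical stencil kernels: update_surface touches planes 1 and nplanes,
  // update_interior touches planes 2..nplanes-1. Their bodies are omitted in this sketch.
  void update_surface (std::vector<double>&, int, int) {}
  void update_interior(std::vector<double>&, int, int) {}

  void overlapped_iteration(std::vector<double>& field, int plane_size, int nplanes,
                            int left, int right,   // neighbor ranks; MPI_PROC_NULL at domain ends
                            MPI_Comm comm)
  {
    double* ghost_lo   = field.data();
    double* surface_lo = field.data() + (std::size_t)plane_size;
    double* surface_hi = field.data() + (std::size_t)plane_size * nplanes;
    double* ghost_hi   = field.data() + (std::size_t)plane_size * (nplanes + 1);

    // Update the surface cells first; they use ghost values completed in the previous iteration.
    update_surface(field, plane_size, nplanes);

    // Post receives for the neighbors' updated surface values, then send ours with MPI_Isend().
    MPI_Request reqs[4];
    MPI_Irecv(ghost_lo,   plane_size, MPI_DOUBLE, left,  0, comm, &reqs[0]);
    MPI_Irecv(ghost_hi,   plane_size, MPI_DOUBLE, right, 1, comm, &reqs[1]);
    MPI_Isend(surface_lo, plane_size, MPI_DOUBLE, left,  1, comm, &reqs[2]);
    MPI_Isend(surface_hi, plane_size, MPI_DOUBLE, right, 0, comm, &reqs[3]);

    // Update the inner cells while the halo exchange proceeds in the background.
    update_interior(field, plane_size, nplanes);

    // Complete the exchange before the next iteration reads the ghost cells again.
    MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);
  }

Deferring MPI_Waitall() until after the inner-cell update is what lets the exchange run concurrently with the bulk of the computation, so the fixed per-iteration communication cost of the MPI_Sendrecv() version is hidden whenever the inner-cell work takes longer than the message transfer.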