PAPI 3.0.8.1 on Blue Gene/L
Using network performance counters to lay out tasks for improved performance

Presentation overview
- Project objectives
- PAPI explanation
- Blue Gene/L explanation
- Current state of research

Project objectives
- Upgrade PAPI on BG/L
- Provide an interface for the network counters
- Allow Lawrence Livermore National Lab users to also have access to PAPI
- Use network counters to place tasks optimally on BG/L

PAPI – Intro
[Figure, courtesy of …]

PAPI – Intro (continued)
- PAPI is useful for profiling your own programs
- Many tools are based on PAPI:
  - PapiEx – command-line measurement tool
  - PerfSuite – aggregate measurement and statistical profiling package and API
  - HPCToolkit – statistical profiling package
  - Many more!

PAPI – Supported platforms
- IBM – POWER3, 604, 604e, POWER4
- Cray T3E, Cray X1
- AMD – Athlon, Opteron
- Intel – P1 to P4, Itanium I and II
- UltraSPARC I, II & III
- MIPS R10K, R12K, R14K
- Alpha

PAPI – Generic Interface
Call sequence for the generic interface:
- PAPI_library_init – initialize memory for PAPI's data structures
- PAPI_create_eventset – create an empty list of events
- PAPI_add_event – add events to be counted
- PAPI_start – begin counting all events within the specified event set
- PAPI_stop – stop all counters and read their current values

PAPI – Events: Presets
- Presets – a list of predefined events, implemented on all systems where they can be supported
- Not all presets are available on every architecture (e.g., BG/L has no cache lower than L3, so the L1 cache hit preset is not applicable)
- Native events form the basic building blocks for PAPI presets

PAPI – Events: Presets (continued)
[Figure, courtesy of …]

PAPI – Events: Native
- In addition to the predefined PAPI preset events, the PAPI library also exposes a majority of the events native to each platform
- Native events can be added to event sets in the same manner as presets

PAPI – Internals
- The array of event sets is the main portion of PAPI's internals

PAPI – Other features
- Multiplexing – for when there are not enough hardware counters for all requested events
- Thread safe – profiling is thread safe
- Overflow detection – hardware counters have limited space

PAPI – PAPI 2 vs PAPI 3
- PAPI 3 significantly reduced the overheads of starting, stopping, and reading the counters
[Figure, courtesy of …]

PAPI – PAPI 2 vs PAPI 3 (continued)
- Better native event support in PAPI 3
- Better thread support in PAPI 3
- Overflow and profiling enhancements in PAPI 3
- Myriad bug fixes and code cleanup in PAPI 3

PAPI – PAPI 2 vs PAPI 3 (continued)
- Overlapping event sets supported in PAPI 2
- Minor changes in the API – mostly dereferencing variables

Blue Gene/L – Intro
- 65,536 nodes connected in a 64 x 32 x 32 3D torus
- Nodes are made up of PowerPC 440 embedded processors
- Smaller than most supercomputers
- Consumes less power

Blue Gene/L – Networks
- 3D torus network (node to node)
- Tree network (broadcasts)

Blue Gene/L – HW counters
- 48 universal performance counters
- 4 floating-point unit counters
- Counters are 32-bit, so virtual counters must be used to prevent overflow

Research – Overall goals
- Network hardware counters are new
- Use network counters to determine traffic between tasks
- Try to optimize the placement of tasks to minimize communication latency
- Given counts and distances: cost = counts * distance. Minimize the total cost over all nodes

Research – Counting
- First goal: determine what is being counted

Research – Networks
- For each MPI call, determine which network counters are being used
- The tree is supposed to be used for broadcasts
- The torus is supposed to be used for point-to-point communication
- There are ambiguities in the specification

Research – Future decisions
- How to profile a target application:
  - Manually insert PAPI instrumentation: a lot of work
  - Instrument binaries with counting code
- What information to store:
  - All counts on each node: a lot of data
  - A sample of all nodes: not as accurate (what if the tasks behave or communicate differently?)

Research – Future decisions (continued)
- How to use the collected information:
  - Profile an application to obtain counter feedback and determine an optimized static task layout
  - Dynamically migrate tasks in response to the counters
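The Blue Gene/L HW counters slide notes that the 32-bit counters must be backed by virtual counters to prevent overflow. The following is a minimal sketch of one common way to do this, not the actual PAPI or BG/L implementation. It assumes the raw 32-bit counter is sampled at least once per wraparound period, and it accumulates unsigned 32-bit deltas into a 64-bit total. The names `vcounter_t`, `vcounter_init`, and `vcounter_update` are hypothetical and are not part of PAPI or any BG/L API.

```c
#include <stdint.h>

/* Hypothetical 64-bit virtual counter layered on a 32-bit hardware counter. */
typedef struct {
    uint32_t last_raw;  /* raw hardware value at the previous sample */
    uint64_t total;     /* accumulated 64-bit count since init */
} vcounter_t;

/* Start a virtual counter from the current raw hardware reading. */
static void vcounter_init(vcounter_t *vc, uint32_t raw) {
    vc->last_raw = raw;
    vc->total = 0;
}

/* Fold a new raw reading into the 64-bit total and return the total.
 * Unsigned 32-bit subtraction is modulo 2^32, so a single wraparound
 * between two samples is handled automatically. Two or more wraps
 * between samples cannot be detected, so the caller (assumption) must
 * sample often enough, e.g. from a periodic timer. */
static uint64_t vcounter_update(vcounter_t *vc, uint32_t raw) {
    uint32_t delta = raw - vc->last_raw;
    vc->total += delta;
    vc->last_raw = raw;
    return vc->total;
}
```

For example, starting from a raw value of 4294967000, a reading of 4294967295 adds 295. A later reading of 100, which means the hardware counter has wrapped, adds 101, for a running total of 396. The design keeps the hardware counter free-running and never resets it. Only the software total needs 64 bits.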
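The overall-goals slide defines the placement objective as cost = counts * distance, minimized over all nodes. The following is a minimal sketch of that objective, under three assumptions:
- per-pair message counts have already been gathered into a task-by-task matrix;
- each task runs on its own node;
- "distance" is the minimal hop count on a 3D torus with wraparound links.

The pairwise-swap local search is one simple heuristic chosen for illustration, not the project's method. The small 4 x 2 x 1 torus and all function names (`ring_dist`, `torus_hops`, `layout_cost`, `improve_by_swaps`) are hypothetical.

```c
#include <stdlib.h>

/* Hypothetical torus size for illustration; the full BG/L torus is 64 x 32 x 32. */
enum { DX = 4, DY = 2, DZ = 1, NNODES = DX * DY * DZ };

/* Hop distance along one ring of size n, taking the shorter way around. */
static int ring_dist(int a, int b, int n) {
    int d = abs(a - b);
    return d < n - d ? d : n - d;
}

/* Minimal hop count between two node indices on the DX x DY x DZ torus. */
static int torus_hops(int u, int v) {
    int ux = u % DX, uy = (u / DX) % DY, uz = u / (DX * DY);
    int vx = v % DX, vy = (v / DX) % DY, vz = v / (DX * DY);
    return ring_dist(ux, vx, DX) + ring_dist(uy, vy, DY) + ring_dist(uz, vz, DZ);
}

/* cost = sum over ordered task pairs (i, j) of counts[i][j] * hops between
 * their nodes. place[t] is the node index of task t. counts is not modified. */
static long layout_cost(long counts[NNODES][NNODES], int place[NNODES]) {
    long cost = 0;
    for (int i = 0; i < NNODES; i++)
        for (int j = 0; j < NNODES; j++)
            cost += counts[i][j] * torus_hops(place[i], place[j]);
    return cost;
}

/* Greedy pairwise-swap local search: swap the nodes of two tasks and keep the
 * swap only if the total cost drops; repeat until no swap helps. Each accepted
 * swap strictly lowers a non-negative integer cost, so the loop terminates,
 * but the result is a local minimum, not necessarily the global one. */
static long improve_by_swaps(long counts[NNODES][NNODES], int place[NNODES]) {
    long best = layout_cost(counts, place);
    int improved = 1;
    while (improved) {
        improved = 0;
        for (int a = 0; a < NNODES; a++) {
            for (int b = a + 1; b < NNODES; b++) {
                int tmp = place[a]; place[a] = place[b]; place[b] = tmp;
                long c = layout_cost(counts, place);
                if (c < best) {
                    best = c;
                    improved = 1;
                } else {
                    /* No gain: undo the swap. */
                    tmp = place[a]; place[a] = place[b]; place[b] = tmp;
                }
            }
        }
    }
    return best;
}
```

As a usage example, suppose tasks 0 and 2 exchange 10 messages in each direction and tasks start on nodes 0..7 in order. The two tasks are then 2 hops apart, so the cost is 40. The first swap moves task 0 next to task 2 and drops the cost to 20, which is the minimum possible because distinct nodes are at least 1 hop apart. Full-cost recomputation per swap is O(N^2). That is fine for a sketch, but a real 65,536-node layout would need incremental cost updates or a stronger heuristic.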