Software Tuning Fundamentals
Chen Jian (陈健)
2003/3

Why Tune?
Same code, different performance

SELF      RELEASE   OPT:4     IMSL      CXML      ATLAS     MKL50     MKL51
16.676s   5.445s    5.457s    10.996s   3.328s    0.762s    0.848s    0.738s

    for (i = 0; i < NUM; i++) {
        for (j = 0; j < NUM; j++) {
            for (k = 0; k < NUM; k++) {
                c[i][j] = c[i][j] + a[i][k] * b[k][j];
            }
        }
    }

    for (i = 0; i < NUM; i++) {
        for (k = 0; k < NUM; k++) {
            for (j = 0; j < NUM; j++) {
                c[i][j] = c[i][j] + a[i][k] * b[k][j];
            }
        }
    }

Objectives
Identify the main tasks of performance tuning
Define some important performance-tuning terms
Use Intel tools to help

Agenda
Performance Cycle Overview
– The Performance Cycle
– When to Start
– Performance Gains
– When to Stop
– Putting it into Perspective
Performance Cycle Details
Summary

The Tuning Cycle
Gather performance data (start here)
Analyze the data and draw conclusions
Determine changes that resolve the issues
Modify the code to implement the optimization
Test the results

When (Why) to Start
User requirement?
Software vendor requirement?
Put the performance requirement into the requirements document
Performance should be considered at every stage of the product life cycle (requirements gathering, design, and testing)
Exception: do "code tuning" only after a simple, readable, non-optimized version of the application exists.

Effort vs. Results
[Chart: Performance plotted against Effort, showing Theoretical Performance, Required Performance, Performance Attained w/ Tools, and Performance Attained w/o Tools]

When to Stop
Architecture is at maximum efficiency?
– Be sure you know what this is: calculate the theoretical maximum performance
Requirement is satisfied
Incrementally do Wide Mesh Optimizations until done

Tuning Principles
"We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil." (Donald Knuth)
Quality code is:
– Portable
– Readable
– Maintainable
– Reliable
Intelligently sacrifice quality for performance

Agenda
Performance Cycle Overview
Performance Cycle Details
– Gather Performance Data
– Analyze Data and Identify Issues
– Generate Alternatives to Resolve Issues
– Implement Enhancements
Summary

Gather Performance Data
Timer
– Use to get wall-clock time
– Accuracy, low overhead
Use Intel® VTune™ Performance Analyzer
– Profiler: gather information about code usage
– Performance Monitor: gather information about system resource usage

Workload
A good workload should have these characteristics:
– measurable
– reproducible
– static
– representative

Analyze Data and Draw Conclusions
Baseline current performance
Examine hot spots
Identify bottlenecks
Calculate potential maximum performance

Examine Hot Spots
The Pareto Principle, a.k.a. the 80/20 Rule
– Concentrate on the vital few vs. the trivial many
Hot spot: the part of an application or system that accounts for most of the computation
Generally consists of a loop
For applications that don't have hot spots, examine:
– Memory layout
– Exceptions
– Effective compiler usage

Additional Topics
Big O
Utilization, efficiency, throughput, latency
Bottlenecks
– I/O, memory, CPU
MIPS / FLOPS / CPI
Concurrency, parallelism
Scalability
Loads/stores per calculation

Agenda
Performance Cycle Overview
Performance Cycle Details
– Gather Performance Data
– Analyze Data and Identify Issues
– Generate Alternatives to Resolve Issues
– Implement Enhancements
Summary

Levels of Optimization Design
Problem definition
System architecture
Algorithms and data structures
Code tuning
System software
System hardware

Code Tuning
(Hardest to develop and maintain)
Assembly
Intrinsics
C++ vector class library
Multithreading
Loop transformations
Compiler and compiler options
Performance libraries
(Easiest to develop, port and maintain)

Code Tuning
If parallel processing:
– Break the algorithm up across clusters (distributed memory)
– Single-node optimization
– Break the algorithm up across processors (SMP)

Modify Code to Implement Optimizations
Use Intel® libraries
Use various compiler switches
Find out whether the compiler or hardware already performs the enhancement automatically before implementing it yourself
Modify source (i.e. loop transformations, SWP, SIMD, OpenMP, intrinsics, assembly)

Test!
Make sure the application still runs correctly (regression testing)
Make sure the enhancement actually increases performance
Calculate speed-up
Decide if you're done optimizing

Speed-Up
The two basic formulas:
Speed-Up = Baseline Time / Optimized Time
Speed-Up = Optimized Throughput / Baseline Throughput

Summary
Optimization tasks
– Gather Performance Data
– Analyze Data & Identify Issues
– Generate Alternatives to Resolve Issues
– Implement Enhancements
– Test Results
Use Intel® Software Development Tools for every step in the process
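
The two loop orderings from the opening slide, together with the timing, regression-test, and speed-up steps, can be sketched in C roughly as follows. This is a minimal illustration, not the code behind the deck's timings. The value of NUM, the input values, and every function name here (init_inputs, multiply_ijk, multiply_ikj, matrices_match, time_seconds, run_comparison, speed_up) are assumptions made for the example. It uses clock(), which reports processor time; for a single-threaded run this only approximates the wall-clock timer the deck recommends.

```c
#include <assert.h>
#include <time.h>

#define NUM 256 /* assumed size; the deck does not state the value of NUM */

typedef double matrix[NUM][NUM];

static matrix a, b;                    /* inputs */
static matrix c_baseline, c_reordered; /* one result per loop order */

/* Fill the inputs with small deterministic values (a reproducible workload). */
static void init_inputs(void) {
    for (int i = 0; i < NUM; i++) {
        for (int j = 0; j < NUM; j++) {
            a[i][j] = (double)(i + j) / NUM;
            b[i][j] = (double)(i - j) / NUM;
        }
    }
}

static void clear_matrix(matrix c) {
    for (int i = 0; i < NUM; i++)
        for (int j = 0; j < NUM; j++)
            c[i][j] = 0.0;
}

/* Baseline order: the innermost loop walks b[k][j] down a column. */
static void multiply_ijk(matrix c) {
    for (int i = 0; i < NUM; i++)
        for (int j = 0; j < NUM; j++)
            for (int k = 0; k < NUM; k++)
                c[i][j] = c[i][j] + a[i][k] * b[k][j];
}

/* Interchanged order: the innermost loop walks c[i][j] and b[k][j] along rows. */
static void multiply_ikj(matrix c) {
    for (int i = 0; i < NUM; i++)
        for (int k = 0; k < NUM; k++)
            for (int j = 0; j < NUM; j++)
                c[i][j] = c[i][j] + a[i][k] * b[k][j];
}

/* Regression check: return 1 if every element agrees within tol, else 0. */
static int matrices_match(matrix x, matrix y, double tol) {
    for (int i = 0; i < NUM; i++) {
        for (int j = 0; j < NUM; j++) {
            double d = x[i][j] - y[i][j];
            if (d < 0.0) d = -d;
            if (d > tol) return 0;
        }
    }
    return 1;
}

/* Clear c, run one multiply, and return the processor time in seconds. */
static double time_seconds(void (*multiply)(matrix), matrix c) {
    clear_matrix(c);
    clock_t start = clock();
    multiply(c);
    clock_t stop = clock();
    return (double)(stop - start) / CLOCKS_PER_SEC;
}

/* Time both orders; return 1 if they produced matching results. */
static int run_comparison(double *baseline_s, double *reordered_s) {
    init_inputs();
    *baseline_s = time_seconds(multiply_ijk, c_baseline);
    *reordered_s = time_seconds(multiply_ikj, c_reordered);
    return matrices_match(c_baseline, c_reordered, 1e-9);
}

/* Speed-up from times: baseline time divided by optimized time. */
static double speed_up(double baseline_time, double optimized_time) {
    assert(optimized_time > 0.0);
    return baseline_time / optimized_time;
}
```

A caller would invoke run_comparison, confirm that it returned 1, and only then pass the two measured times to speed_up. Measured times depend heavily on the compiler, optimization flags, and cache sizes, and a very small NUM can yield a measured time of zero, so the result is best treated as a rough indication rather than a benchmark.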