Deep Learning Research and Exploration Based on the Open Standard OpenCL
Junli Gu (谷俊丽)
AMD Research, collaborated with product team
Junli.Gu@AMD.com

Outline
• Deep learning and its current state of development
• Challenges deep learning poses for system implementation
• Exploring deep learning with OpenCL

DNN Model
What is a Deep Neural Network (DNN)?
‒ 3 to 24 hidden layers, millions to billions of parameters
‒ DNN + Big Data is leading the recent direction in machine learning
Rich varieties of DNN structures
‒ MLP (Multi-layer Perceptron) / AutoEncoder
‒ CNN (Convolutional Neural Network)
‒ DBN (Deep Belief Network) / RBM (Restricted Boltzmann Machine)
Deep learning on a DNN model
‒ Parameters are randomly initialized
‒ The model is trained to convergence by feeding it large-scale data (Big Data)
DNNs became a very hot topic after winning the 2012 ILSVRC competition.
[Figure: neural network with layers Input, hidden1, hidden2, hidden3, Output; labels: neurons, weighted connection]
3 | IMPLEMENTING A LEADING LOADS PERFORMANCE PREDICTOR ON COMMODITY PROCESSORS | JUNE 19, 2014

The Deep Learning Process (DEEP LEARNING)
All features are learnt from the training data, without human interference.
[Figure: Voice, Text, Image inputs -> DNN model (Deep Learning) -> Results]
DNN for Speech
‒ 10k hours of voice data
‒ 10B training samples
‒ Months on a GPU cluster
• Deep Learning: DNN model + Big Data
• Human-defined features no longer work well for Big Data scenarios with noise.

Why Is Deep Learning Powerful?
HIERARCHICAL FEATURE EXTRACTION
Features are extracted layer by layer from the input data, forming a hierarchical representation that goes beyond what humans can define.
Features have semantic meanings.
5 | AMD DNN

Deep Learning Is Leading the Trend
Why do internet companies pursue DNN these days?
‒ The original human-defined algorithms don't work well for Big Data
‒ Companies are competing in machine learning to understand Big Data
DNN (deep neural networks) is breaking through and leading the direction
‒ Large-scale image classification / recognition / search, face recognition
‒ Online recommendation for e-commerce
‒ Voice recognition, music search, etc.
‒ E.g. image classification accuracy: 74% in 2011, 93% today
Long-term investment by industry
‒ BAT, Google, Facebook, Yahoo, Microsoft, banking and finance
‒ Google / Baidu / IBM Brain projects
DNN + Big Data is believed to be the evolutionary trend for apps & systems.

Application Example: Searching Images by Image
6 | PRESENTATION TITLE | DECEMBER 19, 2014 | CONFIDENTIAL

Challenges of Deep Learning for System Design
• Typical scale of data sets
  • Image search: 1M
  • OCR: 100M
  • Speech: 10B, CTR: 100B
  • Data is projected to grow 10X per year
• DNN model training time
  • Weeks to months on GPU clusters
  • Trained DNNs are then deployed on the cloud
• The system is the final enabler
  • Current platforms run into bottlenecks
  • CPU clusters to CPU + GPU clusters
  • Looking at dGPUs, APUs, FPGAs, ASICs, etc.
[Figure: Big Data input -> DNN model -> Answer; note: DNN is compute and memory intensive, thus clusters]

Deep Learning Will Be Everywhere
DNNs Everywhere!
People want the same code to run on different platforms.
[Figure, credit to Ren Wu of Baidu. Platforms: supercomputers; datacenters; tablets and smartphones; wearable devices; IoTs. Usage stages: offline training; deployed on cloud; online on mobiles, wearables and IoTs. Scale labels, in figure order: 1000s of GPUs; 100k-1M servers; 700M (in China); billions?]
Deep learning is applied to a tremendous range of application scenarios and device platforms.
A deep learning system should consider cross-platform compatibility and portability.

OpenCL, the Open Standard
OpenCL-based Open ECO-SYSTEM
• OpenCL is the industry's open standard for heterogeneous computing
  Diverse industry participation, from cell phones to supercomputers*
  Supports cross-platform compatibility and portability
  Processor vendors, system OEMs, middleware vendors, application developers
• OpenCL is the industry standard embraced by many companies
  One version of the code runs on CPUs, GPUs, APUs and accelerators from all vendors
  Broad support from different companies
We believe a deep learning system should be built on OpenCL.
*Courtesy of Simon McIntosh-Smith and Tom Deakin

AMD DNN: OpenCL-Based Deep Learning Implementation
Project goal: tackle DNN challenges from H/W to system to applications
Layer 1, H/W: heterogeneous platform implementation and speedup
‒ OpenCL implementation and performance optimizations
Layer 2, Systems: scale out to distributed systems
Layer 3, App.: DNN + Big Data applications
[Figure: three-layer stack, Layer 3 (top) to Layer 1 (bottom). Applications: build image apps/demos. Systems: design parallel scheme for cluster. Heterogeneous computing: OpenCL implementation and optimizations.]
10 | DNN PROJECT

OpenCL Implementation Details
[Figure: OpenCL execution model. A Context spans the CPU and GPU (OpenCL devices) and holds Programs, Kernels, Memory objects and Command-queues. Programs: the source "kernel void dp_mul(global const float *a, ……" is compiled into a dp_mul CPU program binary and a dp_mul GPU program binary. Kernels: dp_mul with arg[0] value, arg[1] value, arg[2] value. Memory objects: Images, Buffers. Command-queues: In-order queue, Out-of-order queue. Steps: compile code; create data and arguments; send to execution.]

Challenges of an OpenCL Implementation
OpenCL uses runtime compiling
‒ It allows the code to target various H/W devices and optimizes kernels accordingly
‒ Tradeoff: runtime compiling takes computation time

Challenges of an OpenCL Implementation
Heavy OpenCL H/W details
‒ Domain experts usually do not want to deal with hardware details (devices, cache, memory, etc.)
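
To make the dp_mul kernel in the execution-model diagram concrete, here is a minimal sketch in plain C that emulates it without an OpenCL runtime. Several things are assumptions, not taken from the slides: the deck shows only the start of the kernel signature and elides the rest with "……", so the second and third parameters and the element-wise product body are assumed (they match the three kernel arguments arg[0] to arg[2] in the diagram). The `KERNEL` and `GLOBAL` macros, `current_global_id`, and `enqueue_dp_mul` are hypothetical stand-ins for the OpenCL C qualifiers and for a 1-D NDRange enqueue on a command queue.

```c
#include <assert.h>
#include <stddef.h>

/* Plain-C stand-ins for the OpenCL C qualifiers `kernel` and `global`,
   so this file builds with an ordinary C compiler (assumption). */
#define KERNEL
#define GLOBAL

/* Emulated work-item index. A real OpenCL runtime supplies this value
   separately to every work-item. */
static size_t current_global_id = 0;

static size_t get_global_id(unsigned dim)
{
    (void)dim; /* only dimension 0 is emulated */
    return current_global_id;
}

/* Assumed kernel: the slide elides everything after the first parameter.
   Each work-item multiplies one pair of elements. */
KERNEL void dp_mul(GLOBAL const float *a, GLOBAL const float *b, GLOBAL float *c)
{
    size_t id = get_global_id(0);
    c[id] = a[id] * b[id];
}

/* Hypothetical helper: runs the kernel once per work-item, serially.
   On a real device the work-items would execute in parallel. */
void enqueue_dp_mul(const float *a, const float *b, float *c, size_t global_size)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t i = 0; i < global_size; ++i) {
        current_global_id = i;
        dp_mul(a, b, c);
    }
}
```

Calling `enqueue_dp_mul(a, b, c, 3)` with a = {1, 2, 3} and b = {4, 5, 6} fills c with the element-wise products {4, 10, 18}. In a real OpenCL program the kernel source would instead be compiled at run time for each device in the context, which is the runtime-compilation cost the slides mention.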