Deep Learning Courseware: Deep Reinforcement Learning

Introduction to Deep Reinforcement Learning
Yen-Chen Wu
2015/12/11

Outline
• Reinforcement Learning
• Markov Decision Process
• How to Solve MDPs
  – DP
  – MC
  – TD
  – Q-learning (DQN)
• Paper Review

REINFORCEMENT LEARNING

Branches of Machine Learning

What makes it different?
• There is no supervisor, only a reward signal
• Feedback is delayed, not instantaneous
• Time really matters (sequential, non-i.i.d. data)
• Agent’s actions affect the subsequent data it receives

Goal: Maximize Cumulative Reward
• Actions may have long-term consequences
• Reward may be delayed
• It may be better to sacrifice immediate reward to gain more long-term reward

Agent & Environment
→ ← ↑ ↓ Defense Attack Jump

MARKOV DECISION PROCESS
• Markov Processes
• Markov Reward Processes
• Markov Decision Processes

Markov Process

Markov Reward Processes

Markov Decision Process

Markov Decision Process (MDP)
• S: finite set of states (observations)
• A: finite set of actions
• P: transition probability
• R: immediate reward
• γ: discount factor
• Goal:
  – Choose policy π
  – Maximize expected return

HOW TO SOLVE MDP
• Dynamic Programming
• Monte-Carlo
• Temporal-Difference
• Q-Learning

Model-based
• Dynamic Programming
  – Evaluate policy
  – Update policy

Model Free
• Unknown Transition Probability & Reward
• MC vs TD

Model Free: Q-learning
• Instead of tabular
• Optimal action-value function (Q-learning)
• Bellman equation

Basic idea: iterative update (lack of generalization)
In practice: function approximator
Linear? Using DNN!

DEEP Q-NETWORK (DQN)

Video
• =LJ4oCb6u7kk

Deep Q-Network
• Compute Q-values for all actions

Input: 84x84x4
Convolves 32 filters of 8x8 with stride 4
Convolves 64 filters of 4x4 with stride 2
Convolves 64 filters of 3x3 with stride 1
Fully connected 512 nodes
Output a node for each action

Update DQN
• Loss function
• Gradient

Two Techniques
• Experience Replay
  – Experience
  – Pooled Memory
    • Data efficiency (bootstrap)
    • Avoid correlation between samples (variance between batches)
    • Off-policy is suitable for Q-learning
  – Random sampled mini-batch
  – Prioritized sweeping (active learning)
• Separate Target Network
  – More stable than online learning

             Example   Learn the value of …                                   Pros & Cons
On-policy    SARSA     policy being carried out by the agent                  Fast but weak
Off-policy   DQN       optimal policy independently of the agent's actions    Slow but robust

DEMO

PAPER REVIEW

Paper list
• Massively Parallel Methods for Deep Reinforcement Learning
• Continuous control with deep reinforcement learning
• Deep Reinforcement Learning with Double Q-learning
• Policy Distillation
• Dueling Network Architectures for Deep Reinforcement Learning
• Multiagent Cooperation and Competition with Deep Reinforcement Learning

Massively Parallel Methods for Deep Reinforcement Learning
Arun Nair
arXiv:1507.04296

DDPG (Deterministic Policy Gradient)
• DDAC (Deep Deterministic Actor-Critic)

Continuous control with deep reinforcement learning
Timothy P. Lillicrap
arXiv:1509.02971
• Soft target

Dueling Network

Multiagent
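
The Q-learning slides describe an iterative update of the action-value function, but their equations did not survive in this copy. As a rough illustration only, the sketch below uses the standard textbook tabular rule Q(s,a) ← Q(s,a) + α [r + γ max_a' Q(s',a') − Q(s,a)]. The corridor environment, the hyperparameters, and every name in the code (`step`, `q_learning`, `q_table`, `greedy_policy`) are assumptions made for this example and do not come from the slides.

```python
import random

# Hypothetical toy MDP (not from the slides): a 5-state corridor.
# States 0..4, actions 0 = left, 1 = right. Reaching state 4 gives
# reward 1 and ends the episode; every other step gives reward 0.
N_STATES = 5
ACTIONS = (0, 1)
GOAL = N_STATES - 1


def step(state, action):
    """Deterministic transition: returns (next_state, reward, done)."""
    next_state = max(0, state - 1) if action == 0 else min(GOAL, state + 1)
    done = next_state == GOAL
    return next_state, (1.0 if done else 0.0), done


def q_learning(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Tabular Q-learning with an epsilon-greedy behaviour policy."""
    rng = random.Random(seed)
    q = [[0.0 for _ in ACTIONS] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            if rng.random() < epsilon:
                # Explore: pick a uniformly random action.
                action = rng.choice(ACTIONS)
            else:
                # Exploit: pick a greedy action, breaking ties at random.
                best = max(q[state])
                action = rng.choice([a for a in ACTIONS if q[state][a] == best])
            next_state, reward, done = step(state, action)
            # Off-policy target: bootstrap from the best next action,
            # regardless of what the behaviour policy does next.
            target = reward if done else reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q


q_table = q_learning()
# Greedy action in each non-terminal state after training.
greedy_policy = [max(ACTIONS, key=lambda a: q_table[s][a]) for s in range(GOAL)]
```

In this toy setting the greedy policy moves right in every non-terminal state, and `q_table[s][1]` approaches γ^(3−s). A table like this has one entry per state-action pair, which is the "lack of generalization" the slides point to as motivation for replacing it with a function approximator such as a deep network.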
