Introduction to Deep Reinforcement Learning
Yen-Chen Wu
2015/12/11

Outline
• Reinforcement Learning
• Markov Decision Process
• How to Solve MDPs
  – DP
  – MC
  – TD
  – Q-learning (DQN)
• Paper Review

REINFORCEMENT LEARNING

Branches of Machine Learning

What makes reinforcement learning different?
• There is no supervisor, only a reward signal
• Feedback is delayed, not instantaneous
• Time really matters (sequential, non-i.i.d. data)
• Agent's actions affect the subsequent data it receives

Goal: Maximize Cumulative Reward
• Actions may have long-term consequences
• Reward may be delayed
• It may be better to sacrifice immediate reward to gain more long-term reward

Agent & Environment
[Figure: agent actions →, ←, ↑, ↓, Defense, Attack, Jump]

MARKOV DECISION PROCESS

• Markov Processes
• Markov Reward Processes
• Markov Decision Processes

Markov Process

Markov Reward Processes

Markov Decision Process

Markov Decision Process (MDP)
• S: finite set of states (observations)
• A: finite set of actions
• P: transition probability
• R: immediate reward
• γ: discount factor
• Goal:
  – Choose policy π
  – Maximize expected return

HOW TO SOLVE MDP

• Dynamic Programming
• Monte-Carlo
• Temporal-Difference
• Q-Learning

Model-based
• Dynamic Programming
  – Evaluate policy
  – Update policy

Model Free
• Unknown Transition Probability & Reward
• MC vs TD

Model Free: Q-learning
• Instead of tabular
• Optimal action-value function (Q-learning)
• Bellman equation

Basic idea: iterative update (lack of generalization)
In practice: function approximator
Linear? Using DNN!

DEEP Q-NETWORK (DQN)

Video
• =LJ4oCb6u7kk

Deep Q-Network
• Compute Q-values for all actions
  Input: 84x84x4
  Convolves 32 filters of 8x8 with stride 4
  Convolves 64 filters of 4x4 with stride 2
  Convolves 64 filters of 3x3 with stride 1
  Fully connected, 512 nodes
  Output: a node for each action

Update DQN
• Loss function
• Gradient

Two Techniques
• Experience Replay
  – Experience
  – Pooled Memory
    • Data efficiency (bootstrap)
    • Avoid correlation between samples (variance between batches)
    • Off-policy is suitable for Q-learning
  – Randomly sampled mini-batch
  – Prioritized sweeping (active learning)
• Separate Target Network
  – More stable than online learning

              Example   Learn the value of…                                   Pros & Cons
On-policy     SARSA     policy being carried out by the agent                 Fast but weak
Off-policy    DQN       optimal policy independently of the agent's actions   Slow but robust

DEMO

PAPER REVIEW

Paper list
• Massively Parallel Methods for Deep Reinforcement Learning
• Continuous control with deep reinforcement learning
• Deep Reinforcement Learning with Double Q-learning
• Policy Distillation
• Dueling Network Architectures for Deep Reinforcement Learning
• Multiagent Cooperation and Competition with Deep Reinforcement Learning

Massively Parallel Methods for Deep Reinforcement Learning
Arun Nair
arXiv:1507.04296

DDPG (Deterministic Policy Gradient)
• DDAC (Deep Deterministic Actor-Critic)

Continuous control with deep reinforcement learning
Timothy P. Lillicrap
arXiv:1509.02971
• Soft target

Dueling Network

Multiagent
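
Appendix sketch: tabular Q-learning

The "Model Free: Q-learning" slide describes the basic idea as an iterative update of the action-value function. A minimal sketch of that update in its tabular form (not the DQN itself) might look like the following. The environment `chain_step`, the function name `q_learning`, and every hyperparameter value (`alpha`, `gamma`, `epsilon`, episode counts) are assumptions made up for illustration and do not come from the slides.

```python
import random


def q_learning(n_states, n_actions, step, episodes=2000, alpha=0.1,
               gamma=0.9, epsilon=0.1, max_steps=100, seed=0):
    """Tabular Q-learning with an epsilon-greedy behaviour policy.

    `step(s, a)` is a hypothetical environment function that returns
    (next_state, reward, done). Every episode starts in state 0.
    """
    rng = random.Random(seed)
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(episodes):
        s = 0
        for _ in range(max_steps):
            # Explore with probability epsilon, otherwise act greedily
            # (ties between equally good actions are broken at random).
            if rng.random() < epsilon:
                a = rng.randrange(n_actions)
            else:
                best = max(Q[s])
                a = rng.choice([i for i, q in enumerate(Q[s]) if q == best])
            s_next, r, done = step(s, a)
            # Bootstrapped target: r + gamma * max_a' Q(s', a'),
            # or just r when the episode has ended.
            target = r if done else r + gamma * max(Q[s_next])
            # Move Q(s, a) a fraction alpha toward the target.
            Q[s][a] += alpha * (target - Q[s][a])
            s = s_next
            if done:
                break
    return Q


def chain_step(s, a):
    """Toy 5-state chain (an assumption, not from the slides).

    Action 1 moves right, action 0 moves left. Reaching state 4 gives
    reward 1 and ends the episode; every other transition gives 0.
    """
    s_next = min(s + 1, 4) if a == 1 else max(s - 1, 0)
    done = s_next == 4
    return s_next, (1.0 if done else 0.0), done


Q = q_learning(n_states=5, n_actions=2, step=chain_step)
# Greedy action in each non-terminal state after training.
greedy = [max(range(2), key=lambda a: Q[s][a]) for s in range(4)]
```

The update bootstraps from the greedy value of the next state regardless of which action the behaviour policy actually took, which is what makes Q-learning off-policy. A table like `Q` cannot generalize across states, which is the limitation the slides give as the reason for moving to a function approximator such as a DNN.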