Artificial Intelligence
Reinforcement Learning Paradigm
Wang Wenmin
School of Electronic and Computer Engineering, Peking University

Contents:
11. Paradigms in Machine Learning
  11.1. Supervised Learning Paradigm
  11.2. Unsupervised Learning Paradigm
  11.3. Reinforcement Learning Paradigm
  11.4. Relations and Other Paradigms

Contents:
11.3. Reinforcement Learning Paradigm
  11.3.1. Overview of Reinforcement Learning
  11.3.2. Types of Reinforcement Learning
  11.3.3. New Algorithms of Reinforcement Learning
  11.3.4. Applications of Reinforcement Learning

11.3.1. Overview of Reinforcement Learning

What is Reinforcement Learning
- In reinforcement learning (RL), the learner is a decision-making agent that takes actions in an environment and receives rewards for its actions.
- After a set of trial-and-error runs, the agent should learn the best policy: the one that maximizes its reward over the course of actions and interactions with the environment.
- [Figure: the agent-environment loop. The agent takes actions A in the environment; the environment returns the state S and a reward R; T denotes the state transition.]
- Reinforcement learning is inspired by behaviorist psychology. It is concerned with how agents take actions in an environment so as to maximize some notion of cumulative reward.
- Due to its generality, the problem is studied in many other disciplines, such as: game theory, control theory, operations research, information theory, simulation-based optimization, multi-agent systems, swarm intelligence, statistics, and genetic algorithms.

Formalization of Reinforcement Learning
- Reinforcement learning consists of:
  a set of agent states, st ∈ S;
  a set of agent actions, at ∈ A;
  a transition model over states and actions, T(st, at, st+1);
  a reward function, R(st, at, st+1).
- The goal is to find a policy, π(st).
- T and R are not known in advance, i.e. the agent does not know which states are good or what the actions do; it must actually try actions and states out in order to learn.

Supervised vs. Unsupervised vs. Reinforcement Learning
- Supervised learning: input/output pairs are presented as labeled data (training examples). Learning by examples.
- Unsupervised learning: finds the structure hidden in collections of unlabeled data. Learning by itself.
- Reinforcement learning: input/output pairs are never presented; the focus is on online performance. Online learning.

11.3.2. Types of Reinforcement Learning

Types of Reinforcement Learning
- 1) Model-based: building a model of the environment. First act in the Markov decision process (MDP) and learn T and R; then do value iteration or policy iteration with the learned T and R.
- 2) Model-free: learning a policy without any model. Bypass the need to learn T and R by evaluating the policy directly, with prediction-based temporal difference (TD) methods.

1) Model-based Reinforcement Learning
- Idea: learn the model empirically through experience, and solve for values as if the learned model were correct.
- Simple empirical model learning:
  count the outcomes for each (s, a);
  normalize to give an estimate of T(st, at, st+1);
  discover R(st, at, st+1) when we experience (st, at, st+1);
  solve the Markov decision process with the learned model.
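The recipe above (count, normalize, then plan with the learned model) can be made concrete with a small sketch. The following Python is illustrative only: it assumes a tiny finite MDP whose experience is logged as (s, a, r, s') transitions, and the names estimate_model and value_iteration are hypothetical, not from the slides.

# Minimal model-based RL sketch on a small finite MDP (illustrative only).
from collections import defaultdict

def estimate_model(transitions):
    """transitions: iterable of (s, a, r, s_next) tuples actually experienced."""
    counts = defaultdict(lambda: defaultdict(float))   # (s, a) -> {s_next: count}
    R = {}                                             # (s, a, s_next) -> observed reward
    for s, a, r, s_next in transitions:
        counts[(s, a)][s_next] += 1.0                  # count outcomes for each (s, a)
        R[(s, a, s_next)] = r                          # discover R when experienced
    T = {}                                             # (s, a) -> {s_next: probability}
    for sa, outcomes in counts.items():
        total = sum(outcomes.values())
        T[sa] = {s_next: c / total for s_next, c in outcomes.items()}  # normalize
    return T, R

def value_iteration(T, R, states, actions, gamma=0.9, iters=100):
    """Solve the MDP as if the learned model were correct."""
    V = {s: 0.0 for s in states}
    for _ in range(iters):
        new_V = {}
        for s in states:
            action_values = []
            for a in actions:
                q = sum(p * (R.get((s, a, s2), 0.0) + gamma * V.get(s2, 0.0))
                        for s2, p in T.get((s, a), {}).items())
                action_values.append(q)
            new_V[s] = max(action_values) if action_values else 0.0
        V = new_V
    return V

Acting greedily with respect to the returned values on the estimated model then gives the learned policy, which is only as good as the experience used to estimate T and R.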
2) Model-free Reinforcement Learning

Actor-critic methods
- The TD version of policy iteration (on-policy).
- A structure that explicitly represents the policy independently of the value function.
- The policy (the actor) is used to select actions.
- The value function (the critic) is used to evaluate the actions made by the actor.
- TD error: δt = rt+1 + γ V(st+1) − V(st)
- Preference update: p(st, at) ← p(st, at) + β δt
- [Figure: actor-critic architecture. The actor (policy) selects an action a from the current state; the environment returns a reward r and the next state; the critic (value function) computes the TD error, which criticizes the actions and drives the updates of both actor and critic.]

Q-learning
- The TD version of value iteration (off-policy).
- Incrementally estimates Q-values for actions, based on rewards and the Q-value function.
- The update rule is a variation of TD learning, using Q-values and a built-in max operator over the Q-values of the next state:
  Q(st, at) ← Q(st, at) + α [rt+1 + γ maxa Q(st+1, a) − Q(st, at)]
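As a companion to the update rule above, here is a minimal tabular Q-learning sketch. It assumes a generic environment interface, reset() returning a state, step(a) returning (next_state, reward, done), and a finite action list env.actions; these names are illustrative assumptions, not part of the lecture.

# Minimal tabular Q-learning sketch (off-policy TD control); interface names are assumptions.
import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = defaultdict(float)                              # (state, action) -> Q-value
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy behavior policy: explore sometimes, otherwise act greedily
            if random.random() < epsilon:
                a = random.choice(env.actions)
            else:
                a = max(env.actions, key=lambda act: Q[(s, act)])
            s_next, r, done = env.step(a)
            # built-in max over the Q-values of the next state (zero at terminal states)
            best_next = 0.0 if done else max(Q[(s_next, act)] for act in env.actions)
            # Q(st, at) <- Q(st, at) + alpha * [r + gamma * max_a Q(st+1, a) - Q(st, at)]
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q

Because the target uses the max over next-state Q-values regardless of which action the behavior policy actually takes next, the method is off-policy, which is exactly the distinction the slide draws against the on-policy actor-critic scheme.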