Principles of Artificial Intelligence (Peking University) - Part IV: Planning - Chapter 8: Classic and Real-world Planning

School of Electronic and Computer Engineering, Peking University
Wang Wenmin, Artificial Intelligence

8.5. Decision-theoretic Planning

What is Decision-theoretic Planning

Classic planning is to find a plan that achieves its goals with the lowest cost. Decision-theoretic planning is to find a plan that achieves its goals with the maximum expected utility (MEU).

[Figure: the agent-environment loop: the agent receives percepts from the environment through its sensors, acts on the environment through its actuators, and receives a reward.]

Example: Grid World

The agent lives in a grid, and walls block the agent's path. Movement is stochastic.

Transition model:
- probability 0.8: the agent moves up (the intended direction);
- probability 0.1 each: the agent moves right or left instead;
- no movement: if there is a wall in that direction.

Rewards:
- +1 and -1: the two terminal states;
- -0.04: every other non-terminal state.

Goal: maximize the sum of rewards.

[Figure: (a) the 4x3 grid with the start state, the +1 terminal state, and the -1 pit; (b) the 0.8 / 0.1 / 0.1 transition probabilities.]

How to Formalize and Solve

How to formalize the problems of decision-theoretic planning? As a Markov Decision Process (MDP).
How to solve the problems of a Markov Decision Process? With dynamic programming.

Contents
8.5.1. Markov Decision Process
8.5.2. Dynamic Programming

8.5.1. Markov Decision Process

A Markov Decision Process (MDP) is a discrete-time stochastic control process, meaning that action outcomes depend only on the current state.

An MDP is a 5-tuple (S, A, T, R, γ), where:
- S is a set of states, s ∈ S;
- A is a set of actions, a ∈ A;
- T is a transition model, T(s, a, s'): the probability that action a taken in state s leads to s', i.e., P(s'|s, a);
- R is a reward function, R(s, a, s');
- γ ∈ [0, 1] is a discount factor.

Core Problem

The core problem of classical planning: the agent is in a deterministic environment, and solving the problem means finding a plan that achieves its goal. The core problem of a Markov Decision Process: the agent is in a discrete-time stochastic environment, and solving the problem means finding a policy that controls its process. Finding a policy is the core problem in solving MDPs.

Given an MDP (S, A, T, R, γ), a policy is a computable function π that outputs an action a for each state s.

A deterministic policy π is defined as:

    π: S → A

A stochastic policy π can also be defined as:

    π: S × A → [0, 1], where π(s, a) ≥ 0 and ∑_a π(s, a) = 1

The goal is to choose a policy π that maximizes some cumulative function of the random rewards. The grid world above can be written out directly as such a 5-tuple, as in the sketch below.
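As an illustration, the following Python sketch writes the grid-world example out as an MDP 5-tuple. The state encoding, the names STATES, TERMINALS, reward, and transition, and the value γ = 0.99 are assumptions made for this sketch; they do not come from the slides.

```python
# A minimal sketch: the 4x3 grid world as an MDP 5-tuple (S, A, T, R, gamma).
# The data layout, names, and gamma value are illustrative choices.

GAMMA = 0.99                      # discount factor gamma in [0, 1] (assumed value)
ACTIONS = ['up', 'down', 'left', 'right']

# States are grid cells (col, row); cell (2, 2) is the wall in the classic layout.
STATES = [(c, r) for c in range(1, 5) for r in range(1, 4) if (c, r) != (2, 2)]
TERMINALS = {(4, 3): +1.0, (4, 2): -1.0}      # the +1 and -1 terminal states

def reward(s):
    """R(s): +1 / -1 at the terminal states, -0.04 at every other state."""
    return TERMINALS.get(s, -0.04)

def transition(s, a):
    """T(s, a, s') as a dict {s': P(s'|s, a)}.

    With probability 0.8 the agent moves in the intended direction and with
    probability 0.1 each it slips to one of the two perpendicular directions;
    if the target cell is a wall or off the grid, the agent does not move.
    """
    if s in TERMINALS:                         # terminal states have no successors
        return {}
    deltas = {'up': (0, 1), 'down': (0, -1), 'left': (-1, 0), 'right': (1, 0)}
    perpendicular = {'up': ('left', 'right'), 'down': ('left', 'right'),
                     'left': ('up', 'down'), 'right': ('up', 'down')}
    probs = {}
    for direction, p in [(a, 0.8)] + [(d, 0.1) for d in perpendicular[a]]:
        dc, dr = deltas[direction]
        s2 = (s[0] + dc, s[1] + dr)
        if s2 not in STATES:                   # blocked by a wall or the edge: no movement
            s2 = s
        probs[s2] = probs.get(s2, 0.0) + p
    return probs
```

Representing T(s, a, s') as an explicit dictionary of successor probabilities keeps the solvers in Section 8.5.2 independent of the particular grid layout.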
8.5.2. Dynamic Programming

Utilities and Optimal Policies

In sequential decision problems, preferences are expressed between sequences of states. Usually an additive utility function is used:

    U([s_0, s_1, s_2, …]) = R(s_0) + R(s_1) + R(s_2) + … = ∑_i R(s_i)

The utility of a state (also called its value) is defined as:

    U(s_i) = the expected sum of rewards until termination, assuming optimal actions.

Two algorithms compute optimal policies: value iteration and policy iteration.

1) Value Iteration

Basic idea: calculate the utility of each state, and then use the state utilities to select an optimal action in each state. No explicit π function is used; instead, the value of π is computed implicitly through U(s).

The Bellman equation for utilities is the basis of the value iteration algorithm:

    U(s) = R(s) + γ max_{a∈A(s)} ∑_{s'} P(s'|s, a) U(s')

[Figure: the value iteration algorithm for calculating the utilities of states.]

2) Policy Iteration

Basic idea: alternate the following two phases.

Policy evaluation: given a policy π_i, calculate the utility U_i of each state if π_i were to be executed:

    U_i(s) = R(s) + γ ∑_{s'} P(s'|s, π_i(s)) U_i(s')

Policy improvement: calculate a new MEU (maximum expected utility) policy π_{i+1}, using one-step look-ahead based on U_i:

    π_{i+1}(s) = argmax_{a∈A(s)} ∑_{s'} P(s'|s, a) U_i(s')

[Figure: the policy iteration algorithm for calculating an optimal policy.]

A minimal sketch of both solvers follows.
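The following Python sketch shows both solvers as generic functions, so they can be run against the grid-world MDP sketched in Section 8.5.1. The function names, the convergence tests, and the assumption γ < 1 are illustrative choices, not taken from the slides.

```python
# Hedged sketches of the two dynamic-programming solvers described above.
# Both take the MDP components as plain arguments; see the usage note at the end.

def value_iteration(states, actions, transition, reward, gamma, eps=1e-6):
    """Repeatedly apply the Bellman update
    U(s) = R(s) + gamma * max_a sum_{s'} P(s'|s, a) U(s')."""
    U = {s: 0.0 for s in states}
    while True:
        delta, U_new = 0.0, {}
        for s in states:
            best = max(sum(p * U[s2] for s2, p in transition(s, a).items())
                       for a in actions)
            U_new[s] = reward(s) + gamma * best
            delta = max(delta, abs(U_new[s] - U[s]))
        U = U_new
        if delta < eps * (1 - gamma) / gamma:   # contraction-based bound, assumes gamma < 1
            return U

def greedy_policy(U, states, actions, transition):
    """One-step look-ahead: pi(s) = argmax_a sum_{s'} P(s'|s, a) U(s')."""
    return {s: max(actions,
                   key=lambda a: sum(p * U[s2]
                                     for s2, p in transition(s, a).items()))
            for s in states}

def policy_iteration(states, actions, transition, reward, gamma, eps=1e-6):
    """Alternate policy evaluation and policy improvement until the policy is stable."""
    pi = {s: actions[0] for s in states}        # arbitrary initial policy
    while True:
        # Policy evaluation (iterative):
        # U_i(s) = R(s) + gamma * sum_{s'} P(s'|s, pi_i(s)) U_i(s')
        U = {s: 0.0 for s in states}
        while True:
            U_new = {s: reward(s) + gamma * sum(p * U[s2] for s2, p in
                                                transition(s, pi[s]).items())
                     for s in states}
            done = max(abs(U_new[s] - U[s]) for s in states) < eps
            U = U_new
            if done:
                break
        # Policy improvement: one-step look-ahead based on U_i
        new_pi = greedy_policy(U, states, actions, transition)
        if new_pi == pi:
            return pi, U
        pi = new_pi

# Example usage with the grid-world sketch from Section 8.5.1:
#   U = value_iteration(STATES, ACTIONS, transition, reward, GAMMA)
#   pi, _ = policy_iteration(STATES, ACTIONS, transition, reward, GAMMA)
```

Policy evaluation is done here by simple iteration of the fixed-point equation; for small MDPs it can equally be solved exactly as a system of linear equations, which is the variant usually presented in textbooks.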
