reinforcement-learning.ppt


Reinforcement Learning
Peter Bodík, cs294-34

Previous Lectures
• Supervised learning – classification, regression
• Unsupervised learning – clustering, dimensionality reduction
• Reinforcement learning – generalization of supervised learning – learn from interaction with an environment to achieve a goal
• (figure: agent–environment loop – the agent takes an action, the environment returns a reward and a new state)

Today
• examples
• defining a Markov Decision Process – solving an MDP using Dynamic Programming
• Reinforcement Learning – Monte Carlo methods – Temporal-Difference learning
• miscellaneous – state representation, function approximation, rewards

Robot in a room
• (figure: 4x3 grid with a START cell and terminal cells marked +1 and -1)
• actions: UP, DOWN, LEFT, RIGHT
• UP: 80% move UP, 10% move LEFT, 10% move RIGHT
• reward +1 at [4,3], -1 at [4,2]
• reward -0.04 for each step
• what's the strategy to achieve max reward?
• what if the actions were deterministic?

Other examples
• pole-balancing
• walking robot (applet)
• TD-Gammon [Gerry Tesauro]
• helicopter [Andrew Ng]
• no teacher who would say "good" or "bad" – is a reward of "10" good or bad? – rewards could be delayed
• explore the environment and learn from the experience – not just blind search, try to be smart about it

Outline
• examples
• defining a Markov Decision Process – solving an MDP using Dynamic Programming
• Reinforcement Learning – Monte Carlo methods – Temporal-Difference learning
• miscellaneous – state representation, function approximation, rewards

Robot in a room
• actions: UP, DOWN, LEFT, RIGHT; UP: 80% move UP, 10% move LEFT, 10% move RIGHT
• reward +1 at [4,3], -1 at [4,2]; reward -0.04 for each step
• states, actions, rewards
• what is the solution?

Is this a solution?
• only if the actions are deterministic – not in this case (actions are stochastic)
• a solution is a policy – a mapping from each state to an action

Optimal policy
• (grid figures) the optimal policy changes with the reward for each step: -2, -0.1, -0.04, -0.01, +0.01

Markov Decision Process (MDP)
• set of states S, set of actions A, initial state S0
• transition model P(s'|s,a) – e.g. P([1,2]|[1,1],UP) = 0.8 – Markov assumption
• reward function r(s) – e.g. r([4,3]) = +1
• goal: maximize cumulative reward in the long run
• policy: mapping from S to A – π(s) or π(s,a)
• reinforcement learning – transitions and rewards usually not available – how to change the policy based on experience – how to explore the environment

Computing return from rewards
• episodic (vs. continuing) tasks – "game over" after N steps – optimal policy depends on N; harder to analyze
• additive rewards – V(s0, s1, …) = r(s0) + r(s1) + r(s2) + … – infinite value for continuing tasks
• discounted rewards – V(s0, s1, …) = r(s0) + γ·r(s1) + γ²·r(s2) + … – value bounded if the rewards are bounded

Value functions
• state value function Vπ(s) – expected return when starting in s and following π
• state-action value function Qπ(s,a) – expected return when starting in s, performing a, and following π
• useful for finding the optimal policy – can be estimated from experience – pick the best action using Qπ(s,a)
• Bellman equation: Vπ(s) = r(s) + γ · Σs' P(s'|s,π(s)) · Vπ(s')  (backup over s, a, s', r)

Optimal value functions
• there is a set of optimal policies – Vπ defines a partial ordering on policies – they all share the same optimal value function
• Bellman optimality equation: V*(s) = r(s) + γ · maxa Σs' P(s'|s,a) · V*(s') – a system of n non-linear equations – solve for V*(s) – easy to extract the optimal policy
• having Q*(s,a) makes it even simpler

Outline
• examples
• defining a Markov Decision Process – solving an MDP using Dynamic Programming
• Reinforcement Learning – Monte Carlo methods – Temporal-Difference learning
• miscellaneous – state representation, function approximation, rewards

Dynamic programming
• main idea – use value functions to structure the search for good policies – needs a perfect model of the environment
• two main components – policy evaluation: compute Vπ from π – policy improvement: improve π based on Vπ
• start with an arbitrary policy – repeat evaluation/improvement until convergence

Policy evaluation / improvement
• policy evaluation: π → Vπ – the Bellman equations define a system of n equations – could solve it directly, but we use the iterative version – start with an arbitrary value function V0 and iterate until Vk converges
• policy improvement: Vπ → π' – π' is either strictly better than π, or π' is optimal (if π = π')
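The evaluation/improvement loop above can be made concrete on the running example. Below is a minimal sketch (not from the slides) of iterative policy evaluation plus greedy policy improvement on the robot-in-a-room grid, assuming the 80/10/10 slip model and the -0.04 per-step reward; the blocked cell at [2,2], the START cell at [1,1], the discount of 0.99, and all names are illustrative assumptions.

```python
# Minimal sketch (not from the slides): iterative policy evaluation and greedy
# policy improvement on the robot-in-a-room grid.  Assumptions: 80/10/10 slip
# model, -0.04 per-step reward, +1/-1 terminals, a blocked cell at [2,2], and a
# discount just below 1 so every policy has a bounded value.

COLS, ROWS = 4, 3                                  # states indexed [col,row], 1-based
TERMINALS = {(4, 3): +1.0, (4, 2): -1.0}
WALLS = {(2, 2)}                                   # assumed blocked cell
STEP_REWARD, GAMMA = -0.04, 0.99
ACTIONS = {"UP": (0, 1), "DOWN": (0, -1), "LEFT": (-1, 0), "RIGHT": (1, 0)}
SIDES = {"UP": ("LEFT", "RIGHT"), "DOWN": ("LEFT", "RIGHT"),
         "LEFT": ("UP", "DOWN"), "RIGHT": ("UP", "DOWN")}
STATES = [(c, r) for c in range(1, COLS + 1) for r in range(1, ROWS + 1)
          if (c, r) not in WALLS]

def move(s, a):
    """Deterministic effect of one move; bumping into a wall or the edge stays put."""
    nxt = (s[0] + ACTIONS[a][0], s[1] + ACTIONS[a][1])
    return nxt if nxt in STATES else s

def transitions(s, a):
    """(probability, next state) pairs: 80% intended direction, 10% each side."""
    side1, side2 = SIDES[a]
    return [(0.8, move(s, a)), (0.1, move(s, side1)), (0.1, move(s, side2))]

def reward(s):
    return TERMINALS.get(s, STEP_REWARD)

def evaluate(policy, theta=1e-6):
    """Iterative policy evaluation: start from V0 = 0 and sweep until convergence."""
    V = {s: 0.0 for s in STATES}
    while True:
        delta = 0.0
        for s in STATES:
            v_new = reward(s) if s in TERMINALS else reward(s) + GAMMA * sum(
                p * V[s2] for p, s2 in transitions(s, policy[s]))
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V

def improve(V):
    """Greedy policy improvement with respect to V."""
    return {s: max(ACTIONS, key=lambda a: sum(p * V[s2] for p, s2 in transitions(s, a)))
            for s in STATES if s not in TERMINALS}

# repeat evaluation / improvement until the policy stops changing
policy = {s: "UP" for s in STATES if s not in TERMINALS}
while True:
    V = evaluate(policy)
    improved = improve(V)
    if improved == policy:
        break
    policy = improved
print(policy[(1, 1)], round(V[(1, 1)], 3))         # action and value at the assumed START cell
```

Value iteration (next slide) collapses the two steps into a single sweep that applies the Bellman optimality update V(s) ← r(s) + γ · maxa Σ p·V(s') directly.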
Policy / Value iteration
• Policy iteration – two nested iterations; too slow – don't need to converge to Vk, just move towards it
• Value iteration – use the Bellman optimality equation as an update – converges to V*

Using DP
• needs a complete model of the environment and the rewards – for the robot in a room: state space, action space, transition model
• can we use DP to solve – the robot in a room? – backgammon? – a helicopter?
• DP bootstraps – updates estimates on the basis of other estimates

Outline
• examples
• defining a Markov Decision Process – solving an MDP using Dynamic Programming
• Reinforcement Learning – Monte Carlo methods – Temporal-Difference learning
• miscellaneous – state representation, function approximation, rewards

Monte Carlo methods
• don't need full knowledge of the environment – just experience, or simulated experience
• averaging sample returns – defined only for episodic tasks
• but similar to DP – policy evaluation, policy improvement

Monte Carlo policy evaluation
• want to estimate Vπ(s) = expected return starting from s and following π – estimate it as the average of the observed returns in state s
• first-visit MC – average the returns following the first visit to state s
• example (figure): one episode visits s and then collects rewards +1, -2, 0, +1, -3, +5, so R1(s) = +2; three more episodes give R2(s) = +1, R3(s) = -5, R4(s) = +4
• Vπ(s) ≈ (2 + 1 - 5 + 4) / 4 = 0.5

Monte Carlo control
• Vπ alone is not enough for policy improvement – that would need an exact model of the environment
• estimate Qπ(s,a) instead
• MC control – update after each episode
• non-stationary environment
• a problem – a greedy policy won't explore all actions

Maintaining exploration
• a key ingredient of RL
• a deterministic/greedy policy won't explore all actions – we don't know anything about the environment at the beginning …
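The last slide is cut off in this preview. A standard way to keep exploring (an assumption here, not taken from the deck) is an ε-greedy policy. Below is a minimal sketch of on-policy first-visit Monte Carlo control with ε-greedy action selection on a made-up five-state corridor task; the environment, constants, and names are all illustrative.

```python
# Minimal sketch (not from the slides): first-visit Monte Carlo control with an
# epsilon-greedy policy to keep exploring.  The environment is a made-up
# five-state corridor (terminals at both ends); every name and constant here is
# an illustrative assumption.
import random
from collections import defaultdict

LEFT, RIGHT = -1, +1
ACTIONS = [LEFT, RIGHT]
START, PIT, GOAL = 2, 0, 4
STEP_REWARD, GAMMA, EPSILON = -0.04, 1.0, 0.1

def step(state, action):
    """Stochastic step: 80% the intended direction, 20% the opposite one."""
    direction = action if random.random() < 0.8 else -action
    nxt = state + direction
    if nxt == GOAL:
        return nxt, +1.0, True
    if nxt == PIT:
        return nxt, -1.0, True
    return nxt, STEP_REWARD, False

def epsilon_greedy(Q, state):
    """Maintaining exploration: random action with prob. EPSILON, greedy otherwise."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def run_episode(Q):
    """Generate one episode (list of (state, action, reward)) under the epsilon-greedy policy."""
    trajectory, state, done = [], START, False
    while not done:
        action = epsilon_greedy(Q, state)
        next_state, r, done = step(state, action)
        trajectory.append((state, action, r))
        state = next_state
    return trajectory

Q = defaultdict(float)
returns = defaultdict(list)                 # (state, action) -> list of sampled returns

for _ in range(20000):                      # MC control: update after each episode
    trajectory = run_episode(Q)
    G, return_from = 0.0, {}
    for t in range(len(trajectory) - 1, -1, -1):
        G = trajectory[t][2] + GAMMA * G    # return following step t, computed backwards
        return_from[t] = G
    seen = set()                            # first-visit: only the first occurrence counts
    for t, (s, a, _) in enumerate(trajectory):
        if (s, a) not in seen:
            seen.add((s, a))
            returns[(s, a)].append(return_from[t])
            Q[(s, a)] = sum(returns[(s, a)]) / len(returns[(s, a)])

greedy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in (1, 2, 3)}
print(greedy)                               # expected to prefer RIGHT (+1) in every state
```

The all-returns average matches the slide's first-visit definition of MC policy evaluation; an incremental mean would avoid storing the per-pair return lists.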
