
Algorithms for Reinforcement Learning

Draft of the lecture published in the Synthesis Lectures on Artificial Intelligence and Machine Learning series by Morgan & Claypool Publishers

Csaba Szepesvári

June 9, 2009
Last update: May 18, 2013

Contents

1 Overview
2 Markov decision processes
  2.1 Preliminaries
  2.2 Markov Decision Processes
  2.3 Value functions
  2.4 Dynamic programming algorithms for solving MDPs
3 Value prediction problems
  3.1 Temporal difference learning in finite state spaces
    3.1.1 Tabular TD(0)
    3.1.2 Every-visit Monte-Carlo
    3.1.3 TD(λ): Unifying Monte-Carlo and TD(0)
  3.2 Algorithms for large state spaces
    3.2.1 TD(λ) with function approximation
    3.2.2 Gradient temporal difference learning
    3.2.3 Least-squares methods
    3.2.4 The choice of the function space
4 Control
  4.1 A catalog of learning problems
  4.2 Closed-loop interactive learning
    4.2.1 Online learning in bandits
    4.2.2 Active learning in bandits
    4.2.3 Active learning in Markov Decision Processes
    4.2.4 Online learning in Markov Decision Processes
  4.3 Direct methods
    4.3.1 Q-learning in finite MDPs
    4.3.2 Q-learning with function approximation
  4.4 Actor-critic methods
    4.4.1 Implementing a critic
    4.4.2 Implementing an actor
5 For further exploration
  5.1 Further reading
  5.2 Applications
  5.3 Software
  5.4 Acknowledgements
A The theory of discounted Markovian decision processes
  A.1 Contractions and Banach's fixed-point theorem
  A.2 Application to MDPs

Abstract

Reinforcement learning is a learning paradigm concerned with learning to control a system so as to maximize a numerical performance measure that expresses a long-term objective. What distinguishes reinforcement learning from supervised learning is that only partial feedback is given to the learner about the learner's predictions. Further, the predictions may have long-term effects through influencing the future state of the controlled system. Thus, time plays a special role. The goal in reinforcement learning is to develop efficient learning algorithms, as well as to understand the algorithms' merits and limitations. Reinforcement learning is of great interest because of the large number of practical applications that it can be used to address, ranging from problems in artificial intelligence to operations research or control engineering. In this book, we focus on those algorithms of reinforcement learning that build on the powerful theory of dynamic programming. We give a fairly comprehensive catalog of learning problems, describe the core ideas together with a large number of state-of-the-art algorithms, followed by the discussion of their theoretical properties and limitations.

Keywords: reinforcement learning; Markov Decision Processes; temporal difference learning; stochastic approximation; two-timescale stochastic approximation; Monte-Carlo methods; simulation optimization; function approximation; stochastic gradient methods; least-squares methods; overfitting; bias-variance tradeoff; online learning; active learning; planning; simulation; PAC-learning; Q-learning; actor-critic methods; policy gradient; natural gradient

1 Overview

Reinforcement learning (RL) refers to both a learning problem and a subfield of machine learning. As a learning problem, it refers to learning to control a system so as to maximize some numerical value which represents a long-term objective. A typical setting where reinforcement learning operates is shown in Figure 1: A controller receives the controlled system's state and a reward associated with the last state transition. It then calculates an action which is sent back to the system. In response, the system makes a transition to a new state and the cycle is repeated. The problem is to learn a way of controlling the system so as to maximize the total reward. The learning problems differ in the details of how the data is collected and how performance is measured.

[Figure 1: The basic reinforcement learning scenario]

In this book, we assume that the system that we wish to control is stochastic. Further, we assume that the measurements available on the system's state are detailed enough so that the controller can avoid reasoning about how to collect information about the state. Problems with these characteristics are best described in the framework of Markovian Decision Processes (MDPs). The standard approach to 'solve' MDPs is to use dynamic programming, which transforms the problem of finding a good controller into the problem of finding a good value function. However, apart from the simplest cases when the MDP has very few states and actions, dynamic programming is infeasible. The RL algorithms that we discuss here can be thought of as a way of turning the infeasible dynamic programming methods into practical algorithms so that they can be applied to large-scale problems.

There are two key ideas that allow RL algorithms to achieve this goal. Th
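The interaction cycle described above (state and reward in, action out; repeated indefinitely) can be sketched as a minimal control loop. The toy two-state system and the random placeholder controller below are hypothetical stand-ins, not part of the book; they only illustrate the wiring of Figure 1.

```python
import random

class ToySystem:
    """A made-up two-state stochastic system (for illustration only)."""

    def __init__(self):
        self.state = 0

    def step(self, action):
        # Stochastic transition: action 1 biases the system toward state 1.
        p = 0.8 if action == 1 else 0.2
        self.state = 1 if random.random() < p else 0
        reward = 1.0 if self.state == 1 else 0.0  # reward of this transition
        return self.state, reward

class RandomController:
    """Placeholder policy; an RL algorithm would learn to act here."""

    def act(self, state, reward):
        return random.choice([0, 1])

def run(num_steps=1000, seed=0):
    random.seed(seed)
    system, controller = ToySystem(), RandomController()
    state, reward, total = system.state, 0.0, 0.0
    for _ in range(num_steps):
        action = controller.act(state, reward)  # controller computes an action
        state, reward = system.step(action)     # system transitions, emits reward
        total += reward                         # objective: total reward
    return total
```

Replacing `RandomController` with a learning controller, so that `total` grows as large as possible, is exactly the learning problem the book studies.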
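The dynamic-programming route mentioned above, computing a good value function and then reading a controller off it, can be illustrated with value iteration on a tiny tabular MDP. The transition table, rewards, and discount factor below are made-up examples, not taken from the book; they only show why the method breaks down when the state set is large (the tables must enumerate every state).

```python
# P[s][a] = list of (next_state, probability); R[s][a] = expected reward.
# A hypothetical 2-state, 2-action MDP where state 1 is rewarding.
P = {0: {0: [(0, 0.9), (1, 0.1)], 1: [(0, 0.2), (1, 0.8)]},
     1: {0: [(0, 0.5), (1, 0.5)], 1: [(1, 1.0)]}}
R = {0: {0: 0.0, 1: 0.0}, 1: {0: 1.0, 1: 1.0}}
GAMMA = 0.9  # discount factor (assumed)

def value_iteration(tol=1e-8):
    """Iterate the Bellman optimality update until the values stop changing."""
    V = {s: 0.0 for s in P}
    while True:
        V_new = {s: max(R[s][a] + GAMMA * sum(p * V[s2] for s2, p in P[s][a])
                        for a in P[s])
                 for s in P}
        if max(abs(V_new[s] - V[s]) for s in P) < tol:
            return V_new
        V = V_new

def greedy_policy(V):
    """Read a controller off the value function by one-step lookahead."""
    return {s: max(P[s],
                   key=lambda a: R[s][a]
                   + GAMMA * sum(p * V[s2] for s2, p in P[s][a]))
            for s in P}
```

Here the greedy policy steers toward the rewarding state from both states; the RL algorithms in the book can be seen as sample-based, approximate versions of this computation for problems where `P` and `R` are too large to tabulate or are unknown.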
