Algorithms for Reinforcement Learning

Draft of the lecture published in the Synthesis Lectures on Artificial Intelligence and Machine Learning series by Morgan & Claypool Publishers

Csaba Szepesvári

June 9, 2009
Last update: May 18, 2013

Contents

1 Overview
2 Markov decision processes
   2.1 Preliminaries
   2.2 Markov Decision Processes
   2.3 Value functions
   2.4 Dynamic programming algorithms for solving MDPs
3 Value prediction problems
   3.1 Temporal difference learning in finite state spaces
       3.1.1 Tabular TD(0)
       3.1.2 Every-visit Monte-Carlo
       3.1.3 TD(λ): Unifying Monte-Carlo and TD(0)
   3.2 Algorithms for large state spaces
       3.2.1 TD(λ) with function approximation
       3.2.2 Gradient temporal difference learning
       3.2.3 Least-squares methods
       3.2.4 The choice of the function space
4 Control
   4.1 A catalog of learning problems
   4.2 Closed-loop interactive learning
       4.2.1 Online learning in bandits
       4.2.2 Active learning in bandits
       4.2.3 Active learning in Markov Decision Processes
       4.2.4 Online learning in Markov Decision Processes
   4.3 Direct methods
       4.3.1 Q-learning in finite MDPs
       4.3.2 Q-learning with function approximation
   4.4 Actor-critic methods
       4.4.1 Implementing a critic
       4.4.2 Implementing an actor
5 For further exploration
   5.1 Further reading
   5.2 Applications
   5.3 Software
   5.4 Acknowledgements
A The theory of discounted Markovian decision processes
   A.1 Contractions and Banach's fixed-point theorem
   A.2 Application to MDPs

Abstract

Reinforcement learning is a learning paradigm concerned with learning to control a system so as to maximize a numerical performance measure that expresses a long-term objective. What distinguishes reinforcement learning from supervised learning is that only partial feedback is given to the learner about the learner's predictions. Further, the predictions may have long-term effects through influencing the future state of the controlled system. Thus, time plays a special role. The goal in reinforcement learning is to develop efficient learning algorithms, as well as to understand the algorithms' merits and limitations. Reinforcement learning is of great interest because of the large number of practical applications that it can be used to address, ranging from problems in artificial intelligence to operations research or control engineering. In this book, we focus on those algorithms of reinforcement learning that build on the powerful theory of dynamic programming. We give a fairly comprehensive catalog of learning problems, describe the core ideas together with a large number of state-of-the-art algorithms, followed by the discussion of their theoretical properties and limitations.

Keywords: reinforcement learning; Markov Decision Processes; temporal difference learning; stochastic approximation; two-timescale stochastic approximation; Monte-Carlo methods; simulation optimization; function approximation; stochastic gradient methods; least-squares methods; overfitting; bias-variance tradeoff; online learning; active learning; planning; simulation; PAC-learning; Q-learning; actor-critic methods; policy gradient; natural gradient

1 Overview

Reinforcement learning (RL) refers to both a learning problem and a subfield of machine learning. As a learning problem, it refers to learning to control a system so as to maximize some numerical value which represents a long-term objective. A typical setting where reinforcement learning operates is shown in Figure 1: a controller receives the controlled system's state and a reward associated with the last state transition. It then calculates an action, which is sent back to the system. In response, the system makes a transition to a new state and the cycle is repeated. The problem is to learn a way of controlling the system so as to maximize the total reward. The learning problems differ in the details of how the data is collected and how performance is measured.

Figure 1: The basic reinforcement learning scenario
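To make the interaction protocol of Figure 1 concrete, below is a minimal, self-contained Python sketch of the control loop. The names TwoStateSystem, RandomController, and run are hypothetical, invented for this illustration only (they come from no library and from no algorithm discussed later): a made-up stochastic system with two states is driven by a baseline controller that ignores the state and acts uniformly at random.

import random

class TwoStateSystem:
    """A made-up stochastic system with states {0, 1} and actions {0, 1}."""

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # Action 1 makes a transition to state 1 likely; state 1 pays reward 1.
        p = 0.9 if action == 1 else 0.1
        self.state = 1 if random.random() < p else 0
        reward = 1.0 if self.state == 1 else 0.0
        return self.state, reward

class RandomController:
    """Baseline controller: ignores the state, acts uniformly at random."""

    def act(self, state):
        return random.choice([0, 1])

    def update(self, state, action, reward, next_state):
        pass  # a learning controller would adjust its behavior here

def run(system, controller, num_steps):
    # The cycle of Figure 1: observe state and reward, send back an action.
    state = system.reset()
    total_reward = 0.0
    for _ in range(num_steps):
        action = controller.act(state)            # controller chooses an action
        next_state, reward = system.step(action)  # system transitions stochastically
        controller.update(state, action, reward, next_state)
        total_reward += reward
        state = next_state
    return total_reward

print(run(TwoStateSystem(), RandomController(), num_steps=1000))

Over 1000 steps the random controller collects about 500 reward in expectation, while always sending action 1 would collect about 900; learning controllers that close such gaps from observed transitions alone are what the rest of the book is about.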
In this book, we assume that the system that we wish to control is stochastic. Further, we assume that the measurements available on the system's state are detailed enough so that the controller can avoid reasoning about how to collect information about the state. Problems with these characteristics are best described in the framework of Markovian Decision Processes (MDPs). The standard approach to 'solve' MDPs is to use dynamic programming, which transforms the problem of finding a good controller into the problem of finding a good value function. However, apart from the simplest cases when the MDP has very few states and actions, dynamic programming is infeasible. The RL algorithms that we discuss here can be thought of as a way of turning the infeasible dynamic programming methods into practical algorithms so that they can be applied to large-scale problems.

There are two key ideas that allow RL algorithms to achieve this goal. Th