Average cost temporal-difference learning

LIDS-P-2390
May 1997

Average Cost Temporal-Difference Learning¹

John N. Tsitsiklis and Benjamin Van Roy
Laboratory for Information and Decision Systems
Massachusetts Institute of Technology
Cambridge, MA 02139
e-mail: jnt@mit.edu, bvr@mit.edu

¹ This research was supported by the NSF under grant DMI-9625489 and the AFOSR under grant F49620-95-1-0219.

ABSTRACT

We propose a variant of temporal-difference learning that approximates average and differential costs of an irreducible aperiodic Markov chain. Approximations are comprised of linear combinations of fixed basis functions whose weights are incrementally updated during a single endless trajectory of the Markov chain. We present a proof of convergence (with probability 1), and a characterization of the limit of convergence. We also provide a bound on the resulting approximation error that exhibits an interesting dependence on the "mixing time" of the Markov chain. The results parallel previous work by the authors, involving approximations of discounted cost-to-go.

1 Introduction

Temporal-difference learning, originally proposed by Sutton (1988), is an algorithm for approximating the cost-to-go function of a Markov chain (the expected future cost, as a function of the initial state). Given a set of basis functions, the algorithm tunes a vector of weights so that the weighted combination of the basis functions approximates the cost-to-go function. The weights are iteratively adapted based on information drawn from either simulation or observation of a Markov process. Updates occur upon each state transition with the objective of improving the approximation as time progresses.

The reason for our interest in cost-to-go functions is their central role in dynamic programming algorithms for solving Markov decision problems. In particular, given the cost-to-go function associated with a given control policy, one can perform an iteration of the classical policy iteration method to obtain an improved policy (Bertsekas, 1995). However, when the state space is large, the exact computation of the cost-to-go function becomes infeasible, and this is why we are interested in approximations.

A comprehensive convergence analysis for the case of discounted Markov chains has been provided by the authors (Tsitsiklis and Van Roy, 1997). A simplified version of that work, together with extensions to the case of undiscounted absorbing Markov chains, is presented in (Bertsekas and Tsitsiklis, 1995). Related analyses are given by (Sutton, 1988), (Dayan, 1992), (Gurvits et al., 1994), and (Pineda, 1996). The purpose of the present paper is to propose a variant of temporal-difference learning that is suitable for approximating differential cost functions of undiscounted Markov chains (i.e., solutions to Poisson's equation), and to extend the analysis of (Tsitsiklis and Van Roy, 1997) to this new context.
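For concreteness, Poisson's equation for an irreducible aperiodic finite Markov chain can be stated as follows (the notation here is generic and may differ from that used later in the paper): with transition probabilities $p_{xy}$, per-stage costs $g(x)$, stationary distribution $\pi$, and average cost $\mu^* = \sum_x \pi(x) g(x)$, a differential cost function $J$ satisfies

$$ \mu^* + J(x) \;=\; g(x) + \sum_{y} p_{xy}\, J(y), \qquad \text{for all states } x, $$

and is determined only up to an additive constant.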
The contributions of this paper include the following:

1. A temporal-difference learning algorithm that approximates differential cost functions is proposed.

2. Convergence (with probability 1) is established for the case where approximations are generated by linear combinations of basis functions over a finite state space.

3. The limit of convergence is characterized as the solution to a set of interpretable linear equations, and a bound is placed on the resulting approximation error. Furthermore, a relationship between the error bound and the "mixing time" of the Markov chain is identified.

This paper is not the first to consider simulation-based methods that iteratively evaluate differential cost functions. However, the algorithms that have been explored in this context generally make use of look-up table representations, which involve storing and updating one value per state in the state space. We refer the reader to (Mahadevan, 1996) for a survey of relevant experimental work and to (Abounadi, Bertsekas, and Borkar, 1997) for a theoretical treatment.

There is no prior work explicitly dealing with approximations of differential cost functions. It is known that the differential cost function of an infinite horizon Markov chain is the same as the cost-to-go function of an auxiliary absorbing Markov chain (Bertsekas, 1995; Bertsekas and Tsitsiklis, 1996). This relationship motivates one way of using temporal-difference learning to approximate a differential cost function, namely, deriving the auxiliary absorbing Markov chain and then employing an existing version of temporal-difference learning. However, this reduction can affect approximations in undesirable ways, as we discuss next.

In temporal-difference learning, each weight update depends on a history of visited states. When temporal-difference learning is applied to an absorbing Markov chain, multiple finite trajectories (each terminating at an absorbing state) are simulated. Weight updates occur during these simulations, and the history of visited states is erased upon the termination of each trajectory. Even though restarting the record of visited states is appropriate for an absorbing Markov chain, it is unnatural for the original infinite horizon Markov chain. Due to this peculiarity introduced by the reduction, it is preferable to use a variant of temporal-difference learning designed specifically for approximating differential cost functions, such as the one we will introduce in this paper.

The remainder of this paper is organized as follows. In Section 2, we provide a precise definition of the algorithm. Section 3 presents our convergence result, together with assumptions and a proof. In Section 4, we develop a bound for the approximation error associated with the limit of convergence. Section 5 presents and analyzes another variant of temporal-difference learning. Some new insights that stem from the analysis are also discussed. Finally, concluding remarks are made in Section 6.

2 Average Cost Temporal-Difference Learning
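Section 2 of the paper defines the algorithm precisely. Purely for orientation, the sketch below shows one plausible shape such an update rule could take, consistent with the abstract: a linear approximation of the differential cost built from fixed basis functions, an auxiliary running estimate of the average cost, and incremental updates on every transition of a single endless trajectory. All names, the eligibility-trace form, and the step-size choices here are illustrative assumptions rather than the paper's definitions.

import numpy as np

def average_cost_td_sketch(sample_transition, phi, num_features,
                           lam=0.9, num_steps=100_000, x0=0):
    """Illustrative average-cost TD(lambda)-style recursion.

    This is only a sketch consistent with the abstract's description, not the
    algorithm as defined in Section 2 of the paper.

    sample_transition(x) -> (cost, next_state): simulates one step of the chain.
    phi(x) -> NumPy array of length num_features (the fixed basis functions).
    """
    r = np.zeros(num_features)   # weights: differential cost approximated by phi(x) @ r
    z = np.zeros(num_features)   # eligibility trace
    mu = 0.0                     # running estimate of the average cost
    x = x0

    for t in range(1, num_steps + 1):
        step = 1.0 / t           # one simple diminishing step-size choice
        cost, x_next = sample_transition(x)

        # temporal difference, measured relative to the current average-cost estimate
        d = cost - mu + phi(x_next) @ r - phi(x) @ r

        z = lam * z + phi(x)              # trace accumulated along the single endless trajectory
        r = r + step * d * z              # incremental update of the basis-function weights
        mu = mu + step * (cost - mu)      # incremental update of the average-cost estimate

        x = x_next

    return r, mu

With indicator (one-hot) features the recursion above reduces to a look-up table method of the kind surveyed by Mahadevan (1996); the paper's interest is in the general case of compact linear architectures over large state spaces.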
