Average cost temporal-difference learning

LIDS-P-2390
May 1997

Average Cost Temporal-Difference Learning¹

John N. Tsitsiklis and Benjamin Van Roy
Laboratory for Information and Decision Systems
Massachusetts Institute of Technology
Cambridge, MA 02139
e-mail: jnt@mit.edu, bvr@mit.edu

¹ This research was supported by the NSF under grant DMI-9625489 and the AFOSR under grant F49620-95-1-0219.

ABSTRACT

We propose a variant of temporal-difference learning that approximates average and differential costs of an irreducible aperiodic Markov chain. Approximations are comprised of linear combinations of fixed basis functions whose weights are incrementally updated during a single endless trajectory of the Markov chain. We present a proof of convergence (with probability 1), and a characterization of the limit of convergence. We also provide a bound on the resulting approximation error that exhibits an interesting dependence on the "mixing time" of the Markov chain. The results parallel previous work by the authors, involving approximations of discounted cost-to-go.

1 Introduction

Temporal-difference learning, originally proposed by Sutton (1988), is an algorithm for approximating the cost-to-go function of a Markov chain (the expected future cost, as a function of the initial state). Given a set of basis functions, the algorithm tunes a vector of weights so that the weighted combination of the basis functions approximates the cost-to-go function. The weights are iteratively adapted based on information drawn from either simulation or observation of a Markov process. Updates occur upon each state transition with the objective of improving the approximation as time progresses.

The reason for our interest in cost-to-go functions is their central role in dynamic programming algorithms for solving Markov decision problems. In particular, given the cost-to-go function associated with a given control policy, one can perform an iteration of the classical policy iteration method to obtain an improved policy (Bertsekas, 1995). However, when the state space is large, the exact computation of the cost-to-go function becomes infeasible, and this is why we are interested in approximations.

A comprehensive convergence analysis for the case of discounted Markov chains has been provided by the authors (Tsitsiklis and Van Roy, 1997). A simplified version of that work, together with extensions to the case of undiscounted absorbing Markov chains, is presented in (Bertsekas and Tsitsiklis, 1995). Related analyses are given by (Sutton, 1988), (Dayan, 1992), (Gurvits et al., 1994), and (Pineda, 1996). The purpose of the present paper is to propose a variant of temporal-difference learning that is suitable for approximating differential cost functions of undiscounted Markov chains (i.e., solutions to Poisson's equation), and to extend the analysis of (Tsitsiklis and Van Roy, 1997) to this new context.
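For concreteness, Poisson's equation for an irreducible aperiodic finite Markov chain can be stated as follows (the notation here is generic and may differ from that used later in the paper): with transition probabilities $p_{xy}$, per-stage costs $g(x)$, stationary distribution $\pi$, and average cost $\mu^* = \sum_x \pi(x) g(x)$, a differential cost function $J$ satisfies

$$ \mu^* + J(x) \;=\; g(x) + \sum_{y} p_{xy}\, J(y), \qquad \text{for all states } x, $$

and is determined only up to an additive constant.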
The contributions of this paper include the following:

1. A temporal-difference learning algorithm that approximates differential cost functions is proposed.

2. Convergence (with probability 1) is established for the case where approximations are generated by linear combinations of basis functions over a finite state space.

3. The limit of convergence is characterized as the solution to a set of interpretable linear equations, and a bound is placed on the resulting approximation error. Furthermore, a relationship between the error bound and the "mixing time" of the Markov chain is identified.

This paper is not the first to consider simulation-based methods that iteratively evaluate differential cost functions. However, the algorithms that have been explored in this context generally make use of look-up table representations, which involve storing and updating one value per state in the state space. We refer the reader to (Mahadevan, 1996) for a survey of relevant experimental work and to (Abounadi, Bertsekas, and Borkar, 1997) for a theoretical treatment.

There is no prior work explicitly dealing with approximations of differential cost functions. It is known that the differential cost function of an infinite horizon Markov chain is the same as the cost-to-go function of an auxiliary absorbing Markov chain (Bertsekas, 1995; Bertsekas and Tsitsiklis, 1996). This relationship motivates one way of using temporal-difference learning to approximate a differential cost function, namely, deriving the auxiliary absorbing Markov chain and then employing an existing version of temporal-difference learning. However, this reduction can affect approximations in undesirable ways, as we discuss next.

In temporal-difference learning, each weight update depends on a history of visited states. When temporal-difference learning is applied to an absorbing Markov chain, multiple finite trajectories (each terminating at an absorbing state) are simulated. Weight updates occur during these simulations, and the history of visited states is erased upon the termination of each trajectory. Even though restarting the record of visited states is appropriate for an absorbing Markov chain, it is unnatural for the original infinite horizon Markov chain. Due to this peculiarity introduced by the reduction, it is preferable to use a variant of temporal-difference learning designed specifically for approximating differential cost functions, such as the one we will introduce in this paper.

The remainder of this paper is organized as follows. In Section 2, we provide a precise definition of the algorithm. Section 3 presents our convergence result, together with assumptions and a proof. In Section 4, we develop a bound for the approximation error associated with the limit of convergence. Section 5 presents and analyzes another variant of temporal-difference learning. Some new insights that stem from the analysis are also discussed. Finally, concluding remarks are made in Section 6.

2 Average Cost Temporal-Difference Learning
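Section 2 of the paper defines the algorithm precisely. Purely for orientation, the sketch below shows one plausible shape such an update rule could take, consistent with the abstract: a linear approximation of the differential cost built from fixed basis functions, an auxiliary running estimate of the average cost, and incremental updates on every transition of a single endless trajectory. All names, the eligibility-trace form, and the step-size choices here are illustrative assumptions rather than the paper's definitions.

import numpy as np

def average_cost_td_sketch(sample_transition, phi, num_features,
                           lam=0.9, num_steps=100_000, x0=0):
    """Illustrative average-cost TD(lambda)-style recursion.

    This is only a sketch consistent with the abstract's description, not the
    algorithm as defined in Section 2 of the paper.

    sample_transition(x) -> (cost, next_state): simulates one step of the chain.
    phi(x) -> NumPy array of length num_features (the fixed basis functions).
    """
    r = np.zeros(num_features)   # weights: differential cost approximated by phi(x) @ r
    z = np.zeros(num_features)   # eligibility trace
    mu = 0.0                     # running estimate of the average cost
    x = x0

    for t in range(1, num_steps + 1):
        step = 1.0 / t           # one simple diminishing step-size choice
        cost, x_next = sample_transition(x)

        # temporal difference, measured relative to the current average-cost estimate
        d = cost - mu + phi(x_next) @ r - phi(x) @ r

        z = lam * z + phi(x)              # trace accumulated along the single endless trajectory
        r = r + step * d * z              # incremental update of the basis-function weights
        mu = mu + step * (cost - mu)      # incremental update of the average-cost estimate

        x = x_next

    return r, mu

With indicator (one-hot) features the recursion above reduces to a look-up table method of the kind surveyed by Mahadevan (1996); the paper's interest is in the general case of compact linear architectures over large state spaces.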
