Model-Based Reinforcement Learning in Dynamic Environments


Technical Report UU-CS-2002-029

Marco A. Wiering
marco@cs.uu.nl
Intelligent Systems Group
Institute of Information and Computing Sciences
Utrecht University

Abstract

We study using reinforcement learning in particular dynamic environments. Our environments can contain many dynamic objects, which makes optimal planning hard. One way of using information about all dynamic objects is to expand the state description, but this results in a high-dimensional policy space. Our approach is to instantiate information about dynamic objects in the model of the environment and to replan using model-based reinforcement learning whenever this information changes. Furthermore, our approach can be combined with an a-priori model of the changing parts of the environment, which enables the agent to optimally plan a course of action. Results on a navigation task in a Wumpus-like environment with multiple dynamic hostile spider agents show that our system is able to learn good solutions minimizing the risk of hitting spider agents. Further experiments show that the time complexity of the algorithm scales well when more information is instantiated in the model.

Keywords: Reinforcement Learning, Dynamic Environments, Model-based RL, Instantiating Information, Replanning, POMDPs, Wumpus

1 Introduction

Reinforcement learning. Reinforcement learning (Sutton and Barto, 1998; Kaelbling et al., 1996) can be used to learn to control an agent by letting the agent interact with its environment and learn from the obtained feedback (reward signals). Using a trial-and-error process, a reinforcement-learning (RL) agent is able to learn a policy (or plan) which optimizes the cumulative reward intake of the agent over time. Reinforcement learning has been applied successfully in particular stationary environments such as checkers (Samuel, 1959), backgammon (Tesauro, 1992), and chess (Baxter et al., 1997). Reinforcement learning has also been applied to find good solutions for difficult multi-agent problems such as elevator control (Crites and Barto, 1996), network routing (Littman and Boyan, 1993), and traffic light control (Wiering, 2000). RL has only been used a few times in single-agent non-stationary environments, however. Path-planning problems in non-stationary environments are in fact partially observable Markov decision problems (POMDPs) (Lovejoy, 1991), which are known to be hard to solve exactly. Dayan and Sejnowski (1996) concentrate on the dual control or exploration problem, where there is a need to detect changes in a changing environment while the agent acts to gain as much reward as possible. Boyan and Littman (2001) use a temporal model to take changes of the environment into account when computing a policy. In this paper we are interested in applying RL to learn to control agents in dynamic environments.

Dynamic environments. Learning in dynamic environments is hard, since the agent needs to stay informed about the status of all dynamic objects in the environment. This can be done by augmenting the state space with a description of the status of all dynamic objects, but this may quickly cause a state space explosion. Furthermore, the agent may not exactly know the status of an object and therefore has to deal with uncertain information. Using uncertain information as part of the state space is hard, since it makes the state space continuous and high dimensional.

Instantiating information in the model. There exists another method for using knowledge about dynamic objects: instantiate the information about the dynamic objects in the world model and then use the revised world model to compute a new policy. E.g., if a door can be open or closed, and we know whether the door is closed, we can set new transition probabilities between states in the world model such that this information can be used by the agent. Once the model is updated using the currently available information, dynamic programming-like algorithms (Bellman, 1957; Moore and Atkeson, 1993) can be used to compute a new policy. In this way, we have an adaptive agent which takes currently known information into account for computing actions, and which replans once the dynamic information changes. This is hard to do with other planning methods, especially for closed-loop control in stochastic dynamic environments. Furthermore, the agent could also instantiate information received by communication, which can be useful for multi-agent reinforcement learning. Although sharing policies (Tan, 1993) is one way for cooperative multi-agent learning, communication with instantiated information can also be used for non-cooperative or semi-cooperative environments.
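
To make the instantiate-and-replan idea concrete, the following minimal sketch (not taken from the report) encodes a one-door corridor as a tabular MDP, writes the currently sensed door status into the transition probabilities, and replans with standard value iteration. The toy layout, reward values, and names such as instantiate_door and value_iteration are illustrative assumptions.

    import numpy as np

    N_STATES, N_ACTIONS, GAMMA = 4, 2, 0.95  # corridor: 0 -> 1 -(door)-> 2 -> 3 (goal)

    # P[a, s, s'] = transition probability; R[s'] = reward for entering state s'.
    P = np.zeros((N_ACTIONS, N_STATES, N_STATES))
    P[0, 0, 1] = 1.0                 # action 0: move forward
    P[0, 1, 2] = 1.0                 # door passage; revised by instantiate_door
    P[0, 2, 3] = 1.0
    P[0, 3, 3] = 1.0
    P[1] = np.eye(N_STATES)          # action 1: stay in place
    R = np.array([0.0, 0.0, 0.0, 1.0])

    def instantiate_door(P, door_open):
        """Write the currently known door status into the world model by
        revising the transition probabilities around the door (state 1)."""
        P = P.copy()
        P[0, 1, 2] = 1.0 if door_open else 0.0
        P[0, 1, 1] = 0.0 if door_open else 1.0  # closed door: agent stays put
        return P

    def value_iteration(P, R, gamma=GAMMA, tol=1e-6):
        """Dynamic-programming replanning step on the revised model."""
        V = np.zeros(P.shape[1])
        while True:
            Q = P @ (R + gamma * V)             # Q[a, s] = sum_s' P[a,s,s'](R[s'] + g V[s'])
            V_new = Q.max(axis=0)
            if np.abs(V_new - V).max() < tol:
                return Q.argmax(axis=0), V_new  # greedy policy and state values
            V = V_new

    # Replan whenever the sensed door status changes.
    for door_open in (True, False):
        policy, V = value_iteration(instantiate_door(P, door_open), R)
        print("door open:", door_open, "policy:", policy, "values:", V.round(2))

With the door instantiated as closed, the values of the states before the door drop to zero and the agent stops planning through it; once the door is sensed open again, a single replanning sweep restores the original route.
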
Using prior knowledge. Often reinforcement learning is used to learn control knowledge from scratch, i.e., without using a-priori knowledge. We know, however, that the use of some kind of a-priori knowledge can be very beneficial. For example, if particular actions are heavily punished, we do not want to explore those actions, but rather reason about the consequences of these actions using an a-priori designed model. A-priori knowledge can also be used to model a dynamic environment so that this knowledge can be presented to the RL agent. This enables the agent to reason about the dynamics of the environment, which may be necessary to solve a particular problem, where problems may arise one after the other. As an example, think about an agent which is walking in a city and uses RL to learn a map of the city. After some time, the agent may have the desire to drink something in a bar. Once the agent enters some bar, it could use an a-priori model of bars to understand which dynamic entities, such as a barkeeper, other customers, tables and chairs etc., play a role in the bar setting. So it can use this model, fill in the actual situation using sensor data (e.g., vision), and compute a policy (or plan) to attain its current goal. If the agent discovers more information about particular (dynamic) entities, it can again instantiate this in the model of the current bar situation.
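
As a purely illustrative sketch of this use of prior knowledge (the report's actual spider model and reward values are not reproduced here), an a-priori model of the dynamic entities can be folded into the planner's expected rewards before replanning, so that heavily punished outcomes are reasoned about rather than explored. All constants and the function name below are assumptions for illustration.

    import numpy as np

    STEP_REWARD, SPIDER_PENALTY = -0.05, -10.0   # assumed values, not the report's

    def expected_rewards(base_reward, occupancy):
        # base_reward[s]: task reward for entering state s.
        # occupancy[s]: a-priori probability that a hostile agent occupies
        # state s at the next step, taken from the a-priori dynamics model.
        return base_reward + SPIDER_PENALTY * occupancy

    base = np.full(5, STEP_REWARD)
    base[-1] = 1.0                                    # goal in the last cell
    occupancy = np.array([0.0, 0.1, 0.6, 0.1, 0.0])   # assumed belief over spiders
    print(expected_rewards(base, occupancy).round(2))

A planner that maximizes these expected rewards avoids the high-occupancy cell without ever having to be punished for entering it, which is the benefit of reasoning with an a-priori model instead of exploring from scratch.
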
