A Simulation-Based Algorithm for Ergodic Control of Markov Chains Conditioned on Rare Events

S. Bhatnagar (corresponding author), V. S. Borkar and A. Madhukar

February 2006

S. Bhatnagar: Department of Computer Science and Automation, Indian Institute of Science, Bangalore 560012, India. E-Mail: shalabh@csa.iisc.ernet.in
V. S. Borkar: School of Technology and Computer Science, Tata Institute of Fundamental Research, Homi Bhabha Road, Mumbai 400005, India. E-Mail: borkar@tifr.res.in
A. Madhukar: Department of Electrical Engineering, Indian Institute of Science, Bangalore 560012, India. E-Mail: madhukar@ee.iisc.ernet.in

Abstract

We study the problem of long-run average cost control of Markov chains conditioned on a rare event. In related recent work, a simulation-based algorithm was developed for estimating performance measures associated with a Markov chain conditioned on a rare event. We extend ideas from that work and develop an adaptive algorithm for obtaining, online, optimal control policies conditioned on a rare event. Our algorithm uses three timescales or step-size schedules. On the slowest timescale, a gradient search algorithm for policy updates is used, based on one-simulation simultaneous perturbation stochastic approximation (SPSA) type estimates; the perturbation sequences are deterministic and are obtained from appropriate normalized Hadamard matrices. The fast timescale recursions compute the conditional transition probabilities of an associated chain by solving the multiplicative Poisson equation for the given policy estimate. Further, the risk parameter associated with the value function for a given policy estimate is updated on a timescale that lies between the two scales above. We briefly sketch the convergence analysis of our algorithm and present a numerical application in the setting of routing multiple flows in communication networks.

Key Words: Markov decision processes, optimal control conditioned on a rare event, simulation-based algorithms, SPSA with deterministic perturbations, reinforcement learning.

1 Introduction

Markov decision processes (MDPs) [5], [35] form a general framework for studying problems of control of stochastic dynamic systems (SDS). One often encounters situations involving control of an SDS conditioned on a rare event of asymptotically zero probability. This could be, for example, a problem of damage control when faced with a catastrophic event. For instance, in the setting of a large communication network such as the Internet, one may be interested in obtaining optimal flow and congestion control or routing strategies in a subnetwork, given that an extremal event such as a link failure has occurred in another, remote subnetwork. Our objective in this paper is to consider a problem of this nature, wherein the rare event is specifically defined as the time average of a function of the MDP and its associated control-valued process exceeding a threshold that is larger than its mean. We consider the infinite horizon long-run average cost criterion for this problem and devise an algorithm based on policy iteration for it.

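Spelled out, the conditioning event admits a minimal formalization along the following lines (the notation g, \alpha and the state-control pair (X_m, Z_m) is introduced here for illustration and is not taken from the excerpt):

\[
A_n \;=\; \Big\{ \frac{1}{n}\sum_{m=0}^{n-1} g(X_m, Z_m) \;\ge\; \alpha \Big\},
\qquad
\alpha \;>\; \lim_{n\to\infty} \frac{1}{n}\sum_{m=0}^{n-1} E\big[g(X_m, Z_m)\big].
\]

Since the time average of g converges to its stationary mean for an ergodic chain under a fixed policy, P(A_n) tends to zero as n grows, which is the sense in which the event is rare; the control problem is then to minimize the long-run average cost conditioned on this event.
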
Research on developing simulation-based methods for control of SDS has gathered momentum in recent times. These methods largely go under the names of neuro-dynamic programming (NDP) [7] or reinforcement learning (RL) [39], and are applicable to systems for which model information is either unknown or computationally forbiddingly expensive, but for which output data, obtained either from a real system or from a simulation, is available. Our problem does not share this last feature, but we do borrow certain algorithmic paradigms from this literature. Before we proceed further, we first review some representative recent work along these lines.

In [3], an algorithm for long-run average cost MDPs is presented. The average cost gradient is approximated using that associated with a corresponding infinite horizon discounted cost MDP problem. The variance of the estimates, however, increases rapidly as the discount factor is brought closer to one. In [4], certain variants of the algorithm in [3] are presented, along with applications in some experimental settings. In [25], a perturbation analysis (PA) type approach is used to obtain the performance gradient from sample path analysis. In [24], a PA-based method is proposed for solving long-run average cost MDPs; it requires keeping track of the regeneration epochs of the underlying process for any policy and aggregating data over them. Such epochs can, however, be very infrequent in most real-life systems. In [32], the average cost gradient is computed by assuming that sample path gradients of the performance and transition probabilities are known in functional form. Amongst other RL-based approaches, temporal difference (TD) learning [39] and Q-learning [42] have been popular in recent times; these are based on value function approximations. A parallel development is that of actor-critic algorithms based on the classical policy iteration algorithm in dynamic programming. Note that classical policy iteration proceeds via two nested loops: an outer loop in which the policy improvement step is performed, and an inner loop in which the policy evaluation step for the policy prescribed by the outer loop is conducted. The respective operations in the two loops are performed one after the other in a cyclic manner. The inner loop can in principle take a long time to converge, making the overall procedure slow in practice. In [29], certain simulation-based algorithms that use multi-timescale stochastic approximation are proposed. The idea is to use coupled stochastic recursions driven by different step-size schedules or timescales. The recursion corresponding to policy evaluation is run on the faster timescale, while that corresponding to policy improvement is run on the slower one. Thus ...

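As a schematic picture of this multi-timescale coupling, the following Python sketch runs a fast policy-evaluation recursion alongside a slow policy-improvement recursion. It is a minimal illustration, not the paper's algorithm: policy_eval_target and policy_gradient are hypothetical placeholders for simulation-driven estimates, and the step-size exponents are one illustrative choice satisfying a(n) = o(b(n)).

import numpy as np

def coupled_recursions(policy_eval_target, policy_gradient, theta0, v0,
                       steps=10000):
    # v moves with the larger step sizes b(n) (fast timescale: policy
    # evaluation); theta moves with the smaller step sizes a(n) (slow
    # timescale: policy improvement). Seen from the slow recursion, v has
    # essentially equilibrated, emulating the nested loops of policy
    # iteration without waiting for the inner loop to converge.
    theta = np.asarray(theta0, dtype=float)
    v = np.asarray(v0, dtype=float)
    for n in range(1, steps + 1):
        b = 1.0 / n ** 0.6                            # fast step-size schedule
        a = 1.0 / n                                   # slow step-size schedule
        v += b * (policy_eval_target(theta, v) - v)   # policy evaluation step
        theta -= a * policy_gradient(theta, v)        # policy improvement step
    return theta, v
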

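The abstract's slowest-timescale update relies on one-simulation SPSA estimates with deterministic perturbations taken from normalized Hadamard matrices. The sketch below shows how such a +/-1 perturbation sequence can be generated, via the standard Sylvester construction, and used; the cost oracle J, the perturbation size delta and the step sizes are illustrative assumptions rather than the paper's exact scheme.

import numpy as np

def normalized_hadamard(p):
    # Sylvester construction: H_1 = [1], H_{2k} = [[H, H], [H, -H]].
    # The first row and column are all ones, i.e. the matrix is normalized.
    H = np.array([[1.0]])
    while H.shape[0] < p:
        H = np.block([[H, H], [H, -H]])
    return H

def hadamard_perturbations(dim):
    # Cycle deterministically through the rows of a normalized Hadamard
    # matrix of order 2^ceil(log2(dim + 1)), skipping the all-ones first
    # column, to obtain +/-1 perturbation vectors of length dim.
    p = 1
    while p < dim + 1:
        p *= 2
    rows = normalized_hadamard(p)[:, 1:dim + 1]
    n = 0
    while True:
        yield rows[n % p]
        n += 1

def one_simulation_spsa(J, theta0, delta=0.1, steps=500):
    # One-simulation SPSA: each iteration uses a single noisy cost
    # measurement J(theta + delta * Delta) to estimate all partial
    # derivatives at once, dividing componentwise by delta * Delta.
    theta = np.asarray(theta0, dtype=float)
    perturbations = hadamard_perturbations(len(theta))
    for n in range(1, steps + 1):
        Delta = next(perturbations)
        y = J(theta + delta * Delta)
        theta -= (1.0 / n) * y / (delta * Delta)
    return theta

For instance, one_simulation_spsa(lambda th: float(np.sum(th ** 2)), np.ones(3)) drives theta toward the minimizer of a quadratic cost using one cost evaluation per iteration.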