Shanghai Jiao Tong University Master's Thesis

Analysis of Generation Companies' Bidding Strategies Based on a Q-Learning Model

Author: Gao Zhan
Degree applied for: Master
Major: Power System and Its Automation
Supervisor: Song Yiqun
February 2008

A SUPPLIER BIDDING STRATEGY BASED ON A Q-LEARNING MODEL

ABSTRACT

The wave of power industry reform swept around the world in the 1980s. To improve the efficiency of electricity production, lower prices, optimize the allocation of resources, break up monopolistic market structures and maximize social welfare, competition must be introduced by developing power markets. Agent technology has recently been widely used in the study and simulation of power markets, because it handles the uncertainty and volatility of these markets better than traditional methods do in simulation and forecasting.

This thesis focuses on maximizing the profit of a supplier Agent in a day-ahead power market. The supplier's bidding strategy is modeled with the Q-learning algorithm, which is also used to simulate tacit collusion among generation companies. The thesis discusses how the algorithm is applied to power market simulation and uses simulations to show the advantage of this reinforcement learning method in supporting the supplier's bidding decisions. The results show that a supplier using Q-learning can construct a complete bidding strategy that maximizes its daily profit. Even when network congestion and ramp-rate constraints are considered, the supplier can still judge the state of the market accurately and adjust its bidding strategy accordingly. The simulations also show that the market clearing price rises as the number of suppliers using Q-learning increases. In addition, a second Q-learning model established in this thesis simulates tacit collusion between low-cost generators, which reduces the profit of high-cost generators. The most effective ways to prevent such collusion are found to be lowering each generator's market share and increasing the elasticity of demand.

KEYWORDS: Q-learning, bidding strategy, Agent, power market

Chapter 1  Introduction

1.1  Background

The reform of the electric power industry began in the 1980s, when vertically integrated utilities in many countries were restructured and competition was introduced into generation.

1.1.1  Power markets abroad

Typical examples include the England and Wales market, which evolved into the New Electricity Trading Arrangements (NETA), and the PJM market in the United States.

1.1.2

1.1.3

1.2  Research on generation companies' bidding strategies

1.2.1  Equilibrium and optimization models

Analytical studies describe the strategic behaviour of suppliers with game-theoretic equilibrium models such as the Cournot, Stackelberg and supply function models [1]-[5].

1.2.2  Other approaches

Further studies of bidding strategies are reported in [6]-[13].

1.3  Agent-based simulation of power markets

An Agent is a software entity that perceives its environment and acts on it autonomously [14]-[17]; a collection of interacting Agents forms a Multi-Agent system. Agent-based simulation has become an important tool for studying power markets [18]-[21].

Reinforcement learning models such as the Roth-Erev model have been used to describe suppliers' bidding behaviour [22][23]. Derek W. Bunn and Fernando S. Oliveira [24] used Roth-Erev Agents to simulate the New Electricity Trading Arrangements (NETA) in England and Wales. Agent-based models have also been used to compare the uniform price and pay-as-bid settlement rules [23]-[26]. Reference [27] applies a Q-learning variant with an ε-greedy exploration policy, a correcting degree and a learning rate to the bidding problem, and [28] compares the learned strategies with the Nash equilibrium. Other Agent-based studies consider temporal difference (TD) learning of the market clearing price (MCP), the roles of market maker and market taker Agents, markup-based bidding and fixed-increment price probing strategies [29]-[33]. Isabel Praca, Carlos Ramos and Zita Vale [34] proposed an adapted derivative-following strategy combined with Scenario Analysis [35]. Trial-and-error temporal difference learning [36] and learning classifier systems (LCSs) [37][38] have also been applied, among other Agent-based approaches [39]-[43].

1.4  Organization of the thesis

Chapter 2 describes the day-ahead power market model used in this thesis, including the generators' cost and bid models and the market clearing mechanism. Chapter 3 introduces reinforcement learning and the Q-learning algorithm. Chapter 4 builds the Q-learning model of the supplier Agent's bidding strategy; the later chapters extend the model to tacit collusion between generators and summarize the conclusions.

Chapter 2  The day-ahead power market model

Electricity markets are settled either at a uniform price, where every accepted offer is paid the market clearing price, or pay-as-bid, where each accepted offer is paid its own bid price. The day-ahead market considered in this thesis is divided into 24 hourly trading periods of 60 minutes each, from 0:00~1:00 through 23:00~24:00; suppliers submit their bids for every period before the operating day.

2.1  The day-ahead market

The market is cleared once for each of the 24 hourly periods of the trading day.

2.1.1

2.1.2

2.2  The generation company model

2.2.1  Cost model

The production cost of a generation unit is modeled as a quadratic function of its output q:

$$Cost(q) = \frac{1}{2}Aq^{2} + Bq + C \qquad (2\text{-}1)$$

where A, B and C are the cost coefficients of the unit. Its marginal cost is therefore linear in the output:

$$p = \frac{dCost(q)}{dq} = Aq + B \qquad (2\text{-}2)$$

Figure 1  Marginal cost curve of a generation company (output q in MW on the horizontal axis, marginal price per MW on the vertical axis; the curve runs from the minimum output q_min to the maximum output q_max through the economic load point).

2.2.2  Bid model

Suppose there are n generation companies, i = 1, 2, ..., n. The marginal cost of company i is

$$p_{Gi} = A_{i} q_{Gi} + B_{i} \qquad (2\text{-}3)$$

where q_Gi is its output. The bid it submits to the market is also a linear function of its output:

$$p_{Gi} = \alpha_{i} q_{Gi} + \beta_{i} \qquad (2\text{-}4)$$

where the intercept β_i is kept equal to the marginal-cost coefficient B_i, while the slope α_i is the strategic variable that the company chooses when it bids.

2.3  Market clearing

2.3.1  Optimal power flow and nodal prices

Given all submitted bids, the market operator clears the market by solving an optimal power flow (OPF) that minimizes the total declared cost subject to the power balance and the line flow limits:

$$\begin{aligned}
\min \;& \sum_{i=1}^{n} Cost(q_{Gi}) \\
\text{s.t.} \;& \sum_{i=1}^{n} q_{Gi} - \sum_{i=1}^{n} q_{Li} - q_{Loss} = 0 \\
& q_{ij} \le q_{ij}^{M}
\end{aligned} \qquad (2\text{-}5)$$

where q_Gi, q_Li and q_Loss are the generation at node i, the load at node i and the network loss, Cost(q_Gi) is the declared cost of generator i, and q_ij and q_ij^M are the power flow on line ij and its limit.

Let λ_i denote the nodal price at node i, λ the multiplier of the power balance constraint and μ_k (k = 1, ..., m) the multipliers of the line flow constraints. At the optimum, each generator is dispatched so that its declared marginal cost equals its nodal price:

$$\frac{dCost(q_{Gi})}{dq_{Gi}} = \lambda_{i} \qquad (i = 1, \cdots, n) \qquad (2\text{-}6)$$

and the nodal price can be expressed as

$$\lambda_{i} = \lambda\left(1 - \frac{\partial q_{Loss}}{\partial q_{Gi}}\right) - \sum_{k} \mu_{k}\, \frac{\partial q_{k}}{\partial q_{Gi}} \qquad (2\text{-}7)$$

where ∂q_Loss/∂q_Gi is the sensitivity of the network loss to the output of generator i and ∂q_k/∂q_Gi is the sensitivity of the flow on line k to that output. When losses are neglected and no line is congested, all nodal prices are equal and the market settles at a single uniform price.
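To make the clearing process concrete, the following is a minimal numerical sketch, not taken from the thesis, of uniform-price clearing with the linear bid curves of (2-4) on a single bus: losses, line limits and ramp rates are ignored, demand is assumed inelastic, and all identifiers (Bid, clear_uniform_price, the coefficient values) are illustrative.

```python
# Minimal sketch: uniform-price clearing with linear bid curves p = alpha*q + beta.
# Assumptions (not from the thesis): single bus, no losses or congestion,
# inelastic demand, and units are kept at q_min even when the price is below
# their bid curve (a crude stand-in for a must-run minimum).
from dataclasses import dataclass

@dataclass
class Bid:
    alpha: float   # bid slope (price per MW per MW of output)
    beta: float    # bid intercept (price per MW at zero output)
    q_min: float   # minimum stable output (MW)
    q_max: float   # maximum output (MW)

def clear_uniform_price(bids, demand, iters=60):
    """Bisect on the clearing price until dispatched output matches demand.

    Each unit is dispatched where its bid curve meets the trial price,
    clamped to [q_min, q_max]; the price that balances supply and demand
    is the uniform market clearing price."""
    lo = min(b.beta for b in bids)
    hi = max(b.alpha * b.q_max + b.beta for b in bids)
    for _ in range(iters):
        lam = 0.5 * (lo + hi)
        q = [min(max((lam - b.beta) / b.alpha, b.q_min), b.q_max) for b in bids]
        if sum(q) < demand:
            lo = lam          # too little supply: raise the trial price
        else:
            hi = lam          # enough supply: lower the trial price
    return lam, q

if __name__ == "__main__":
    bids = [Bid(alpha=0.02, beta=18.0, q_min=50, q_max=400),
            Bid(alpha=0.03, beta=22.0, q_min=50, q_max=300)]
    mcp, dispatch = clear_uniform_price(bids, demand=500.0)
    print(f"clearing price {mcp:.2f}, dispatch {[round(x, 1) for x in dispatch]}")
```

With losses and line limits included, the same idea generalizes to the OPF of (2-5), where the single clearing price is replaced by the nodal prices of (2-7).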
2.4  Summary

This chapter described the day-ahead market framework, the generators' cost and bid models and the market clearing mechanism. The next chapter introduces the Q-learning algorithm that is used to model the suppliers' bidding decisions.

Chapter 3  Reinforcement learning and the Q-learning algorithm

3.1  Reinforcement learning

3.1.1  The reinforcement learning problem

In reinforcement learning an Agent interacts with its environment: at each step it observes the current state, selects an action, and receives a reward from the environment. Its goal is to learn a policy that maximizes the cumulative reward it obtains. Three common objectives are the discounted cumulative reward

$$V = \sum_{i=0}^{\infty} \gamma^{i} r_{t+i} \qquad (3\text{-}1)$$

the finite-horizon cumulative reward

$$V = \sum_{i=0}^{h} r_{i} \qquad (3\text{-}2)$$

and the average reward

$$V = \lim_{h \to \infty} \frac{1}{h} \sum_{i=0}^{h} r_{i} \qquad (3\text{-}3)$$

where 0 ≤ γ < 1 is the discount factor and r_i the reward received at step i. Criterion (3-1) weights immediate rewards more heavily than delayed ones, (3-2) only considers rewards within a horizon of h steps, and (3-3) weights all future rewards equally. This thesis uses the discounted criterion; the Agent's task is to learn the optimal policy π*, whose value is

$$V^{\pi^{*}}(s) = \max \sum_{i=0}^{\infty} \gamma^{i} r_{i} \qquad (3\text{-}4)$$

3.1.2  Markov decision processes

The interaction between the Agent and its environment can be formalized as a Markov decision process (MDP) defined by the tuple (S, A, δ, r), where S is the set of environment states, A is the set of actions available to the Agent, δ is the state transition function and r the reward function. In general the transition and the reward may be probabilistic; here they are first taken to be deterministic, with δ: S × A → S and r: S × A → ℝ. At time t the Agent observes state s_t and executes action a_t; it then receives the reward r_t = r(s_t, a_t) and the environment moves to the next state s_{t+1} = δ(s_t, a_t). The Agent must learn a policy π: S → A that specifies the action π(s) to take in each state s so as to maximize its cumulative reward.

3.1.3  Main issues in reinforcement learning

Three issues distinguish reinforcement learning from other learning problems:

1. Delayed reward. The Agent does not learn from labelled examples; the consequences of an action may only appear many steps later, so the Agent must decide how to assign credit for the final outcome to the individual actions taken along the way.
2. Exploration versus exploitation. The Agent must balance exploring untried actions to gather information against exploiting the actions currently believed to be best. A common compromise is the ε-greedy policy, in which the Agent explores with probability ε and exploits with probability 1 − ε.
3. Generalization. When the state or action space is large, the Agent cannot visit every state-action pair and must generalize from the experience it has gathered.

3.2  The Q-learning algorithm

Q-learning was proposed by Watkins in 1989 as a reinforcement learning method; related work includes the Dyna architecture, and the convergence of Q-learning was proved in 1992. Instead of learning the value function V directly, the Agent learns an evaluation function Q(s, a) defined over state-action pairs, from which optimal actions can be derived even when the Agent has no model of its environment.

3.2.1  Learning conditions

In general the Agent does not know the state transition function δ(s, a) or the reward function r(s, a) in advance; it can only observe the consequences of the actions it actually executes. Q-learning can be applied online, while the Agent interacts with the real environment, or offline, on recorded experience.

3.2.2  Formal statement

Consider a Markov decision process in which the Agent observes state s_t ∈ S at time t, chooses action a_t ∈ A, receives the reward r_t = r(s_t, a_t) and moves to the state s_{t+1} = δ(s_t, a_t); in an MDP, δ(s_t, a_t) and r(s_t, a_t) depend only on the current state and action. The Agent must learn a policy π: S → A with a_t = π(s_t). The value of a policy π starting from state s_t is the discounted cumulative reward

$$V^{\pi}(s_{t}) = \sum_{i=0}^{\infty} \gamma^{i} r_{t+i} \qquad (3\text{-}5)$$

where the sequence of rewards r_{t+i} is generated by starting in s_t and repeatedly following π (a_t = π(s_t), a_{t+1} = π(s_{t+1}), and so on), and 0 ≤ γ < 1 is the discount factor: with γ = 0 only the immediate reward counts, and as γ approaches 1 future rewards receive greater weight. The learning task is to find the optimal policy

$$\pi^{*} = \arg\max_{\pi} V^{\pi}(s), \quad \forall s \qquad (3\text{-}6)$$

whose value function is written V*(s) = V^{π*}(s).

3.2.3  The Q function

If the Agent knew the reward function r(s, a), the transition function δ and the optimal value function V*, it could obtain the optimal policy directly by one-step lookahead:

$$\pi^{*}(s) = \arg\max_{a}\left[ r(s,a) + \gamma V^{*}\big(\delta(s,a)\big)\right] \qquad (3\text{-}7)$$

In practice, however, r and δ are unknown to the Agent, so this rule cannot be applied directly. Q-learning therefore defines the evaluation function Q(s, a) as the reward received immediately upon executing action a in state s plus the discounted value of following the optimal policy thereafter:

$$Q(s,a) = r(s,a) + \gamma V^{*}\big(\delta(s,a)\big) \qquad (3\text{-}8)$$

Q(s, a) is exactly the quantity maximized in (3-7), so the optimal policy can be rewritten as

$$\pi^{*}(s) = \arg\max_{a} Q(s,a) \qquad (3\text{-}9)$$

Thus if the Agent learns Q instead of V*, it can choose optimal actions without knowing r or δ: it only needs to compare the Q values of the actions available in its current state.

Figure 2  Interaction of the Agent with its environment in Q-learning.

3.3  The Q-learning training rule

3.3.1  Learning the Q function in a deterministic environment

Combining (3-8) with the relation between V* and Q,

$$V^{*}(s) = \max_{a'} Q(s,a') \qquad (3\text{-}10)$$

gives a recursive definition of Q:

$$Q(s,a) = r(s,a) + \gamma \max_{a'} Q\big(\delta(s,a), a'\big) \qquad (3\text{-}11)$$

The learning Agent maintains a table of estimates Q̂(s, a), one entry for every state-action pair, initialized to zero. Whenever it executes action a in state s and observes the reward r = r(s, a) and the new state s' = δ(s, a), it updates the corresponding entry by the training rule

$$\hat{Q}(s,a) \leftarrow r + \gamma \max_{a'} \hat{Q}(s',a') \qquad (3\text{-}12)$$

In a deterministic MDP with bounded rewards, every estimate Q̂(s, a) converges to the true value Q(s, a), provided every state-action pair is visited infinitely often.

3.3.2  The nondeterministic case

When the transition function δ(s, a) and the reward function r(s, a) are probabilistic, the value of a policy is redefined as the expected value of the discounted cumulative reward of (3-5),

$$V^{\pi} = E\left[\sum_{i=0}^{\infty} \gamma^{i} r_{t+i}\right] \qquad (3\text{-}13)$$

and the Q function becomes

$$Q(s,a) = E\left[r(s,a) + \gamma V^{*}\big(\delta(s,a)\big)\right] = E\left[r(s,a)\right] + \gamma \sum_{s'} P(s' \mid s,a)\, V^{*}(s') \qquad (3\text{-}14)$$

where P(s'|s, a) is the probability of reaching state s' after executing action a in state s. Rewriting V* in terms of Q gives

$$Q(s,a) = E\left[r(s,a)\right] + \gamma \sum_{s'} P(s' \mid s,a)\, \max_{a'} Q(s',a') \qquad (3\text{-}15)$$

Because r(s, a) is now random, the deterministic rule (3-12) no longer converges; the estimate must instead be revised gradually. The nondeterministic training rule is

$$\hat{Q}_{n}(s,a) \leftarrow (1-\alpha_{n})\,\hat{Q}_{n-1}(s,a) + \alpha_{n}\left[r + \gamma \max_{a'} \hat{Q}_{n-1}(s',a')\right] \qquad (3\text{-}16)$$

where the learning rate α_n decreases with the number of times the pair (s, a) has been visited,

$$\alpha_{n} = \frac{1}{1 + visits_{n}(s,a)} \qquad (3\text{-}17)$$

with visits_n(s, a) the number of visits to (s, a) up to and including the n-th iteration. With this decaying learning rate and an exploration policy, such as ε-greedy, that visits every state-action pair infinitely often, Q̂ converges to Q.

3.4  Summary

This chapter introduced reinforcement learning, the Markov decision process framework and the Q-learning algorithm, including the training rules for deterministic and nondeterministic environments. The next chapter applies Q-learning to the supplier's bidding problem.

Chapter 4  A Q-learning model of the supplier's bidding strategy

4.1  The bidding model of the supplier Agent

Each supplier is modeled as an Agent i whose action set is a finite set of candidate bids A_i = {a_{i1}, ..., a_{im}}. Each action corresponds to a value of the bid slope α_i in (2-4), chosen from the interval (0, α_i^max], where the upper bound α_i^max is determined by the market price cap p^cap and the capacity Cap_i of the unit.
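The following is a minimal sketch, not taken from the thesis, of how the tabular update of equations (3-16) and (3-17) with ε-greedy exploration can be applied to the supplier Agent of Chapter 4: the actions are the discretized bid slopes α_i and the reward is the profit obtained after the market clears. The environment step (market clearing and profit calculation) is assumed to be supplied externally, and all identifiers (QBiddingAgent, choose, update) are illustrative.

```python
# Minimal sketch of the Q-learning bidding Agent: eq. (3-16) update with the
# decaying learning rate of eq. (3-17) and epsilon-greedy exploration.
# Assumptions (not from the thesis): discrete states, a finite set of bid
# slopes as actions, and an external routine that clears the market and
# returns the Agent's profit as the reward.
import random
from collections import defaultdict

class QBiddingAgent:
    def __init__(self, actions, gamma=0.9, epsilon=0.1):
        self.actions = list(actions)        # candidate bid slopes alpha_i
        self.gamma = gamma                  # discount factor
        self.epsilon = epsilon              # exploration probability
        self.q = defaultdict(float)         # Q-hat(s, a), zero-initialized
        self.visits = defaultdict(int)      # visit counts per (s, a)

    def choose(self, state):
        """Epsilon-greedy selection over the discrete set of bid slopes."""
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(state, a)])

    def update(self, s, a, reward, s_next):
        """Apply eq. (3-16) with the learning rate of eq. (3-17)."""
        self.visits[(s, a)] += 1
        alpha_n = 1.0 / (1.0 + self.visits[(s, a)])
        best_next = max(self.q[(s_next, a2)] for a2 in self.actions)
        self.q[(s, a)] = ((1.0 - alpha_n) * self.q[(s, a)]
                          + alpha_n * (reward + self.gamma * best_next))
```

In a simulation run, the state could for instance encode the trading hour (and, when congestion is modeled, a congestion flag): in each hour the Agent picks a slope with choose(), the market is cleared with the resulting bid curve, and the profit for that hour is fed back through update().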