Mastering the Game of Go with Deep Neural Networks and Tree Search

David Silver1*, Aja Huang1*, Chris J. Maddison1, Arthur Guez1, Laurent Sifre1, George van den Driessche1, Julian Schrittwieser1, Ioannis Antonoglou1, Veda Panneershelvam1, Marc Lanctot1, Sander Dieleman1, Dominik Grewe1, John Nham2, Nal Kalchbrenner1, Ilya Sutskever2, Timothy Lillicrap1, Madeleine Leach1, Koray Kavukcuoglu1, Thore Graepel1, Demis Hassabis1.

1 Google DeepMind, 5 New Street Square, London EC4A 3TW. 2 Google, 1600 Amphitheatre Parkway, Mountain View CA 94043. * These authors contributed equally to this work. Correspondence should be addressed to either David Silver (davidsilver@google.com) or Demis Hassabis (demishassabis@google.com).

The game of Go has long been viewed as the most challenging of classic games for artificial intelligence due to its enormous search space and the difficulty of evaluating board positions and moves. We introduce a new approach to computer Go that uses value networks to evaluate board positions and policy networks to select moves. These deep neural networks are trained by a novel combination of supervised learning from human expert games, and reinforcement learning from games of self-play. Without any lookahead search, the neural networks play Go at the level of state-of-the-art Monte-Carlo tree search programs that simulate thousands of random games of self-play. We also introduce a new search algorithm that combines Monte-Carlo simulation with value and policy networks. Using this search algorithm, our program AlphaGo achieved a 99.8% winning rate against other Go programs, and defeated the European Go champion by 5 games to 0. This is the first time that a computer program has defeated a human professional player in the full-sized game of Go, a feat previously thought to be at least a decade away.

All games of perfect information have an optimal value function, v*(s), which determines the outcome of the game, from every board position or state s, under perfect play by all players. These games may be solved by recursively computing the optimal value function in a search tree containing approximately b^d possible sequences of moves, where b is the game's breadth (number of legal moves per position) and d is its depth (game length). In large games, such as chess (b ≈ 35, d ≈ 80) [1] and especially Go (b ≈ 250, d ≈ 150) [1], exhaustive search is infeasible [2,3], but the effective search space can be reduced by two general principles. First, the depth of the search may be reduced by position evaluation: truncating the search tree at state s and replacing the subtree below s by an approximate value function v(s) ≈ v*(s) that predicts the outcome from state s. This approach has led to super-human performance in chess [4], checkers [5] and othello [6], but it was believed to be intractable in Go due to the complexity of the game [7]. Second, the breadth of the search may be reduced by sampling actions from a policy p(a|s) that is a probability distribution over possible moves a in position s. For example, Monte-Carlo rollouts [8] search to maximum depth without branching at all, by sampling long sequences of actions for both players from a policy p. Averaging over such rollouts can provide an effective position evaluation, achieving super-human performance in backgammon [8] and Scrabble [9], and weak amateur level play in Go [10].

Monte-Carlo tree search (MCTS) [11,12] uses Monte-Carlo rollouts to estimate the value of each state in a search tree. As more simulations are executed, the search tree grows larger and the relevant values become more accurate. The policy used to select actions during search is also improved over time, by selecting children with higher values. Asymptotically, this policy converges to optimal play, and the evaluations converge to the optimal value function [12]. The strongest current Go programs are based on MCTS, enhanced by policies that are trained to predict human expert moves [13]. These policies are used to narrow the search to a beam of high probability actions, and to sample actions during rollouts. This approach has achieved strong amateur play [13-15]. However, prior work has been limited to shallow policies [13-15] or value functions [16] based on a linear combination of input features.
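The rollout idea described above can be made concrete with a short sketch. The snippet below is a minimal illustration, not AlphaGo's implementation: it assumes a hypothetical game interface with to_play, legal_moves, play, terminal and winner methods (names chosen here for illustration only), samples moves for both players from a uniform random policy until the game ends, and averages the outcomes to estimate the value of a position.

```python
import random

def rollout_value(game, state, num_rollouts=100):
    """Estimate the value of `state` by averaging outcomes of random playouts.

    `game` is assumed to expose to_play(state), legal_moves(state),
    play(state, move), terminal(state) and winner(state); these names are
    illustrative only. Returns the fraction of rollouts won by the player
    to move in `state`.
    """
    player = game.to_play(state)
    wins = 0
    for _ in range(num_rollouts):
        s = state
        # Sample a complete game by drawing moves from a uniform random policy p.
        while not game.terminal(s):
            move = random.choice(game.legal_moves(s))
            s = game.play(s, move)
        if game.winner(s) == player:
            wins += 1
    return wins / num_rollouts
```

Stronger rollout policies than uniform random sampling, such as the trained fast policy discussed later, make the same averaging procedure markedly more accurate per simulation.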
Recently, deep convolutional neural networks have achieved unprecedented performance in visual domains: for example image classification [17], face recognition [18], and playing Atari games [19]. They use many layers of neurons, each arranged in overlapping tiles, to construct increasingly abstract, localised representations of an image [20]. We employ a similar architecture for the game of Go. We pass in the board position as a 19 × 19 image and use convolutional layers to construct a representation of the position. We use these neural networks to reduce the effective depth and breadth of the search tree: evaluating positions using a value network, and sampling actions using a policy network.

We train the neural networks using a pipeline consisting of several stages of machine learning (Figure 1). We begin by training a supervised learning (SL) policy network p_σ directly from expert human moves. This provides fast, efficient learning updates with immediate feedback and high quality gradients. Similar to prior work [13,15], we also train a fast policy p_π that can rapidly sample actions during rollouts. Next, we train a reinforcement learning (RL) policy network p_ρ that improves the SL policy network by optimising the final outcome of games of self-play. This adjusts the policy towards the correct goal of winning games, rather than maximizing predictive accuracy. Finally, we train a value network v_θ that predicts the winner of games played by the RL policy network against itself. Our program AlphaGo efficiently combines the policy and value networks with MCTS.

1 Supervised Learning of Policy Networks

For the first stage of the training pipeline, we build …
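Returning to the input representation described above, the sketch below is a rough, simplified illustration (not the paper's architecture or feature set): it encodes a position as a stack of 19 × 19 binary planes, one each for the current player's stones, the opponent's stones, and empty points, and applies a single 3 × 3 convolutional filter with zero padding, the basic operation that the policy and value networks stack many times.

```python
import numpy as np

BOARD = 19  # full-sized Go board

def encode_position(stones):
    """Encode a position as a 3 x 19 x 19 stack of binary feature planes.

    `stones[i][j]` is +1 for a stone of the player to move, -1 for an
    opponent stone, 0 for an empty point (a simplified input representation).
    """
    stones = np.asarray(stones)
    planes = np.stack([stones == 1, stones == -1, stones == 0])
    return planes.astype(np.float32)  # shape (3, 19, 19)

def conv2d_same(planes, kernel):
    """Apply one convolutional filter (channels x 3 x 3) with zero padding,
    producing a 19 x 19 feature map: one 'tile' of a convolutional layer."""
    _, k, _ = kernel.shape
    pad = k // 2
    padded = np.pad(planes, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros((BOARD, BOARD), dtype=np.float32)
    for i in range(BOARD):
        for j in range(BOARD):
            out[i, j] = np.sum(padded[:, i:i + k, j:j + k] * kernel)
    return out

# Example: an empty board passed through a randomly initialised filter.
features = conv2d_same(encode_position(np.zeros((BOARD, BOARD))),
                       np.random.randn(3, 3, 3).astype(np.float32))
```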
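The difference between the SL and RL stages of the pipeline is essentially a difference in training signal. The toy sketch below is our own simplification using a linear-softmax policy, not the networks used in the paper: the supervised step follows the gradient of the log-likelihood of a human expert move, while the self-play step follows the same gradient scaled by the final game outcome z ∈ {+1, -1}, which is the sense in which the policy is adjusted towards winning games rather than towards predictive accuracy.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def log_policy_gradient(theta, features, move):
    """Gradient of log p_theta(move | s) for a toy linear-softmax policy
    p_theta(a | s) = softmax(theta @ features)[a]."""
    probs = softmax(theta @ features)
    grad = -np.outer(probs, features)
    grad[move] += features
    return grad

def sl_update(theta, features, expert_move, lr=0.01):
    """Supervised step: raise the log-likelihood of the expert move."""
    return theta + lr * log_policy_gradient(theta, features, expert_move)

def rl_update(theta, features, sampled_move, z, lr=0.01):
    """Self-play step: the same gradient, scaled by the game outcome
    z = +1 for a win, -1 for a loss."""
    return theta + lr * z * log_policy_gradient(theta, features, sampled_move)

# Example with 3 candidate moves and 5 position features (made-up sizes).
theta = np.zeros((3, 5))
s = np.random.randn(5)
theta = sl_update(theta, s, expert_move=1)
theta = rl_update(theta, s, sampled_move=2, z=-1)
```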
