Mastering the Game of Go with Deep Neural Networks and Tree Search

David Silver1*, Aja Huang1*, Chris J. Maddison1, Arthur Guez1, Laurent Sifre1, George van den Driessche1, Julian Schrittwieser1, Ioannis Antonoglou1, Veda Panneershelvam1, Marc Lanctot1, Sander Dieleman1, Dominik Grewe1, John Nham2, Nal Kalchbrenner1, Ilya Sutskever2, Timothy Lillicrap1, Madeleine Leach1, Koray Kavukcuoglu1, Thore Graepel1, Demis Hassabis1.

1 Google DeepMind, 5 New Street Square, London EC4A 3TW.
2 Google, 1600 Amphitheatre Parkway, Mountain View CA 94043.
* These authors contributed equally to this work.

Correspondence should be addressed to either David Silver (davidsilver@google.com) or Demis Hassabis (demishassabis@google.com).

The game of Go has long been viewed as the most challenging of classic games for artificial intelligence due to its enormous search space and the difficulty of evaluating board positions and moves. We introduce a new approach to computer Go that uses value networks to evaluate board positions and policy networks to select moves. These deep neural networks are trained by a novel combination of supervised learning from human expert games, and reinforcement learning from games of self-play. Without any lookahead search, the neural networks play Go at the level of state-of-the-art Monte-Carlo tree search programs that simulate thousands of random games of self-play. We also introduce a new search algorithm that combines Monte-Carlo simulation with value and policy networks. Using this search algorithm, our program AlphaGo achieved a 99.8% winning rate against other Go programs, and defeated the European Go champion by 5 games to 0. This is the first time that a computer program has defeated a human professional player in the full-sized game of Go, a feat previously thought to be at least a decade away.

All games of perfect information have an optimal value function, v*(s), which determines the outcome of the game, from every board position or state s, under perfect play by all players. These games may be solved by recursively computing the optimal value function in a search tree containing approximately b^d possible sequences of moves, where b is the game's breadth (number of legal moves per position) and d is its depth (game length). In large games, such as chess (b ≈ 35, d ≈ 80)^1 and especially Go (b ≈ 250, d ≈ 150)^1, exhaustive search is infeasible^2,3, but the effective search space can be reduced by two general principles. First, the depth of the search may be reduced by position evaluation: truncating the search tree at state s and replacing the subtree below s by an approximate value function v(s) ≈ v*(s) that predicts the outcome from state s. This approach has led to super-human performance in chess^4, checkers^5 and othello^6, but it was believed to be intractable in Go due to the complexity of the game^7. Second, the breadth of the search may be reduced by sampling actions from a policy p(a|s) that is a probability distribution over possible moves a in position s. For example, Monte-Carlo rollouts^8 search to maximum depth without branching at all, by sampling long sequences of actions for both players from a policy p. Averaging over such rollouts can provide an effective position evaluation, achieving super-human performance in backgammon^8 and Scrabble^9, and weak amateur level play in Go^10.

Monte-Carlo tree search (MCTS)^11,12 uses Monte-Carlo rollouts to estimate the value of each state in a search tree. As more simulations are executed, the search tree grows larger and the relevant values become more accurate. The policy used to select actions during search is also improved over time, by selecting children with higher values. Asymptotically, this policy converges to optimal play, and the evaluations converge to the optimal value function^12. The strongest current Go programs are based on MCTS, enhanced by policies that are trained to predict human expert moves^13. These policies are used to narrow the search to a beam of high probability actions, and to sample actions during rollouts. This approach has achieved strong amateur play^13-15. However, prior work has been limited to shallow policies^13-15 or value functions^16 based on a linear combination of input features.
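As a concrete illustration of the rollout-based evaluation described above, the following is a minimal sketch in Python: a position is evaluated by averaging the outcomes of complete games sampled from a rollout policy p, here a uniformly random policy on a toy game (a small Nim variant) rather than Go. The game rules and helper names (legal_moves, rollout_value) are illustrative assumptions, not part of AlphaGo or of any prior Go program.

    import random

    # Monte-Carlo rollout evaluation: estimate the value of a state by playing
    # many complete games from it with a (here uniformly random) policy p and
    # averaging the outcomes. Toy game: Nim, remove 1 or 2 stones per turn,
    # the player who removes the last stone wins.

    def legal_moves(stones):
        """Moves available in this Nim variant: remove 1 or 2 stones."""
        return [m for m in (1, 2) if m <= stones]

    def rollout_value(stones, player_to_move, n_rollouts=1000):
        """Estimated win rate for the player to move, averaged over rollouts."""
        total = 0.0
        for _ in range(n_rollouts):
            s, to_move = stones, player_to_move
            while s > 0:
                s -= random.choice(legal_moves(s))  # sample an action from policy p
                to_move = 1 - to_move
            winner = 1 - to_move                     # the player who just moved wins
            total += 1.0 if winner == player_to_move else 0.0
        return total / n_rollouts

    if __name__ == "__main__":
        # One-ply search: evaluate each successor position by rollouts.
        stones = 7
        for move in legal_moves(stones):
            opp_value = rollout_value(stones - move, player_to_move=1)
            print(f"take {move}: estimated win rate {1.0 - opp_value:.2f}")

Averaging more rollouts reduces the variance of each estimate, which is one sense in which the values become more accurate as more simulations are executed; MCTS additionally reuses these statistics to bias action selection towards higher-valued children.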
Recently, deep convolutional neural networks have achieved unprecedented performance in visual domains: for example image classification^17, face recognition^18, and playing Atari games^19. They use many layers of neurons, each arranged in overlapping tiles, to construct increasingly abstract, localised representations of an image^20. We employ a similar architecture for the game of Go. We pass in the board position as a 19 × 19 image and use convolutional layers to construct a representation of the position. We use these neural networks to reduce the effective depth and breadth of the search tree: evaluating positions using a value network, and sampling actions using a policy network.

We train the neural networks using a pipeline consisting of several stages of machine learning (Figure 1). We begin by training a supervised learning (SL) policy network, p_σ, directly from expert human moves. This provides fast, efficient learning updates with immediate feedback and high quality gradients. Similar to prior work^13,15, we also train a fast policy p_π that can rapidly sample actions during rollouts. Next, we train a reinforcement learning (RL) policy network, p_ρ, that improves the SL policy network by optimising the final outcome of games of self-play. This adjusts the policy towards the correct goal of winning games, rather than maximizing predictive accuracy. Finally, we train a value network v_θ that predicts the winner of games played by the RL policy network against itself. Our program AlphaGo efficiently combines the policy and value networks with MCTS.

1 Supervised Learning of Policy Networks

For the first stage of the training pipeline, we buil
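To make the SL stage of the pipeline concrete, the following is a minimal sketch of a supervised-learning policy network, assuming PyTorch: a small convolutional network over 19 × 19 input planes that outputs one logit per board point and is trained by cross-entropy to predict the human expert move, i.e. to maximise log p_σ(a|s). The layer sizes, number of input planes and training loop are illustrative stand-ins, not the architecture or features reported here.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    BOARD = 19

    class PolicyNet(nn.Module):
        """Toy convolutional policy network over 19x19 feature planes."""
        def __init__(self, in_planes=4, filters=32, layers=3):
            super().__init__()
            convs = [nn.Conv2d(in_planes, filters, 5, padding=2)]
            convs += [nn.Conv2d(filters, filters, 3, padding=1) for _ in range(layers - 1)]
            self.convs = nn.ModuleList(convs)
            self.head = nn.Conv2d(filters, 1, 1)   # one logit per board point

        def forward(self, planes):                  # planes: (batch, in_planes, 19, 19)
            x = planes
            for conv in self.convs:
                x = F.relu(conv(x))
            return self.head(x).flatten(1)          # (batch, 361) move logits

    def sl_training_step(net, optimiser, planes, expert_moves):
        """One gradient step increasing the likelihood of the expert move a at state s."""
        logits = net(planes)
        loss = F.cross_entropy(logits, expert_moves)   # -log p_sigma(a | s)
        optimiser.zero_grad()
        loss.backward()
        optimiser.step()
        return loss.item()

    if __name__ == "__main__":
        net = PolicyNet()
        opt = torch.optim.SGD(net.parameters(), lr=0.01)
        # Random placeholder tensors standing in for (state, expert move) pairs.
        planes = torch.randn(8, 4, BOARD, BOARD)
        moves = torch.randint(0, BOARD * BOARD, (8,))
        print("loss:", sl_training_step(net, opt, planes, moves))

The same network shape, with the final softmax over board points, can be reused for the RL policy network; only the training signal changes, from the expert move to the final game outcome of self-play.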