Mastering the game of Go without human knowledge
Presentation: 邢翔瑞, 2017.11.13
DeepMind, 5 New Street Square, London EC4A 3TW, UK
Nature, 19 Oct 2017

Contents (目录)
01 Motivation
02 Methods
03 Experiments
04 Conclusion

01 Motivation
01 A long-standing goal of artificial intelligence is an algorithm that learns, tabula rasa, superhuman proficiency in challenging domains.
02 Expert datasets are often expensive, unreliable or simply unavailable.
03 Supervised learning imposes a ceiling on the performance of systems trained in this manner.

02 Methods

Differences from AlphaGo
01 First and foremost, it is trained solely by self-play reinforcement learning, starting from random play, without any supervision or use of human data.
02 Second, it uses only the black and white stones from the board as input features.
03 Third, it uses a single neural network, rather than separate policy and value networks.

Network
01 A deep neural network with parameters θ.
02 Input: the raw board representation of the position and its history.
03 Output: both move probabilities p and a value v.
04 The vector of move probabilities p represents the probability of selecting each move a (including pass).
05 The scalar evaluation v estimates the probability of the current player winning from position s.
06 The neural network consists of many residual blocks of convolutional layers with batch normalization and rectifier nonlinearities.

Self-play training pipeline
- MCTS: the MCTS search outputs probabilities π of playing each move.
- Powerful policy improvement operator: these search probabilities usually select much stronger moves than the raw move probabilities p of the neural network.
- Powerful policy evaluation operator: the game winner z serves as a sample of the value.
- The main idea of our reinforcement learning algorithm is to use these search operators repeatedly in a policy iteration procedure: the network is trained to more closely match the improved search probabilities and self-play winner (π, z).
- Self-play reinforcement learning; MCTS in AlphaGo Zero.

03 Experiments

Empirical evaluation: final performance of AlphaGo Zero and AlphaGo Lee
- AlphaGo Zero: 4 TPUs, 36 hours of training.
- AlphaGo Lee: 48 TPUs, several months of training.

AlphaGo Zero training
- 4.9 million games of self-play were generated, using 1,600 simulations for each MCTS, which corresponds to approximately 0.4 s thinking time per move.
- Parameters were updated from 700,000 mini-batches of 2,048 positions.
- The neural network contained 20 residual blocks.

Performance of AlphaGo Zero

04 Conclusion
01 A pure reinforcement learning approach is fully feasible, even in the most challenging of domains.
02 It is possible to train to superhuman level, without human examples or guidance, given no knowledge of the domain beyond basic rules.
03 Humankind has accumulated Go knowledge from millions of games played over thousands of years, collectively distilled into patterns, proverbs and books. In the space of a few days, starting tabula rasa, AlphaGo Zero was able to rediscover much of this Go knowledge, as well as novel strategies that provide new insights into the oldest of games.
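Appendix: the training target described above — matching (p, v) to the search probabilities π and game winner z — is, in the paper, a single combined loss: squared value error, policy cross-entropy, and an L2 weight penalty. The sketch below evaluates that loss for one position; the function name and the epsilon guard inside the log are my own illustrative choices.

```python
import numpy as np

def alphazero_loss(p, v, pi, z, theta, c=1e-4):
    """Combined AlphaGo Zero loss: (z - v)^2 - pi . log p + c * ||theta||^2.

    p:     move probabilities from the network's policy head
    v:     value prediction in [-1, 1]
    pi:    MCTS search probabilities (the policy training target)
    z:     game outcome from the current player's perspective (+1 or -1)
    theta: flat vector of network weights, for L2 regularisation
    """
    value_loss = (z - v) ** 2
    policy_loss = -np.dot(pi, np.log(p + 1e-12))  # epsilon avoids log(0)
    l2_penalty = c * np.dot(theta, theta)
    return value_loss + policy_loss + l2_penalty
```

With a uniform policy p over two moves, a one-hot search target π, v = 0 and z = 1, the loss is 1 + log 2 ≈ 1.693: the value term contributes 1 and the cross-entropy term log 2.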
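The "policy improvement" role of MCTS can be made concrete with the two formulas the paper uses: actions inside the tree are selected by the PUCT rule, maximising Q(a) + U(a) with U(a) = c_puct · P(a) · √Σ_b N(b) / (1 + N(a)), and the final search probabilities are π(a) ∝ N(a)^(1/τ) over root visit counts. A minimal numpy sketch (function names are mine):

```python
import numpy as np

def puct_scores(Q, N, P, c_puct=1.0):
    """PUCT selection scores Q(a) + U(a) used during the tree search.

    Q: mean action values, N: visit counts, P: prior move probabilities
    from the network. U favours high-prior, rarely visited moves.
    """
    return Q + c_puct * P * np.sqrt(N.sum()) / (1 + N)

def search_probabilities(N, tau=1.0):
    """Search probabilities pi(a) proportional to N(a)^(1/tau).

    tau -> 0 plays (almost) greedily; tau = 1 samples in proportion
    to visit counts, as used for the first moves of self-play games.
    """
    x = N ** (1.0 / tau)
    return x / x.sum()
```

For example, with equal priors P = (0.5, 0.5), zero Q and visit counts N = (1, 3), the less-visited move gets the larger exploration bonus (0.5 vs 0.25), while the search probabilities π = (0.25, 0.75) favour the move the search actually preferred.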
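The claim that AlphaGo Zero "uses only the black and white stones from the board as input features" corresponds, in the paper, to a 17-plane 19 × 19 input: 8 binary planes of the current player's stones over the last 8 positions, 8 of the opponent's, and one constant plane for the colour to play. A sketch of that encoding (the function and board convention below are my own assumptions, not the paper's code):

```python
import numpy as np

def encode_position(history, to_play, size=19, T=8):
    """Stack recent boards into 2*T + 1 = 17 binary feature planes.

    history: list of (size, size) boards, most recent last, with
             +1 for black stones, -1 for white, 0 for empty.
    to_play: +1 if black is to move, -1 if white.
    """
    x = np.zeros((2 * T + 1, size, size), dtype=np.float32)
    for i, board in enumerate(reversed(history[-T:])):
        x[i] = (board == to_play)       # current player's stones
        x[T + i] = (board == -to_play)  # opponent's stones
    x[2 * T] = 1.0 if to_play == 1 else 0.0  # colour plane
    return x
```

Positions older than available history simply stay all-zero, which is how the encoding handles the opening moves of a game.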