Mastering the game of Go without human knowledge
Presentation: 邢翔瑞, 2017.11.13
DeepMind, 5 New Street Square, London EC4A 3TW, UK
Nature, 19 Oct 2017

Contents (目录)
01 Motivation
02 Methods
03 Experiments
04 Conclusion

01 Motivation
01 A long-standing goal of artificial intelligence is an algorithm that learns, tabula rasa, superhuman proficiency in challenging domains.
02 Expert datasets are often expensive, unreliable or simply unavailable.
03 Supervised learning imposes a ceiling on the performance of systems trained in this manner.

02 Methods

Differences from AlphaGo
01 First and foremost, it is trained solely by self-play reinforcement learning, starting from random play, without any supervision or use of human data.
02 Second, it uses only the black and white stones from the board as input features.
03 Third, it uses a single neural network, rather than separate policy and value networks.

Network
01 A deep neural network with parameters θ.
02 Input: the raw board representation of the position and its history.
03 Output: both move probabilities p and a value v.
04 The vector of move probabilities p represents the probability of selecting each move a (including pass).
05 The scalar evaluation v estimates the probability of the current player winning from position s.
06 The neural network consists of many residual blocks of convolutional layers with batch normalization and rectifier nonlinearities.

Self-play training pipeline
- MCTS: the MCTS search outputs probabilities π of playing each move.
- Powerful policy improvement operator: these search probabilities usually select much stronger moves than the raw move probabilities p of the neural network.
- Powerful policy evaluation operator: the game winner z serves as a sample of the value.
- The main idea of our reinforcement learning algorithm is to use these search operators repeatedly in a policy iteration procedure: the network is trained to more closely match the improved search probabilities and self-play winner (π, z).
- Self-play reinforcement learning; MCTS in AlphaGo Zero.

03 Experiments

Empirical evaluation: final performance of AlphaGo Zero and AlphaGo Lee
- AlphaGo Zero: 4 TPUs, 36 hours of training.
- AlphaGo Lee: 48 TPUs, several months of training.

AlphaGo Zero training
- 4.9 million games of self-play were generated, using 1,600 simulations for each MCTS, which corresponds to approximately 0.4 s thinking time per move.
- Parameters were updated from 700,000 mini-batches of 2,048 positions.
- The neural network contained 20 residual blocks.

Performance of AlphaGo Zero

04 Conclusion
01 A pure reinforcement learning approach is fully feasible, even in the most challenging of domains.
02 It is possible to train to superhuman level, without human examples or guidance, given no knowledge of the domain beyond basic rules.
03 Humankind has accumulated Go knowledge from millions of games played over thousands of years, collectively distilled into patterns, proverbs and books. In the space of a few days, starting tabula rasa, AlphaGo Zero was able to rediscover much of this Go knowledge, as well as novel strategies that provide new insights into the oldest of games.
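Appendix: the training target described above — matching (p, v) to the search probabilities π and game winner z — is, in the paper, a single combined loss: squared value error, policy cross-entropy, and an L2 weight penalty. The sketch below evaluates that loss for one position; the function name and the epsilon guard inside the log are my own illustrative choices.

```python
import numpy as np

def alphazero_loss(p, v, pi, z, theta, c=1e-4):
    """Combined AlphaGo Zero loss: (z - v)^2 - pi . log p + c * ||theta||^2.

    p:     move probabilities from the network's policy head
    v:     value prediction in [-1, 1]
    pi:    MCTS search probabilities (the policy training target)
    z:     game outcome from the current player's perspective (+1 or -1)
    theta: flat vector of network weights, for L2 regularisation
    """
    value_loss = (z - v) ** 2
    policy_loss = -np.dot(pi, np.log(p + 1e-12))  # epsilon avoids log(0)
    l2_penalty = c * np.dot(theta, theta)
    return value_loss + policy_loss + l2_penalty
```

With a uniform policy p over two moves, a one-hot search target π, v = 0 and z = 1, the loss is 1 + log 2 ≈ 1.693: the value term contributes 1 and the cross-entropy term log 2.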
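The "policy improvement" role of MCTS can be made concrete with the two formulas the paper uses: actions inside the tree are selected by the PUCT rule, maximising Q(a) + U(a) with U(a) = c_puct · P(a) · √Σ_b N(b) / (1 + N(a)), and the final search probabilities are π(a) ∝ N(a)^(1/τ) over root visit counts. A minimal numpy sketch (function names are mine):

```python
import numpy as np

def puct_scores(Q, N, P, c_puct=1.0):
    """PUCT selection scores Q(a) + U(a) used during the tree search.

    Q: mean action values, N: visit counts, P: prior move probabilities
    from the network. U favours high-prior, rarely visited moves.
    """
    return Q + c_puct * P * np.sqrt(N.sum()) / (1 + N)

def search_probabilities(N, tau=1.0):
    """Search probabilities pi(a) proportional to N(a)^(1/tau).

    tau -> 0 plays (almost) greedily; tau = 1 samples in proportion
    to visit counts, as used for the first moves of self-play games.
    """
    x = N ** (1.0 / tau)
    return x / x.sum()
```

For example, with equal priors P = (0.5, 0.5), zero Q and visit counts N = (1, 3), the less-visited move gets the larger exploration bonus (0.5 vs 0.25), while the search probabilities π = (0.25, 0.75) favour the move the search actually preferred.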
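The claim that AlphaGo Zero "uses only the black and white stones from the board as input features" corresponds, in the paper, to a 17-plane 19 × 19 input: 8 binary planes of the current player's stones over the last 8 positions, 8 of the opponent's, and one constant plane for the colour to play. A sketch of that encoding (the function and board convention below are my own assumptions, not the paper's code):

```python
import numpy as np

def encode_position(history, to_play, size=19, T=8):
    """Stack recent boards into 2*T + 1 = 17 binary feature planes.

    history: list of (size, size) boards, most recent last, with
             +1 for black stones, -1 for white, 0 for empty.
    to_play: +1 if black is to move, -1 if white.
    """
    x = np.zeros((2 * T + 1, size, size), dtype=np.float32)
    for i, board in enumerate(reversed(history[-T:])):
        x[i] = (board == to_play)       # current player's stones
        x[T + i] = (board == -to_play)  # opponent's stones
    x[2 * T] = 1.0 if to_play == 1 else 0.0  # colour plane
    return x
```

Positions older than available history simply stay all-zero, which is how the encoding handles the opening moves of a game.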