AlphaGo Zero Paper Walkthrough

Mastering the game of Go without human knowledge
Presentation: 邢翔瑞, 2017.11.13
DeepMind, 5 New Street Square, London EC4A 3TW, UK. Nature, 19 Oct 2017.

Contents
01 Motivation
02 Methods
03 Experiments
04 Conclusion

01 Motivation

01 A long-standing goal of artificial intelligence is an algorithm that learns, tabula rasa, superhuman proficiency in challenging domains.
02 Expert data sets are often expensive, unreliable or simply unavailable.
03 Supervised learning imposes a ceiling on the performance of systems trained in this manner.

02 Methods

Differences from AlphaGo
01 First and foremost, it is trained solely by self-play reinforcement learning, starting from random play, without any supervision or use of human data.
02 Second, it uses only the black and white stones from the board as input features.
03 Third, it uses a single neural network, rather than separate policy and value networks.

Network
01 A deep neural network
02 with parameters θ.
03 Input: the raw board representation of the position and its history.
04 Output: both move probabilities and a value.
05 The vector of move probabilities p represents the probability of selecting each move a (including pass).
06 The value v is a scalar evaluation, estimating the probability of the current player winning from position s.
The neural network consists of many residual blocks of convolutional layers with batch normalization and rectifier nonlinearities.

Self-play training pipeline
MCTS: the MCTS search outputs probabilities π of playing each move.
Powerful policy improvement operator: these search probabilities usually select much stronger moves than the raw move probabilities p of the neural network.
Powerful policy evaluation operator: the game winner z is used as a sample of the value.
The main idea of our reinforcement learning algorithm is to use these search operators repeatedly in a policy iteration procedure: the network parameters are updated so that its move probabilities and value more closely match the improved search probabilities and the self-play winner.

[Figure: Self-play reinforcement learning]
[Figure: MCTS in AlphaGo Zero]

03 Experiments

Empirical evaluation
AlphaGo Zero: 4 TPUs, 36 hours.
AlphaGo Lee: 48 TPUs, several months.
[Figure: Final performance of AlphaGo Zero and AlphaGo Lee]

AlphaGo Zero
4.9 million games of self-play were generated, using 1,600 simulations for each MCTS, which corresponds to approximately 0.4 s thinking time per move. Parameters were updated from 700,000 mini-batches of 2,048 positions. The neural network contained 20 residual blocks.
[Figure: Performance of AlphaGo Zero]

04 Conclusion

01 A pure reinforcement learning approach is fully feasible, even in the most challenging of domains.
02 It is possible to train to superhuman level, without human examples or guidance, given no knowledge of the domain beyond basic rules.
03 Humankind has accumulated Go knowledge from millions of games played over thousands of years, collectively distilled into patterns, proverbs and books. In the space of a few days, starting tabula rasa, AlphaGo Zero was able to rediscover much of this Go knowledge, as well as novel strategies that provide new insights into the oldest of games.
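
The slides state that MCTS outputs search probabilities π for each move and that the network is trained to more closely match those probabilities and the self-play winner z. The following is a minimal Python sketch of those two steps, not the authors' implementation. It assumes that π is taken proportional to root visit counts raised to 1/temperature, and that the loss combines a squared value error, a policy cross-entropy and an L2 penalty. The function names, the `temperature` and `c` parameters, and the toy numbers are all hypothetical and chosen for illustration only.

```python
import math


def search_probabilities(visit_counts, temperature=1.0):
    """Turn MCTS root visit counts N(s, a) into move probabilities pi.

    Assumption: pi(a) is proportional to N(s, a) ** (1 / temperature).
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [n ** (1.0 / temperature) for n in visit_counts]
    total = sum(scaled)
    if total == 0:
        raise ValueError("at least one move must have been visited")
    return [s / total for s in scaled]


def training_loss(p, v, pi, z, weights=(), c=1e-4, eps=1e-12):
    """Assumed loss: (z - v)^2 - pi . log(p) + c * ||weights||^2.

    p       : network move probabilities (same length as pi)
    v       : network value estimate for the position
    pi      : search probabilities produced by MCTS
    z       : game outcome from the current player's view (+1 or -1)
    weights : flat iterable of network parameters (hypothetical)
    eps     : small constant to avoid log(0)
    """
    value_loss = (z - v) ** 2
    policy_loss = -sum(t * math.log(q + eps) for t, q in zip(pi, p))
    l2_penalty = c * sum(w * w for w in weights)
    return value_loss + policy_loss + l2_penalty


# Toy example: three legal moves with made-up visit counts.
pi = search_probabilities([10, 30, 60])
loss = training_loss(p=[0.2, 0.3, 0.5], v=0.1, pi=pi, z=1.0)
print(pi, loss)
```

The loss shrinks as p moves toward π and v moves toward z, which is the sense in which search acts as a policy improvement step and the game outcome as a policy evaluation step.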
