FloodSung · songrotek@qq.com · 2016.4.1

2016.3.15: AlphaGo completes its 4:1 victory over Lee Sedol.

Outline
• AlphaGo: the two key papers and the core idea
• AlphaGo's supervised-learning networks: the SL policy network and the rollout policy
• AlphaGo's reinforcement-learning networks: the RL policy network and the value network
• AlphaGo's MCTS: how the networks are combined
• AlphaGo and AI

1 AlphaGo: the key papers
• Silver, David, et al. "Mastering the Game of Go with Deep Neural Networks and Tree Search." Nature, 2016.
• Maddison, Chris J., et al. "Move Evaluation in Go Using Deep Convolutional Neural Networks." ICLR 2015.
• Earlier Go programs: plain MCTS with handcrafted rollout policies
• AlphaGo: MCTS + deep neural networks ("MCTS++"), with the networks trained end to end from data

2 AlphaGo: the supervised-learning networks
• Rollout Policy — a fast but rough move predictor, used for rollouts
• SL Policy Network — a deep, accurate move predictor trained on human games

2.1 SL Policy Network
Architecture (13-layer CNN; a code sketch is given in the appendix):
• Input: 19x19x48 feature planes
• Conv1: 5x5x48x192, stride 1, pad 2 → 19x19x192, ReLU
• Conv2–Conv12: 3x3x192x192, stride 1, pad 1 → 19x19x192, ReLU
• Conv13: filters 1x1x192x1, stride 1, pad 0 → 19x19
• Softmax over the 19x19 = 361 board points
• 3,882,240 parameters!

Training setup:
• Input: the 19x19x48 binary feature planes
• Output: a probability distribution over the next move
• Cost: cross-entropy loss — the target is one-hot (1 for the move the expert actually played, 0 everywhere else), so minimising cross-entropy is the same as maximising the log-likelihood of the expert move (the standard one-hot + cross-entropy setup, as in TensorFlow)
• Training: SGD on 50 GPUs for about 3 weeks
• Training steps: 340,000,000
• Batch size: 16
• Learning rate: 0.003, halved every 80,000,000 steps (0.003 → 0.0015 → …)

Input features (48 planes in total):
• Stone colour — 3 planes (own stones / opponent stones / empty)
• Turns since each move was played — 8 planes
• Liberties — 8 planes
• Capture size / self-atari size — 8 planes each
• Ladder capture / ladder escape — 1 plane each
• Sensibleness (legal and does not fill the player's own eye) — 1 plane
• Plus liberties after the move and constant ones/zeros planes
• Integer features such as liberties are one-hot encoded across their 8 planes (P1 … P8): a stone with 3 liberties switches on plane 3

Training data:
• 160,000 games from the KGS Go Server, played between 6 and 9 dan players: 29.4 million positions
• Test set: 1 million positions; training set: 28.4 million positions
• Each position is augmented with the 8 rotations/reflections of the board

Results:
• 57.0% move-prediction accuracy on the test set (all input features)
• 55.7% using only the raw board position and move history
• ~3 ms to evaluate one position

From the papers:
• "It is clear that the neural network has implicitly understood many sophisticated aspects of Go, including good shape, Fuseki, Joseki, Tesuji, Ko fights, territory and influence."
• "It is remarkable that a single, unified, straightforward architecture can master these elements of the game to such a degree, and without any explicit lookahead."
• Unreasonably effective!

2.2 Rollout Policy Network
• Used for the fast rollouts inside MCTS
• A simple linear softmax over small, local pattern features — no deep network
• Accuracy: 24.2% (vs 57.0% for the SL policy network)
• But only ~2 μs per move, vs ~3 ms for the SL policy network

3 AlphaGo: the reinforcement-learning networks
• RL Policy Network — the SL policy network further improved by self-play
• Value Network — evaluates a position: who is going to win?

3.1 RL Policy Network
Training loop:
• Step 1: initialise the RL policy network with the SL policy network's weights
• Step 2: play games between the current network and a randomly chosen earlier version from an opponent pool
• Step 3: update the weights with the REINFORCE policy gradient, using the game result as the reward
• Step 4: every 500 iterations, add the current parameters to the opponent pool and go back to Step 2

Policy gradient RL — the ingredients:
• Reward: 0 at every non-terminal step; at the end of the game +1 for a win, −1 for a loss
• Return G_t: the sum of future rewards, which here is simply the final outcome z_t
• Value: the expected return from a state

Policy gradient RL — the update (a code sketch is given in the appendix):
• Objective: maximise the expected return
• Policy Gradient Theorem / REINFORCE: use the sampled return G_t as an unbiased estimate of Q(s_t, a_t)
• In AlphaGo, for game i and time step t:
  Δρ ∝ Σ_i Σ_t ∂log p_ρ(a_t^i | s_t^i)/∂ρ · (z_t^i − v(s_t^i))
• v(s_t) is a baseline function (REINFORCE with baseline): it reduces the variance of the gradient without biasing it; AlphaGo uses the value network here
• Games are played in parallel and the gradient is averaged over a minibatch of games

Result:
• The RL policy network wins more than 80% of its games against the SL policy network

3.2 Value Network
• Goal: a single evaluation v(s) ≈ expected outcome of the game from position s
• Same convolutional trunk as the policy network, but trained as a regression (CNN regression) rather than a classifier
• Training data: 30,000,000 positions, each taken from a different self-play game (many positions from the same game would overfit, because consecutive positions are strongly correlated)

How each self-play training example is generated:
• Step 1: sample a time step U uniformly from {1, …, 450}
• Step 2: play moves 1 … U−1 with the SL policy network
• Step 3: play one random legal move at step U
• Step 4: play moves U+1 … T (the end of the game) with the RL policy network
• Step 5: record the position after move U together with the final outcome z as one training pair

Architecture (a code sketch is given in the appendix):
• Input: 19x19x48 feature planes (the paper adds one extra plane, the colour to play)
• Conv1: 5x5x48x192, stride 1, pad 2 → 19x19x192, ReLU
• Conv2–Conv12: 3x3x192x192, stride 1, pad 1 → 19x19x192, ReLU
• Conv13: filters 1x1x192x1, stride 1, pad 0 → 19x19
• Fully connected: 256 units, ReLU
• Fully connected: 1 unit, tanh → a value in [−1, 1]
• Loss: MSE between the predicted value and the actual game outcome z

4 AlphaGo: combining the networks with MCTS
For every move, AlphaGo runs a large number of simulations; roughly, each simulation is (see the sketch after the appendix lead-in):
• Step 1 (selection): walk down the tree with the tree policy, at each node picking the action that maximises Q(s,a) + u(s,a); the bonus u(s,a) is proportional to the prior P(s,a) from the SL policy network and shrinks as the visit count grows; stop at a leaf L (the tree policy is not the rollout policy)
• Step 2 (evaluation): score the leaf L in two ways — with the value network, and by playing the game out with the rollout policy — and mix the two results with weight 0.5 each
• Step 3 (backup): propagate the leaf value back up the path, updating the visit counts N and the action values Q
• Step 4 (expansion): once a leaf has been visited often enough it is expanded, and its new edges get prior probabilities from the SL policy network
• Step 5: after n simulations, the search stops and AlphaGo plays the most visited move at the root

Discussion:
• The priors in the tree come from the SL policy network, not the RL one: according to the paper, the SL policy works better here, presumably because human experts choose among a diverse set of promising moves while the RL policy concentrates on a single best move; the RL policy network still matters, because the value network is trained from its self-play games
• AlphaGo vs Lee Sedol, Round 4: Lee Sedol's unexpected move 78 led into a position that the value network and the rollouts evaluated incorrectly, and AlphaGo only recognised the problem around move 87 — the weakness appears to lie in the value network rather than in the policy networks

AlphaGo and AI
• AlphaGo = MCTS + deep neural networks
• How far is AlphaGo from Artificial General Intelligence?
• NN + RL = True AI for Special Purpose
• Looking ahead: RNNs (Turing complete), one-shot learning, unsupervised learning

References
• [3] Silver, David, et al. "Mastering the game of Go with deep neural networks and tree search." Nature 529.7587 (2016): 484-489.
• [4] Maddison, Chris J., et al. "Move evaluation in Go using deep convolutional neural networks." arXiv preprint arXiv:1412.6564 (2014).

Thank You
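Appendix: code sketches

A minimal sketch of the SL policy network from Section 2.1, written with tf.keras purely as an illustration. The layer shapes follow the slides; the framework, optimiser object and bias handling are assumptions of this sketch, not details of DeepMind's implementation.

```python
import tensorflow as tf

def build_sl_policy_network():
    board = tf.keras.Input(shape=(19, 19, 48))                # 48 binary feature planes
    # Conv1: 5x5, 192 filters; "same" padding = pad 2, keeps the 19x19 size
    x = tf.keras.layers.Conv2D(192, 5, padding="same", activation="relu")(board)
    # Conv2 - Conv12: eleven 3x3 layers, 192 filters each; "same" padding = pad 1
    for _ in range(11):
        x = tf.keras.layers.Conv2D(192, 3, padding="same", activation="relu")(x)
    # Conv13: 1x1 convolution down to a single 19x19 plane of move scores
    x = tf.keras.layers.Conv2D(1, 1, padding="same")(x)
    # Softmax over all 19x19 = 361 points gives p(a | s)
    logits = tf.keras.layers.Flatten()(x)
    probs = tf.keras.layers.Softmax()(logits)
    return tf.keras.Model(board, probs)

policy_net = build_sl_policy_network()
# Cross-entropy against the one-hot expert move; 0.003 is the initial learning
# rate quoted on the slides (halved every 80M steps during real training).
policy_net.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.003),
                   loss="categorical_crossentropy")
```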
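A sketch of the REINFORCE update from Section 3.1, under the same assumptions. It takes self-play data already collected as (state, action, outcome) arrays and a `policy_model` like the one above; the learning rate and the zero default baseline are illustrative choices (the slides note that AlphaGo subtracts the value-network baseline v(s_t)).

```python
import tensorflow as tf

optimizer = tf.keras.optimizers.SGD(learning_rate=1e-3)       # illustrative value

def reinforce_step(policy_model, states, actions, outcomes, baseline=0.0):
    """One REINFORCE update.
    states:   [N, 19, 19, 48] feature planes
    actions:  [N] flat move indices in 0..360
    outcomes: [N] game results z_t, +1 for a win, -1 for a loss
    baseline: b(s_t); AlphaGo uses the value network, 0.0 here for simplicity
    """
    actions = tf.cast(actions, tf.int32)
    with tf.GradientTape() as tape:
        probs = policy_model(states, training=True)                    # [N, 361]
        idx = tf.stack([tf.range(tf.shape(actions)[0]), actions], axis=1)
        log_p = tf.math.log(tf.gather_nd(probs, idx) + 1e-8)           # log p(a_t | s_t)
        advantage = tf.cast(outcomes, tf.float32) - baseline           # z_t - b(s_t)
        # Gradient ascent on E[log p * (z - b)] == descent on its negative
        loss = -tf.reduce_mean(log_p * advantage)
    grads = tape.gradient(loss, policy_model.trainable_variables)
    optimizer.apply_gradients(zip(grads, policy_model.trainable_variables))
    return float(loss)
```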
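The value network of Section 3.2 differs only in its head: a 256-unit ReLU layer and a single tanh output, trained with MSE against the game outcome. A sketch under the same assumptions as above:

```python
import tensorflow as tf

def build_value_network():
    board = tf.keras.Input(shape=(19, 19, 48))                 # same planes as the policy net
    x = tf.keras.layers.Conv2D(192, 5, padding="same", activation="relu")(board)
    for _ in range(11):
        x = tf.keras.layers.Conv2D(192, 3, padding="same", activation="relu")(x)
    x = tf.keras.layers.Conv2D(1, 1, padding="same")(x)        # Conv13: 1x1, single plane
    x = tf.keras.layers.Flatten()(x)
    x = tf.keras.layers.Dense(256, activation="relu")(x)       # fully connected, 256 ReLU units
    value = tf.keras.layers.Dense(1, activation="tanh")(x)     # scalar v(s) in [-1, 1]
    return tf.keras.Model(board, value)

value_net = build_value_network()
value_net.compile(optimizer="sgd", loss="mse")                 # regression towards z = +/-1
```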
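Finally, the two formulas behind Steps 1 and 2 of the MCTS in Section 4, written as standalone functions. The node/edge representation and the exploration constant c_puct are assumptions of this sketch; only the argmax of Q + u selection rule and the 0.5 mixing of value network and rollout come from the slides.

```python
import math

def select_action(node, c_puct=5.0):
    """Step 1 (selection): pick argmax_a Q(s,a) + u(s,a).
    `node` maps each action to an edge dict with
      P - prior probability from the SL policy network,
      N - visit count,
      Q - mean action value.
    u(s,a) grows with the prior and shrinks as the edge gets visited."""
    total_visits = sum(edge["N"] for edge in node.values())
    def score(edge):
        u = c_puct * edge["P"] * math.sqrt(total_visits) / (1 + edge["N"])
        return edge["Q"] + u
    return max(node, key=lambda a: score(node[a]))

def leaf_value(value_net_estimate, rollout_outcome, lam=0.5):
    """Step 2 (evaluation): mix the value network's estimate with the rollout
    result; the slides quote an equal 0.5 / 0.5 weighting."""
    return (1 - lam) * value_net_estimate + lam * rollout_outcome
```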