2018-ICLR-On the Convergence of Adam and Beyond


Under review as a conference paper at ICLR 2018

ON THE CONVERGENCE OF ADAM AND BEYOND

Anonymous authors
Paper under double-blind review

ABSTRACT

Several recently proposed stochastic optimization methods that have been successfully used in training deep networks, such as RMSPROP, ADAM, ADADELTA, NADAM, etc., are based on using gradient updates scaled by square roots of exponential moving averages of squared past gradients. It has been empirically observed that sometimes these algorithms fail to converge to an optimal solution (or a critical point in nonconvex settings). We show that one cause for such failures is the exponential moving average used in the algorithms. We provide an explicit example of a simple convex optimization setting where ADAM does not converge to the optimal solution, and describe the precise problems with the previous analysis of the ADAM algorithm. Our analysis suggests that the convergence issues may be fixed by endowing such algorithms with "long-term memory" of past gradients, and we propose new variants of the ADAM algorithm which not only fix the convergence issues but often also lead to improved empirical performance.

1 INTRODUCTION

Stochastic gradient descent (SGD) is the dominant method to train deep networks today. This method iteratively updates the parameters of a model by moving them in the direction of the negative gradient of the loss evaluated on a minibatch. In particular, variants of SGD that scale coordinates of the gradient by square roots of some form of averaging of the squared coordinates in the past gradients have been particularly successful, because they automatically adjust the learning rate on a per-feature basis. The first popular algorithm in this line of research is ADAGRAD (Duchi et al., 2011; McMahan & Streeter, 2010), which can achieve significantly better performance compared to vanilla SGD when the gradients are sparse, or in general small.

Although ADAGRAD works well for sparse settings, its performance has been observed to deteriorate in settings where the loss functions are nonconvex and gradients are dense, due to rapid decay of the learning rate in these settings, since it uses all the past gradients in the update. This problem is especially exacerbated in high-dimensional problems arising in deep learning. To tackle this issue, several variants of ADAGRAD, such as RMSPROP (Tieleman & Hinton, 2012), ADAM (Kingma & Ba, 2015), ADADELTA (Zeiler, 2012), NADAM (Dozat, 2016), etc., have been proposed which mitigate the rapid decay of the learning rate using exponential moving averages of squared past gradients, essentially limiting the reliance of the update to only the past few gradients. While these algorithms have been successfully employed in several practical applications, they have also been observed to not converge in some other settings. It has been typically observed that in these settings some minibatches provide large gradients but only quite rarely, and while these large gradients are quite informative, their influence dies out rather quickly due to the exponential averaging, thus leading to poor convergence.

In this paper, we analyze this situation in detail. We rigorously prove that the intuition conveyed in the above paragraph is indeed correct; that is, limiting the reliance of the update to essentially only the past few gradients can indeed cause significant convergence issues. In particular, we make the following key contributions (a brief code sketch of the accumulators involved appears after this list):

- We elucidate how the exponential moving average in the RMSPROP and ADAM algorithms can cause non-convergence by providing an example of a simple convex optimization problem where RMSPROP and ADAM provably do not converge to an optimal solution. Our analysis easily extends to other algorithms using exponential moving averages, such as ADADELTA and NADAM, as well, but we omit this for the sake of clarity. In fact, the analysis is flexible enough to extend to other algorithms that employ averaging of squared gradients over essentially a fixed-size window in the immediate past (for exponential moving averages, the influence of gradients beyond a fixed window size becomes negligibly small). We omit the general analysis in this paper for the sake of clarity.
- The above result indicates that in order to have guaranteed convergence, the optimization algorithm must have "long-term memory" of past gradients. Specifically, we point out a problem with the proof of convergence of the ADAM algorithm given by Kingma & Ba (2015). To resolve this issue, we propose new variants of ADAM which rely on long-term memory of past gradients, but can be implemented with the same time and space requirements as the original ADAM algorithm. We provide a convergence analysis for the new variants in the convex setting, based on the analysis of Kingma & Ba (2015), and show a data-dependent regret bound similar to the one in ADAGRAD.
- We provide a preliminary empirical study of one of the variants we propose and show that it performs either similarly or better on some commonly used problems in machine learning.
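To make the contrast above concrete, the following sketch shows the per-coordinate second-moment accumulators underlying these methods: ADAGRAD sums all past squared gradients, RMSPROP/ADAM replace the sum with an exponential moving average, and one natural way to add the "long-term memory" mentioned in the second contribution is to normalize by a running element-wise maximum of that moving average. This is a minimal illustration written for this summary, not the paper's pseudocode; the function names, default hyperparameters, and the specific long-memory fix shown are assumptions.

```python
# Minimal sketch (not the paper's pseudocode) of the second-moment
# accumulators discussed above; names and defaults are illustrative.
import numpy as np


def adagrad_step(x, grad, accum, lr=0.01, eps=1e-8):
    """ADAGRAD: accumulate *all* past squared gradients, so the effective
    per-coordinate learning rate lr / sqrt(accum) only ever decreases."""
    accum = accum + grad ** 2
    x = x - lr * grad / (np.sqrt(accum) + eps)
    return x, accum


def adam_step(x, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """ADAM/RMSPROP style: exponential moving averages, so gradients older
    than roughly 1/(1 - beta2) steps have negligible influence on v."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment EMA
    v = beta2 * v + (1 - beta2) * grad ** 2     # second-moment EMA
    m_hat = m / (1 - beta1 ** t)                # bias corrections
    v_hat = v / (1 - beta2 ** t)
    x = x - lr * m_hat / (np.sqrt(v_hat) + eps)
    return x, m, v


def long_memory_step(x, grad, m, v, v_max, lr=0.001,
                     beta1=0.9, beta2=0.999, eps=1e-8):
    """One way to endow the update with long-term memory of past gradients:
    scale by the running element-wise maximum of the second-moment EMA, so a
    rare large gradient is never forgotten. (The excerpt does not spell out
    the paper's variants; this is an illustrative AMSGrad-style fix.)"""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    v_max = np.maximum(v_max, v)                # never shrinks
    x = x - lr * m / (np.sqrt(v_max) + eps)
    return x, m, v, v_max
```

The contrast to note is that `accum` and `v_max` are non-decreasing, whereas `v` can shrink quickly once large gradients stop arriving, which is exactly the limited reliance on past gradients described above.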
2 PRELIMINARIES

Notation. We use $\mathcal{S}^d_+$ to denote the set of all positive definite $d \times d$ matrices. With slight abuse of notation, for a vector $a \in \mathbb{R}^d$ and a positive definite matrix $M \in \mathbb{R}^{d \times d}$, we use $a/M$ to denote $M^{-1}a$, $\|M_i\|_2$ to denote the $\ell_2$-norm of the $i$-th row of $M$, and $\sqrt{M}$ to represent $M^{1/2}$. Furthermore, for any vectors $a, b \in \mathbb{R}^d$, we use $\sqrt{a}$ for element-wise square root, $a^2$ for element-wise square, $a/b$ to denote element-wise division, and $\max(a, b)$ to denote element-wise maximum. For any vector $\theta_i \in \mathbb{R}^d$, $\theta_{i,j}$ denotes its $j$-th coordinate, where $j \in [d]$. The projection operation $\Pi_{\mathcal{F},A}(y)$ for $A \in \mathcal{S}^d_+$ is defined as $\arg\min_{x \in \mathcal{F}} \|A^{1/2}(x - y)\|$ for $y \in \mathbb{R}^d$. Finally, we say $\mathcal{F}$ has bounded diameter $D_\infty$ if $\|x - y\|_\infty \leq D_\infty$ for all $x, y \in \mathcal{F}$.

Optimization setup. A flexible framework to analyze iterative optimization methods is the online optimization problem in the full information feedback setting.
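A small worked expansion of the projection defined above may help connect it to the per-coordinate methods from the introduction; this diagonal special case is an inference from the definition, not something stated in the excerpt:

\[
\Pi_{\mathcal{F},A}(y) = \arg\min_{x \in \mathcal{F}} \big\|A^{1/2}(x - y)\big\|,
\qquad
A = \operatorname{diag}(v_1, \dots, v_d) \in \mathcal{S}^d_+
\;\Longrightarrow\;
\big\|A^{1/2}(x - y)\big\|^2 = \sum_{j=1}^{d} v_j (x_j - y_j)^2 ,
\]

so projecting with a diagonal $A$ is a weighted Euclidean projection onto $\mathcal{F}$ in which coordinates with larger weights $v_j$ are held closer to $y_j$.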

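Finally, the failure mode described in the introduction, rare but informative large gradients whose influence dies out under exponential averaging, can be seen directly in the second-moment statistics. The toy gradient stream below is only in the spirit of that description (the paper's explicit counterexample is not part of this excerpt), and the constants C, period, T, and beta2 are arbitrary assumptions.

```python
# Track only the second-moment statistics for a gradient stream that is -1
# on most steps and +C on a single rare step; constants are assumptions.
C, period, T, beta2 = 1000.0, 5000, 9999, 0.999
v = v_max = 0.0
for t in range(1, T + 1):
    g = C if t % period == 0 else -1.0
    v = beta2 * v + (1 - beta2) * g ** 2   # exponential moving average
    v_max = max(v_max, v)                  # "long-term memory" of the spike
print(f"EMA second moment: {v:.2f}   running maximum: {v_max:.2f}")
# By the end of the stream the spike's contribution to the EMA has decayed
# by roughly two orders of magnitude, while the running maximum retains it.
```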