UnderreviewasaconferencepaperatICLR2018ONTHECONVERGENCEOFADAMANDBEYONDAnonymousauthorsPaperunderdouble-blindreviewABSTRACTSeveralrecentlyproposedstochasticoptimizationmethodsthathavebeensuc-cessfullyusedintrainingdeepnetworkssuchasRMSPROP,ADAM,ADADELTA,NADAM,etcarebasedonusinggradientupdatesscaledbysquarerootsofex-ponentialmovingaveragesofsquaredpastgradients.Ithasbeenempiricallyob-servedthatsometimesthesealgorithmsfailtoconvergetoanoptimalsolution(oracriticalpointinnonconvexsettings).Weshowthatonecauseforsuchfailuresistheexponentialmovingaverageusedinthealgorithms.WeprovideanexplicitexampleofasimpleconvexoptimizationsettingwhereADAMdoesnotconvergetotheoptimalsolution,anddescribethepreciseproblemswiththepreviousanal-ysisofADAMalgorithm.Ouranalysissuggeststhattheconvergenceissuesmaybefixedbyendowingsuchalgorithmswith“long-termmemory”ofpastgradi-ents,andproposenewvariantsoftheADAMalgorithmwhichnotonlyfixtheconvergenceissuesbutoftenalsoleadtoimprovedempiricalperformance.1INTRODUCTIONStochasticgradientdescent(SGD)isthedominantmethodtotraindeepnetworkstoday.Thismethoditerativelyupdatestheparametersofamodelbymovingtheminthedirectionofthenegativegra-dientofthelossevaluatedonaminibatch.Inparticular,variantsofSGDthatscalecoordinatesofthegradientbysquarerootsofsomeformofaveragingofthesquaredcoordinatesinthepastgradientshavebeenparticularlysuccessful,becausetheyautomaticallyadjustthelearningrateonaper-featurebasis.ThefirstpopularalgorithminthislineofresearchisADAGRAD(Duchietal.,2011;McMahan&Streeter,2010),whichcanachievesignificantlybetterperformancecomparedtovanillaSGDwhenthegradientsaresparse,oringeneralsmall.AlthoughADAGRADworkswellforsparsesettings,itsperformancehasbeenobservedtodeteriorateinsettingswherethelossfunctionsarenonconvexandgradientsaredenseduetorapiddecayofthelearningrateinthesesettingssinceitusesallthepastgradientsintheupdate.Thisproblemisespeciallyexacerbatedinhighdimensionalproblemsarisingindeeplearning.Totacklethisissue,severalvariantsofADAGRAD,suchasRMSPROP(Tieleman&Hinton,2012),ADAM(Kingma&Ba,2015),ADADELTA(Zeiler,2012),NADAM(Dozat,2016),etc,havebeenproposedwhichmitigatetherapiddecayofthelearningrateusingtheexponentialmovingaveragesofsquaredpastgradients,essentiallylimitingtherelianceoftheupdatetoonlythepastfewgradients.Whilethesealgorithmshavebeensuccessfullyemployedinseveralpracticalapplications,theyhavealsobeenobservedtonotconvergeinsomeothersettings.Ithasbeentypicallyobservedthatinthesesettingssomeminibatchesprovidelargegradientsbutonlyquiterarely,andwhiletheselargegradientsarequiteinformative,theirinfluencediesoutratherquicklyduetotheexponentialaveraging,thusleadingtopoorconvergence.Inthispaper,weanalyzethissituationindetail.Werigorouslyprovethattheintuitionconveyedintheaboveparagraphisindeedcorrect;thatlimitingtherelianceoftheupdateonessentiallyonlythepastfewgradientscanindeedcausesignificantconvergenceissues.Inparticular,wemakethefollowingkeycontributions:WeelucidatehowtheexponentialmovingaverageintheRMSPROPandADAMalgorithmscancausenon-convergencebyprovidinganexampleofsimpleconvexoptimizationprob-lemwhereRMSPROPandADAMprovablydonotconvergetoanoptimalsolution.OuranalysiseasilyextendstootheralgorithmsusingexponentialmovingaveragessuchasADADELTAandNADAMaswell,butweomitthisforthesakeofclarity.Infact,the1UnderreviewasaconferencepaperatICLR2018analysisisflexibleenoughtoextendtootheralgorithmsthatemployaveragingsquaredgradientsoveressentiallyafixedsizewindow(forexponentialmovingaverages,theinflu-encesofgradientsbeyondafixedwindowsizebecomesnegligiblysmall)intheimmediatepast.Weomitthegeneralanalysisinthispaperforthesakeofclarity.Theaboveresultindicatesthatinordertohaveguaranteedconvergencetheoptimizationalgorithmmusthave“long-termmemory”ofpastgradients.Specifically,wepointoutaproblemwiththeproofofconvergenceoftheADAMalgorithmgivenbyKingma&Ba(2015).Toresolvethisissue,weproposenewvariantsofADAMwhichrelyonlong-termmemoryofpastgradients,butcanbeimplementedinthesametimeandspacerequirementsastheoriginalADAMalgorithm.Weprovideaconvergenceanalysisforthenewvariantsintheconvexsetting,basedontheanalysisofKingma&Ba(2015),andshowadata-dependentregretboundsimilartotheoneinADAGRAD.Weprovideapreliminaryempiricalstudyofoneofthevariantsweproposedandshowthatiteitherperformssimilarly,orbetter,onsomecommonlyusedproblemsinmachinelearning.2PRELIMINARIESNotation.WeuseS+dtodenotethesetofallpositivedefiniteddmatrices.Withslightabuseofnotation,foravectora2RdandapositivedefinitematrixM2RdRd,weusea=MtodenoteM 1a,kMik2todenote`2-normofithrowofMandpMtorepresentM1=2.Furthermore,foranyvectorsa;b2Rd,weusepaforelement-wisesquareroot,a2forelement-wisesquare,a=btodenoteelement-wisedivisionandmax(a;b)todenoteelement-wisemaximum.Foranyvectori2Rd,i;jdenotesitsjthcoordinatewherej2[d].TheprojectionoperationF;A(y)forA2Sd+isdefinedasargminx2FkA1=2(x y)kfory2Rd.Finally,wesayFhasboundeddiameterD1ifkx yk1D1forallx;y2F.Optimizationsetup.Aflexibleframeworktoanalyzeiterativeoptimizationmethodsistheonlineoptimizationprobleminthefullinformationfeedbackse