2018-ICLR-On the Convergence of Adam and Beyond


Under review as a conference paper at ICLR 2018

ON THE CONVERGENCE OF ADAM AND BEYOND

Anonymous authors
Paper under double-blind review

ABSTRACT

Several recently proposed stochastic optimization methods that have been successfully used in training deep networks, such as RMSPROP, ADAM, ADADELTA, NADAM, etc., are based on using gradient updates scaled by square roots of exponential moving averages of squared past gradients. It has been empirically observed that sometimes these algorithms fail to converge to an optimal solution (or a critical point in nonconvex settings). We show that one cause for such failures is the exponential moving average used in the algorithms. We provide an explicit example of a simple convex optimization setting where ADAM does not converge to the optimal solution, and describe the precise problems with the previous analysis of the ADAM algorithm. Our analysis suggests that the convergence issues may be fixed by endowing such algorithms with "long-term memory" of past gradients, and we propose new variants of the ADAM algorithm which not only fix the convergence issues but often also lead to improved empirical performance.

1 INTRODUCTION

Stochastic gradient descent (SGD) is the dominant method to train deep networks today. This method iteratively updates the parameters of a model by moving them in the direction of the negative gradient of the loss evaluated on a minibatch. In particular, variants of SGD that scale coordinates of the gradient by square roots of some form of averaging of the squared coordinates in the past gradients have been particularly successful, because they automatically adjust the learning rate on a per-feature basis. The first popular algorithm in this line of research is ADAGRAD (Duchi et al., 2011; McMahan & Streeter, 2010), which can achieve significantly better performance compared to vanilla SGD when the gradients are sparse, or in general small.

Although ADAGRAD works well for sparse settings, its performance has been observed to deteriorate in settings where the loss functions are nonconvex and gradients are dense, due to rapid decay of the learning rate in these settings, since it uses all the past gradients in the update. This problem is especially exacerbated in high-dimensional problems arising in deep learning. To tackle this issue, several variants of ADAGRAD, such as RMSPROP (Tieleman & Hinton, 2012), ADAM (Kingma & Ba, 2015), ADADELTA (Zeiler, 2012), NADAM (Dozat, 2016), etc., have been proposed which mitigate the rapid decay of the learning rate using exponential moving averages of squared past gradients, essentially limiting the reliance of the update to only the past few gradients. While these algorithms have been successfully employed in several practical applications, they have also been observed to not converge in some other settings. It has been typically observed that in these settings some minibatches provide large gradients but only quite rarely, and while these large gradients are quite informative, their influence dies out rather quickly due to the exponential averaging, thus leading to poor convergence.

In this paper, we analyze this situation in detail. We rigorously prove that the intuition conveyed in the above paragraph is indeed correct; that is, limiting the reliance of the update to essentially only the past few gradients can indeed cause significant convergence issues. In particular, we make the following key contributions (a brief code sketch of the accumulators involved appears after this list):

- We elucidate how the exponential moving average in the RMSPROP and ADAM algorithms can cause non-convergence by providing an example of a simple convex optimization problem where RMSPROP and ADAM provably do not converge to an optimal solution. Our analysis easily extends to other algorithms using exponential moving averages, such as ADADELTA and NADAM, as well, but we omit this for the sake of clarity. In fact, the analysis is flexible enough to extend to other algorithms that employ averaging of squared gradients over essentially a fixed-size window in the immediate past (for exponential moving averages, the influence of gradients beyond a fixed window size becomes negligibly small). We omit the general analysis in this paper for the sake of clarity.
- The above result indicates that in order to have guaranteed convergence, the optimization algorithm must have "long-term memory" of past gradients. Specifically, we point out a problem with the proof of convergence of the ADAM algorithm given by Kingma & Ba (2015). To resolve this issue, we propose new variants of ADAM which rely on long-term memory of past gradients, but can be implemented with the same time and space requirements as the original ADAM algorithm. We provide a convergence analysis for the new variants in the convex setting, based on the analysis of Kingma & Ba (2015), and show a data-dependent regret bound similar to the one in ADAGRAD.
- We provide a preliminary empirical study of one of the variants we propose and show that it performs either similarly or better on some commonly used problems in machine learning.
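To make the contrast above concrete, the following sketch shows the per-coordinate second-moment accumulators underlying these methods: ADAGRAD sums all past squared gradients, RMSPROP/ADAM replace the sum with an exponential moving average, and one natural way to add the "long-term memory" mentioned in the second contribution is to normalize by a running element-wise maximum of that moving average. This is a minimal illustration written for this summary, not the paper's pseudocode; the function names, default hyperparameters, and the specific long-memory fix shown are assumptions.

```python
# Minimal sketch (not the paper's pseudocode) of the second-moment
# accumulators discussed above; names and defaults are illustrative.
import numpy as np


def adagrad_step(x, grad, accum, lr=0.01, eps=1e-8):
    """ADAGRAD: accumulate *all* past squared gradients, so the effective
    per-coordinate learning rate lr / sqrt(accum) only ever decreases."""
    accum = accum + grad ** 2
    x = x - lr * grad / (np.sqrt(accum) + eps)
    return x, accum


def adam_step(x, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """ADAM/RMSPROP style: exponential moving averages, so gradients older
    than roughly 1/(1 - beta2) steps have negligible influence on v."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment EMA
    v = beta2 * v + (1 - beta2) * grad ** 2     # second-moment EMA
    m_hat = m / (1 - beta1 ** t)                # bias corrections
    v_hat = v / (1 - beta2 ** t)
    x = x - lr * m_hat / (np.sqrt(v_hat) + eps)
    return x, m, v


def long_memory_step(x, grad, m, v, v_max, lr=0.001,
                     beta1=0.9, beta2=0.999, eps=1e-8):
    """One way to endow the update with long-term memory of past gradients:
    scale by the running element-wise maximum of the second-moment EMA, so a
    rare large gradient is never forgotten. (The excerpt does not spell out
    the paper's variants; this is an illustrative AMSGrad-style fix.)"""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    v_max = np.maximum(v_max, v)                # never shrinks
    x = x - lr * m / (np.sqrt(v_max) + eps)
    return x, m, v, v_max
```

The contrast to note is that `accum` and `v_max` are non-decreasing, whereas `v` can shrink quickly once large gradients stop arriving, which is exactly the limited reliance on past gradients described above.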
2 PRELIMINARIES

Notation. We use $\mathcal{S}^d_+$ to denote the set of all positive definite $d \times d$ matrices. With slight abuse of notation, for a vector $a \in \mathbb{R}^d$ and a positive definite matrix $M \in \mathbb{R}^{d \times d}$, we use $a/M$ to denote $M^{-1}a$, $\|M_i\|_2$ to denote the $\ell_2$-norm of the $i$-th row of $M$, and $\sqrt{M}$ to represent $M^{1/2}$. Furthermore, for any vectors $a, b \in \mathbb{R}^d$, we use $\sqrt{a}$ for element-wise square root, $a^2$ for element-wise square, $a/b$ to denote element-wise division, and $\max(a, b)$ to denote element-wise maximum. For any vector $\theta_i \in \mathbb{R}^d$, $\theta_{i,j}$ denotes its $j$-th coordinate, where $j \in [d]$. The projection operation $\Pi_{\mathcal{F},A}(y)$ for $A \in \mathcal{S}^d_+$ is defined as $\arg\min_{x \in \mathcal{F}} \|A^{1/2}(x - y)\|$ for $y \in \mathbb{R}^d$. Finally, we say $\mathcal{F}$ has bounded diameter $D_\infty$ if $\|x - y\|_\infty \leq D_\infty$ for all $x, y \in \mathcal{F}$.

Optimization setup. A flexible framework to analyze iterative optimization methods is the online optimization problem in the full information feedback setting.
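A small worked expansion of the projection defined above may help connect it to the per-coordinate methods from the introduction; this diagonal special case is an inference from the definition, not something stated in the excerpt:

\[
\Pi_{\mathcal{F},A}(y) = \arg\min_{x \in \mathcal{F}} \big\|A^{1/2}(x - y)\big\|,
\qquad
A = \operatorname{diag}(v_1, \dots, v_d) \in \mathcal{S}^d_+
\;\Longrightarrow\;
\big\|A^{1/2}(x - y)\big\|^2 = \sum_{j=1}^{d} v_j (x_j - y_j)^2 ,
\]

so projecting with a diagonal $A$ is a weighted Euclidean projection onto $\mathcal{F}$ in which coordinates with larger weights $v_j$ are held closer to $y_j$.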

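Finally, the failure mode described in the introduction, rare but informative large gradients whose influence dies out under exponential averaging, can be seen directly in the second-moment statistics. The toy gradient stream below is only in the spirit of that description (the paper's explicit counterexample is not part of this excerpt), and the constants C, period, T, and beta2 are arbitrary assumptions.

```python
# Track only the second-moment statistics for a gradient stream that is -1
# on most steps and +C on a single rare step; constants are assumptions.
C, period, T, beta2 = 1000.0, 5000, 9999, 0.999
v = v_max = 0.0
for t in range(1, T + 1):
    g = C if t % period == 0 else -1.0
    v = beta2 * v + (1 - beta2) * g ** 2   # exponential moving average
    v_max = max(v_max, v)                  # "long-term memory" of the spike
print(f"EMA second moment: {v:.2f}   running maximum: {v_max:.2f}")
# By the end of the stream the spike's contribution to the EMA has decayed
# by roughly two orders of magnitude, while the running maximum retains it.
```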