Jorge Nocedal, Northwestern University
IPAM Summer School 2012
Tutorial on Optimization Methods for Machine Learning

Overview
1. We discuss some characteristics of optimization problems arising in deep learning, convex logistic regression, and inverse covariance estimation.
2. There are many tools at our disposal: first and second order methods; batch and stochastic algorithms; regularization; primal and dual approaches; parallel computing.
3. Yet the state of affairs with neural nets is confusing to me: too many challenges are confronted at once: local vs global minima, nonlinearity, stochasticity, initializations, heuristics.

Open Questions
We need to isolate questions related to optimization and study them in a controlled setting.
A key question is to understand the properties of stochastic vs batch methods in the context of deep learning.
After some clarity is obtained, we need to develop appropriate algorithms and work complexity bounds, both in a sequential and a parallel setting.

Organization
I will discuss a few optimization techniques and provide some insights into their strengths and weaknesses.
We will contrast the deterministic and stochastic settings; this motivates dynamic sample selection methods.
We emphasize the need for general purpose techniques, second order methods and scale invariance, vs heuristics.

My Background
Most of my practical experience in optimization methods for machine learning is in speech recognition (Google).
But I am aware of many tests done in a variety of machine learning applications, due to my involvement in L-BFGS, Newton-CG methods (Hessian-free), dynamic sampling methods, etc.
I am interested in designing new optimization methods for machine learning applications, in particular for deep learning.

Intriguing Statements
"The best optimization algorithms are not the best learning algorithms." Bottou and Bousquet (2009), … when taking into account the Estimation and Optimization Errors.
"Stochastic methods best in spite of being worst optimization methods."

Part I: Problem Characteristics and Second Order Methods

Nonlinearity, Ill Conditioning
Neural nets are far more nonlinear than the functions minimized in many other applications (Farabet).
"The rate of convergence of an optimization algorithm is still important, even though in practice one stops the iteration far from a minimizer."

General Formulation
An objective function (Le, …, Ng); loss function ℓ (logistic, least squares, etc.). Ignore the regularization term at first: unconstrained optimization.

  min_w J(w) = (1/m) Σ_{i=1}^{m} ℓ(w; (z_i, y_i)) + ν ||w||_1

where ℓ(w; (z_i, y_i)) is the loss, (z_i, y_i) is the training data, z_i a vector of features, and y_i a label.

The Newton-CG Method (Hessian-Free)
Problem: min f(x). At each iterate x_k, solve

  ∇²f(x_k) p = −∇f(x_k),   x_{k+1} = x_k + α p

- This is a linear system of equations of size n.
- Apply the Conjugate Gradient (CG) method to compute an approximate solution to this system.
- CG is an iterative method endowed with very interesting properties (optimal Krylov method).
- The Hessian need not be computed explicitly.
- Continuum between 1st and 2nd order methods.

Newton-CG: the Convex Case
Problem: min f(x); solve ∇²f(x_k) p = −∇f(x_k).
- We show below that any number of CG iterations yields a productive step.
- Not true of other iterative methods.
- Better direction and length than 1st order methods.

The Conjugate Gradient Method
Two equivalent problems:

  min φ(x) = (1/2) x^T A x − b^T x,   ∇φ(x) = Ax − b
  solve Ax = b,   r = Ax − b

CG directions: p_k = −r_k + β_k p_{k−1}, with β_k = (p_{k−1}^T A r_k) / (p_{k−1}^T A p_{k−1}).
Only the product A p_k is needed: Hessian-free.

Interaction between CG and Newton
Choose some initial point x_0 and initial direction p_0 = −r_0. For x_0 = 0, −r_0 = b.
For the linear system ∇²f(x_k) p = −∇f(x_k) we have Ax = b with r = Ax − b, so b = −∇f(x_k); we noted −r = b if x_0 = 0.
Conclusion: if we terminate the CG algorithm after 1 iteration, we obtain a steepest descent step.

Newton-CG Framework
Theorem (Newton-CG with any number of CG steps). Suppose that f is strictly convex. Consider the iteration

  ∇²f(x_k) p = −∇f(x_k) + r,   x_{k+1} = x_k + α p

where α is chosen by a backtracking Armijo line search. Then {x_k} → x*.
(Continuum from steepest descent, 1 CG iteration, to the exact Newton step, n CG iterations.)

Rates of Convergence – Scale Invariance
- The rate of convergence can be linear, superlinear, or quadratic, depending on the accuracy of the CG solution.
- It inherits some of the scale invariance properties of the exact Newton method: affine change of variables x ← Dx.
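To make the Hessian-free idea concrete, here is a minimal Python/NumPy sketch of one way to implement the Newton-CG iteration just described: the inner CG loop accesses the Hessian only through matrix-vector products, stops early when negative curvature is detected, and the outer step is globalized by a backtracking Armijo line search. The function and parameter names (hess_vec, cg_rel_tol, the 0.5 backtracking factor, the fallback to a steepest descent step) are illustrative assumptions, not taken from the slides.

import numpy as np

def newton_cg(f, grad, hess_vec, x0, max_outer=50, grad_tol=1e-6,
              cg_rel_tol=0.5, max_cg=100, c1=1e-4):
    # Sketch of a Hessian-free Newton-CG iteration.
    # hess_vec(x, v) returns the product (nabla^2 f(x)) v; the Hessian
    # itself is never formed or stored.
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(max_outer):
        g = grad(x)
        gnorm = np.linalg.norm(g)
        if gnorm < grad_tol:
            break
        # Inner CG loop on  nabla^2 f(x) p = -g,  started at p = 0,
        # so a single CG iteration reproduces a steepest descent step.
        p = np.zeros_like(x)
        r = g.copy()                      # residual A p - b with b = -g, p = 0
        d = -r
        for _ in range(max_cg):
            Ad = hess_vec(x, d)
            curv = d @ Ad
            if curv <= 0.0:               # negative curvature: terminate CG early
                if not p.any():
                    p = -g                # fall back to a steepest descent step
                break
            alpha = (r @ r) / curv
            p = p + alpha * d
            r_new = r + alpha * Ad
            if np.linalg.norm(r_new) <= cg_rel_tol * gnorm:
                break                     # inexact solve: accuracy controls the rate
            d = -r_new + ((r_new @ r_new) / (r @ r)) * d
            r = r_new
        # Backtracking Armijo line search along the (inexact) Newton direction.
        t, fx, slope = 1.0, f(x), g @ p
        for _ in range(30):
            if f(x + t * p) <= fx + c1 * t * slope:
                break
            t *= 0.5
        x = x + t * p
    return x

Terminating the inner loop after one iteration gives the steepest descent step noted above, while tightening cg_rel_tol pushes the step toward the exact Newton direction; this is the continuum between first and second order methods, and it is why the rate can range from linear to quadratic depending on the accuracy of the CG solution.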
Newton-CG: The Nonconvex Case
If the Hessian is not positive definite, solve the modified system

  [∇²f(x_k) + γI] p = −∇f(x_k),   γ > 0

If γ is large enough, the system is positive definite.
Let λ_1 ≤ λ_2 ≤ ... ≤ λ_n be the eigenvalues of ∇²f(x_k). Then the eigenvalues of [∇²f(x_k) + γI] are λ_1 + γ ≤ λ_2 + γ ≤ ... ≤ λ_n + γ.
It is difficult to choose γ; the trust region method learns γ.

The Nonconvex Case: Alternatives
Replace ∇²f(x_k) by a positive definite approximation B and solve B p = −∇f(x_k).
Option 1: Gauss-Newton matrix J(x_k) J(x_k)^T.
Option 2: Stop CG early when negative curvature is encountered.
Option 3: Trust region approach.
For least squares, the Gauss-Newton matrix can be seen as an optimal choice; inexact CG is equivalent to solving in a "subspace inverse".

The Nonconvex Case: CG Termination
Solve ∇²f(x_k) p = −∇f(x_k), iterating until negative curvature is encountered: v^T ∇²f(x_k) v < 0.
(Figure: negative curvature direction at x_k and the trust region.)
Trust region subproblem:

  min q(d) = d^T ∇²f(x_k) d + ∇f(x_k)^T d + f(x_k)   s.t. ||d|| ≤ Δ

History of Newton-CG
1. Proposed in the 1980s for unconstrained optimization and systems of equations (Polyak 1960 (bounds)).
2. Hessian-free option identified early on.
3. Trust region (1980s): robust technique for choosing γ.
4. Considered today a premier technique for large problems (together with nonlinear CG and L-BFGS).
5. Used in general nonlinear programming: Interior Point, Active Set, Augmented Lagrangian methods.
6. Application to stochastic problems (machine learning): Martens (2010), Byrd, Chin, Neveitt, Nocedal (2011).

Newton-CG and global minimization
1. I know of no argument to suggest that Newton-like methods a