Deep Residual Learning for Image Recognition

Kaiming He    Xiangyu Zhang    Shaoqing Ren    Jian Sun
Microsoft Research
{kahe, v-xiangz, v-shren, jiansun}@microsoft.com

Abstract

Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers, 8x deeper than VGG nets [41] but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers.

The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions¹, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.

1. Introduction

Deep convolutional neural networks [22, 21] have led to a series of breakthroughs for image classification [21, 50, 40]. Deep networks naturally integrate low/mid/high-level features [50] and classifiers in an end-to-end multi-layer fashion, and the "levels" of features can be enriched by the number of stacked layers (depth). Recent evidence [41, 44] reveals that network depth is of crucial importance, and the leading results [41, 44, 13, 16] on the challenging ImageNet dataset [36] all exploit "very deep" [41] models, with a depth of sixteen [41] to thirty [16]. Many other non-trivial visual recognition tasks [8, 12, 7, 32, 27] have also greatly benefited from very deep models.

¹ http://mscoco.org/dataset/#detections-challenge2015

Figure 1. Training error (left) and test error (right) on CIFAR-10 with 20-layer and 56-layer "plain" networks. The deeper network has higher training error, and thus higher test error. Similar phenomena on ImageNet are presented in Fig. 4.

Driven by the significance of depth, a question arises: Is learning better networks as easy as stacking more layers? An obstacle to answering this question was the notorious problem of vanishing/exploding gradients [1, 9], which hamper convergence from the beginning. This problem, however, has been largely addressed by normalized initialization [23, 9, 37, 13] and intermediate normalization layers [16], which enable networks with tens of layers to start converging for stochastic gradient descent (SGD) with backpropagation [22].

When deeper networks are able to start converging, a degradation problem has been exposed: with the network depth increasing, accuracy gets saturated (which might be unsurprising) and then degrades rapidly. Unexpectedly, such degradation is not caused by overfitting, and adding more layers to a suitably deep model leads to higher training error, as reported in [11, 42] and thoroughly verified by our experiments. Fig. 1 shows a typical example.

The degradation (of training accuracy) indicates that not all systems are similarly easy to optimize. Let us consider a shallower architecture and its deeper counterpart that adds more layers onto it. There exists a solution by construction to the deeper model: the added layers are identity mappings, and the other layers are copied from the learned shallower model. The existence of this constructed solution indicates that a deeper model should produce no higher training error than its shallower counterpart. But experiments show that our current solvers on hand are unable to find solutions that
are comparably good or better than the constructed solution (or unable to do so in feasible time).

Figure 2. Residual learning: a building block.

In this paper, we address the degradation problem by introducing a deep residual learning framework. Instead of hoping each few stacked layers directly fit a desired underlying mapping, we explicitly let these layers fit a residual mapping. Formally, denoting the desired underlying mapping as H(x), we let the stacked nonlinear layers fit another mapping of F(x) := H(x) - x. The original mapping is recast into F(x) + x. We hypothesize that it is easier to optimize the residual mapping than to optimize the original, unreferenced mapping. To the extreme, if an identity mapping were optimal, it would be easier to push the residual to zero than to fit an identity mapping by a stack of nonlinear layers.

The formulation of F(x) + x can be realized by feedforward neural networks with "shortcut connections" (Fig. 2). Shortcut connections [2, 34, 49] are those skipping one or more layers. In our case, the shortcut connections simply perform identity mapping, and their outputs are added to the outputs of the stacked layers (Fig. 2). Identity shortcut connections add neither extra parameters nor computational complexity. The entire network can still be trained end-to-end by SGD with backpropagation, and can be easily implemented using common libraries (e.g., Caffe [19]) without modifying the solvers; a minimal sketch of such a building block is given at the end of this section.

We present comprehensive experiments on ImageNet [36] to show the degradation problem and evaluate our method. We show that: 1) Ou
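The mapping F(x) + x of Fig. 2 translates directly into a few lines of code. The sketch below is illustrative only and is not the Caffe implementation used in the paper: the class name ResidualBlock, the choice of two 3x3 convolutions with batch normalization, the fixed channel width of 64, and the use of PyTorch are all assumptions made for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as nnf


class ResidualBlock(nn.Module):
    """A building block as in Fig. 2: two stacked weight layers learn the
    residual F(x), and the identity shortcut adds x back, giving F(x) + x.
    (Illustrative sketch; layer choices are assumptions, not the paper's Caffe setup.)"""

    def __init__(self, channels):
        super().__init__()
        # Two 3x3 convolutions play the role of the stacked "weight layers" in Fig. 2.
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        identity = x                              # identity shortcut: no extra parameters
        out = nnf.relu(self.bn1(self.conv1(x)))   # first weight layer + ReLU
        out = self.bn2(self.conv2(out))           # second weight layer: this is F(x)
        out = out + identity                      # element-wise addition F(x) + x
        return nnf.relu(out)                      # ReLU after the addition (Fig. 2)


# The block leaves the feature-map shape unchanged, so it can be stacked freely.
x = torch.randn(1, 64, 56, 56)   # one 64-channel 56x56 feature map
y = ResidualBlock(64)(x)
print(y.shape)                   # torch.Size([1, 64, 56, 56])
```

Because the identity shortcut carries no weights, stacking such blocks adds depth without adding parameters beyond those of the corresponding plain (non-residual) stack of layers.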