Deep Residual Learning for Image Recognition

Kaiming He    Xiangyu Zhang    Shaoqing Ren    Jian Sun
Microsoft Research
{kahe, v-xiangz, v-shren, jiansun}@microsoft.com

Abstract

Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers, 8x deeper than VGG nets [41] but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers.

The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions¹, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.

1. Introduction

Deep convolutional neural networks [22, 21] have led to a series of breakthroughs for image classification [21, 50, 40]. Deep networks naturally integrate low/mid/high-level features [50] and classifiers in an end-to-end multi-layer fashion, and the "levels" of features can be enriched by the number of stacked layers (depth). Recent evidence [41, 44] reveals that network depth is of crucial importance, and the leading results [41, 44, 13, 16] on the challenging ImageNet dataset [36] all exploit "very deep" [41] models, with a depth of sixteen [41] to thirty [16]. Many other non-trivial visual recognition tasks [8, 12, 7, 32, 27] have also greatly benefited from very deep models.

¹ http://mscoco.org/dataset/#detections-challenge2015

Figure 1. Training error (left) and test error (right) on CIFAR-10 with 20-layer and 56-layer "plain" networks. The deeper network has higher training error, and thus higher test error. Similar phenomena on ImageNet are presented in Fig. 4.

Driven by the significance of depth, a question arises: Is learning better networks as easy as stacking more layers? An obstacle to answering this question was the notorious problem of vanishing/exploding gradients [1, 9], which hamper convergence from the beginning. This problem, however, has been largely addressed by normalized initialization [23, 9, 37, 13] and intermediate normalization layers [16], which enable networks with tens of layers to start converging for stochastic gradient descent (SGD) with backpropagation [22].

When deeper networks are able to start converging, a degradation problem has been exposed: with the network depth increasing, accuracy gets saturated (which might be unsurprising) and then degrades rapidly. Unexpectedly, such degradation is not caused by overfitting, and adding more layers to a suitably deep model leads to higher training error, as reported in [11, 42] and thoroughly verified by our experiments. Fig. 1 shows a typical example.

The degradation (of training accuracy) indicates that not all systems are similarly easy to optimize. Let us consider a shallower architecture and its deeper counterpart that adds more layers onto it. There exists a solution by construction to the deeper model: the added layers are identity mappings, and the other layers are copied from the learned shallower model. The existence of this constructed solution indicates that a deeper model should produce no higher training error than its shallower counterpart. But experiments show that our current solvers on hand are unable to find solutions that
are comparably good or better than the constructed solution (or unable to do so in feasible time).

Figure 2. Residual learning: a building block.

In this paper, we address the degradation problem by introducing a deep residual learning framework. Instead of hoping each few stacked layers directly fit a desired underlying mapping, we explicitly let these layers fit a residual mapping. Formally, denoting the desired underlying mapping as H(x), we let the stacked nonlinear layers fit another mapping of F(x) := H(x) - x. The original mapping is recast into F(x) + x. We hypothesize that it is easier to optimize the residual mapping than to optimize the original, unreferenced mapping. To the extreme, if an identity mapping were optimal, it would be easier to push the residual to zero than to fit an identity mapping by a stack of nonlinear layers.

The formulation of F(x) + x can be realized by feedforward neural networks with "shortcut connections" (Fig. 2). Shortcut connections [2, 34, 49] are those skipping one or more layers. In our case, the shortcut connections simply perform identity mapping, and their outputs are added to the outputs of the stacked layers (Fig. 2). Identity shortcut connections add neither extra parameters nor computational complexity. The entire network can still be trained end-to-end by SGD with backpropagation, and can be easily implemented using common libraries (e.g., Caffe [19]) without modifying the solvers; a minimal sketch of such a building block is given at the end of this section.

We present comprehensive experiments on ImageNet [36] to show the degradation problem and evaluate our method. We show that: 1) Ou
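The mapping F(x) + x of Fig. 2 translates directly into a few lines of code. The sketch below is illustrative only and is not the Caffe implementation used in the paper: the class name ResidualBlock, the choice of two 3x3 convolutions with batch normalization, the fixed channel width of 64, and the use of PyTorch are all assumptions made for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as nnf


class ResidualBlock(nn.Module):
    """A building block as in Fig. 2: two stacked weight layers learn the
    residual F(x), and the identity shortcut adds x back, giving F(x) + x.
    (Illustrative sketch; layer choices are assumptions, not the paper's Caffe setup.)"""

    def __init__(self, channels):
        super().__init__()
        # Two 3x3 convolutions play the role of the stacked "weight layers" in Fig. 2.
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        identity = x                              # identity shortcut: no extra parameters
        out = nnf.relu(self.bn1(self.conv1(x)))   # first weight layer + ReLU
        out = self.bn2(self.conv2(out))           # second weight layer: this is F(x)
        out = out + identity                      # element-wise addition F(x) + x
        return nnf.relu(out)                      # ReLU after the addition (Fig. 2)


# The block leaves the feature-map shape unchanged, so it can be stacked freely.
x = torch.randn(1, 64, 56, 56)   # one 64-channel 56x56 feature map
y = ResidualBlock(64)(x)
print(y.shape)                   # torch.Size([1, 64, 56, 56])
```

Because the identity shortcut carries no weights, stacking such blocks adds depth without adding parameters beyond those of the corresponding plain (non-residual) stack of layers.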