Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun

arXiv:1506.01497v3 [cs.CV] 6 Jan 2016

Abstract: State-of-the-art object detection networks depend on region proposal algorithms to hypothesize object locations. Advances like SPPnet [1] and Fast R-CNN [2] have reduced the running time of these detection networks, exposing region proposal computation as a bottleneck. In this work, we introduce a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals. An RPN is a fully convolutional network that simultaneously predicts object bounds and objectness scores at each position. The RPN is trained end-to-end to generate high-quality region proposals, which are used by Fast R-CNN for detection. We further merge RPN and Fast R-CNN into a single network by sharing their convolutional features: using the recently popular terminology of neural networks with “attention” mechanisms, the RPN component tells the unified network where to look. For the very deep VGG-16 model [3], our detection system has a frame rate of 5 fps (including all steps) on a GPU, while achieving state-of-the-art object detection accuracy on PASCAL VOC 2007, 2012, and MS COCO datasets with only 300 proposals per image. In ILSVRC and COCO 2015 competitions, Faster R-CNN and RPN are the foundations of the 1st-place winning entries in several tracks. Code has been made publicly available.

Index Terms: Object Detection, Region Proposal, Convolutional Neural Network.

S. Ren is with University of Science and Technology of China, Hefei, China. This work was done when S. Ren was an intern at Microsoft Research. Email: sqren@mail.ustc.edu.cn
K. He and J. Sun are with Visual Computing Group, Microsoft Research. E-mail: {kahe,jiansun}@microsoft.com
R. Girshick is with Facebook AI Research. The majority of this work was done when R. Girshick was with Microsoft Research. E-mail: rbg@fb.com

1 INTRODUCTION

Recent advances in object detection are driven by the success of region proposal methods (e.g., [4]) and region-based convolutional neural networks (R-CNNs) [5]. Although region-based CNNs were computationally expensive as originally developed in [5], their cost has been drastically reduced thanks to sharing convolutions across proposals [1], [2]. The latest incarnation, Fast R-CNN [2], achieves near real-time rates using very deep networks [3], when ignoring the time spent on region proposals. Now, proposals are the test-time computational bottleneck in state-of-the-art detection systems.

Region proposal methods typically rely on inexpensive features and economical inference schemes. Selective Search [4], one of the most popular methods, greedily merges superpixels based on engineered low-level features. Yet when compared to efficient detection networks [2], Selective Search is an order of magnitude slower, at 2 seconds per image in a CPU implementation. EdgeBoxes [6] currently provides the best tradeoff between proposal quality and speed, at 0.2 seconds per image. Nevertheless, the region proposal step still consumes as much running time as the detection network.

One may note that fast region-based CNNs take advantage of GPUs, while the region proposal methods used in research are implemented on the CPU, making such runtime comparisons inequitable. An obvious way to accelerate proposal computation is to re-implement it for the GPU. This may be an effective engineering solution, but re-implementation ignores the down-stream detection network and therefore misses important opportunities for sharing computation.

In this paper, we show that an algorithmic change (computing proposals with a deep convolutional neural network) leads to an elegant and effective solution where proposal computation is nearly cost-free given the detection network’s computation. To this end, we introduce novel Region Proposal Networks (RPNs) that share convolutional layers with state-of-the-art object detection networks [1], [2]. By sharing convolutions at test-time, the marginal cost for computing proposals is small (e.g., 10 ms per image).

Our observation is that the convolutional feature maps used by region-based detectors, like Fast R-CNN, can also be used for generating region proposals. On top of these convolutional features, we construct an RPN by adding a few additional convolutional layers that simultaneously regress region bounds and objectness scores at each location on a regular grid. The RPN is thus a kind of fully convolutional network (FCN) [7] and can be trained end-to-end specifically for the task of generating detection proposals.

RPNs are designed to efficiently predict region proposals with a wide range of scales and aspect ratios. In contrast to prevalent methods [8], [9], [1], [2] that use pyramids of images (Figure 1, a) or pyramids of filters (Figure 1, b), we introduce novel “anchor” boxes that serve as references at multiple scales and aspect ratios. Our scheme can be thought of as a pyramid of regression references (Figure 1, c), which avoids enumerating images or filters of multiple scales or aspect ratios. This model performs well when trained and tested using single-scale images and thus benefits running speed.

[Figure 1 panels: (a) multiple scaled images, (b) multiple filter sizes, (c) multiple references; each panel shows an image and a feature map.]
Figure 1: Different schemes for addressing multiple scales and sizes. (a) Pyramids of images and feature maps are built, and the classifier is run at all scales. (b) Pyramids of filters with multiple scales/sizes are run on the feature map. (c) We use pyramids of reference boxes in the regression functions.

To unify RPNs with Fast R-CNN [2] object detection networks, we propose a training scheme that