Deep Sparse Rectifier Neural Networks

Xavier Glorot
DIRO, Université de Montréal, Montreal, QC, Canada
glorotxa@iro.umontreal.ca

Antoine Bordes
Heudiasyc, UMR CNRS 6599, UTC, Compiègne, France
and DIRO, Université de Montréal, Montreal, QC, Canada
antoine.bordes@hds.utc.fr

Yoshua Bengio
DIRO, Université de Montréal, Montreal, QC, Canada
bengioy@iro.umontreal.ca

Appearing in Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS) 2011, Fort Lauderdale, FL, USA. Volume 15 of JMLR: W&CP 15. Copyright 2011 by the authors.

Abstract

While logistic sigmoid neurons are more biologically plausible than hyperbolic tangent neurons, the latter work better for training multi-layer neural networks. This paper shows that rectifying neurons are an even better model of biological neurons and yield equal or better performance than hyperbolic tangent networks in spite of the hard non-linearity and non-differentiability at zero, creating sparse representations with true zeros, which seem remarkably suitable for naturally sparse data. Even though they can take advantage of semi-supervised setups with extra unlabeled data, deep rectifier networks can reach their best performance without requiring any unsupervised pre-training on purely supervised tasks with large labeled datasets. Hence, these results can be seen as a new milestone in the attempts at understanding the difficulty in training deep but purely supervised neural networks, and closing the performance gap between neural networks learnt with and without unsupervised pre-training.

1 Introduction

Many differences exist between the neural network models used by machine learning researchers and those used by computational neuroscientists. This is in part because the objective of the former is to obtain computationally efficient learners that generalize well to new examples, whereas the objective of the latter is to abstract out neuroscientific data while obtaining explanations of the principles involved, providing predictions and guidance for future biological experiments. Areas where both objectives coincide are therefore particularly worthy of investigation, pointing towards computationally motivated principles of operation in the brain that can also enhance research in artificial intelligence. In this paper we show that two common gaps between computational neuroscience models and machine learning neural network models can be bridged by using the following piecewise-linear activation, max(0, x), called the rectifier (or hinge) activation function. Experimental results will show engaging training behavior of this activation function, especially for deep architectures (see Bengio (2009) for a review), i.e., where the number of hidden layers in the neural network is 3 or more.

Recent theoretical and empirical work in statistical machine learning has demonstrated the importance of learning algorithms for deep architectures. This is in part inspired by observations of the mammalian visual cortex, which consists of a chain of processing elements, each of which is associated with a different representation of the raw visual input. This is particularly clear in the primate visual system (Serre et al., 2007), with its sequence of processing stages: detection of edges, primitive shapes, and moving up to gradually more complex visual shapes. Interestingly, it was found that the features learned in deep architectures resemble those observed in the first two of these stages (in areas V1 and V2 of visual cortex) (Lee et al., 2008), and that they become increasingly invariant to factors of variation (such as camera movement) in higher layers (Goodfellow et al., 2009).

Regarding the training of deep networks, something that can be considered a breakthrough happened in 2006, with the introduction of Deep Belief Networks (Hinton et al., 2006), and more generally the idea of initializing each layer by unsupervised learning (Bengio et al., 2007; Ranzato et al., 2007). Some authors have tried to understand why this unsupervised procedure helps (Erhan et al., 2010), while others have investigated why the original training procedure for deep neural networks failed (Bengio and Glorot, 2010). From the machine learning point of view, this paper brings additional results to these lines of investigation.
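For concreteness, the rectifier activation introduced above is simply an elementwise max(0, x). The following is a minimal NumPy sketch, where the layer sizes, initialization, and names are illustrative assumptions rather than the paper's experimental settings, showing how rectifier units naturally yield sparse hidden representations with exact zeros:

```python
import numpy as np

def rectifier(x):
    """Rectifier (hinge) activation: elementwise max(0, x)."""
    return np.maximum(0.0, x)

# Hypothetical sizes and initialization, for illustration only.
rng = np.random.default_rng(0)
W = rng.normal(scale=0.01, size=(784, 256))  # input-to-hidden weights
b = np.zeros(256)                            # hidden biases

x = rng.random(784)          # a dummy 784-dimensional input
h = rectifier(x @ W + b)     # hidden code: non-negative, with true zeros

# Roughly half the units land at exactly zero, giving a sparse code.
print(f"fraction of exact zeros: {np.mean(h == 0.0):.2f}")
```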
We propose to explore the use of rectifying non-linearities as alternatives to the hyperbolic tangent or sigmoid in deep artificial neural networks, in addition to using an L1 regularizer on the activation values to promote sparsity and prevent potential numerical problems with unbounded activations. Nair and Hinton (2010) present promising results on the influence of such units in the context of Restricted Boltzmann Machines, compared to logistic sigmoid activations, on image classification tasks. Our work extends this to the case of pre-training using denoising auto-encoders (Vincent et al., 2008) and provides an extensive empirical comparison of the rectifying activation function against the hyperbolic tangent on image classification benchmarks, as well as an original derivation for the text application of sentiment analysis.

Our experiments on image and text data indicate that training proceeds better when the artificial neurons are either off or operating mostly in a linear regime. Surprisingly, rectifying activation allows deep networks to achieve their best performance without unsupervised pre-training. Hence, our work is a new contribution to the trend of understanding and closing the performance gap between deep networks learnt with and without unsupervised pre-training (Erhan et al., 2010; Bengio and Glorot, 2010). Still, rectifier networks can benefit from unsupervised pre-training in the context of semi-supervised learning.
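As a sketch of the L1 activation regularizer mentioned above, the penalty simply adds the sum of absolute hidden activations to the supervised loss. The penalty weight and names below are illustrative assumptions; the paper does not fix them in this section:

```python
import numpy as np

def l1_activation_penalty(h, lam=1e-4):
    """Sum of absolute hidden activations, scaled by a penalty weight.

    `lam` is a hypothetical value chosen for illustration; in practice
    it would be tuned on validation data.
    """
    return lam * np.sum(np.abs(h))

# For rectifier units h is non-negative, so |h| = h and the penalty's
# (sub)gradient with respect to every active unit is the constant lam,
# which pushes weakly active units toward exactly zero.
rng = np.random.default_rng(1)
h = np.maximum(0.0, rng.normal(size=256))   # dummy rectified activations
penalty = l1_activation_penalty(h)          # added to the supervised loss
print(f"L1 activation penalty: {penalty:.6f}")
```

The total training objective would then be the supervised loss plus this term, which both encourages sparsity and bounds the otherwise unbounded rectifier activations.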