Kernel Ridge Regression

Max Welling
Department of Computer Science
University of Toronto
10 King's College Road
Toronto, M5S 3G5 Canada
welling@cs.toronto.edu

Abstract

This is a note to explain kernel ridge regression.

1 Ridge Regression

Possibly the most elementary algorithm that can be kernelized is ridge regression. Here our task is to find a linear function that models the dependencies between covariates $\{x_i\}$ and response variables $\{y_i\}$, both continuous. The classical way to do that is to minimize the quadratic cost,

$$C(w) = \frac{1}{2}\sum_i (y_i - w^T x_i)^2 \qquad (1)$$

However, if we are going to work in feature space, where we replace $x_i \rightarrow \Phi(x_i)$, there is a clear danger that we overfit. Hence we need to regularize. This is an important topic that will return in future classes.

A simple yet effective way to regularize is to penalize the norm of $w$. This is sometimes called "weight decay". It remains to be determined how to choose $\lambda$; the most common approach is to use cross-validation or leave-one-out estimates. The total cost function hence becomes,

$$C = \frac{1}{2}\sum_i (y_i - w^T x_i)^2 + \frac{1}{2}\lambda\,\|w\|^2 \qquad (2)$$

which needs to be minimized. Taking derivatives and equating them to zero gives,

$$\sum_i (y_i - w^T x_i)\,x_i = \lambda w \;\;\Rightarrow\;\; w = \Big(\lambda I + \sum_i x_i x_i^T\Big)^{-1}\Big(\sum_j y_j x_j\Big) \qquad (3)$$

We see that the regularization term helps to stabilize the inverse numerically by bounding the smallest eigenvalues away from zero.

2 Kernel Ridge Regression

We now replace all data-cases with their feature vectors: $x_i \rightarrow \Phi_i = \Phi(x_i)$. In this case the number of dimensions can be much higher, or even infinitely higher, than the number of data-cases. There is a neat trick that allows us to perform the inverse above in the smaller of the two possible spaces: the dimension of the feature space or the number of data-cases. The trick is given by the following identity,

$$(P^{-1} + B^T R^{-1} B)^{-1} B^T R^{-1} = P B^T (B P B^T + R)^{-1} \qquad (4)$$

Now note that if $B$ is not square, the two inverses are performed in spaces of different dimensionality. To apply this to our case we define the matrix $\Phi$ with entries $\Phi_{ai}$ (feature $a$ of data-case $i$) and the vector $y$ with entries $y_i$. The solution is then given by,

$$w = (\lambda I_d + \Phi\Phi^T)^{-1}\Phi y = \Phi(\Phi^T\Phi + \lambda I_n)^{-1} y \qquad (5)$$

This equation can be rewritten as $w = \sum_i \alpha_i \Phi(x_i)$ with $\alpha = (\Phi^T\Phi + \lambda I_n)^{-1} y$. This is an equation that will be a recurrent theme, and it can be interpreted as: the solution $w$ must lie in the span of the data-cases, even if the dimensionality of the feature space is much larger than the number of data-cases. This seems intuitively clear, since the algorithm is linear in feature space.

We finally need to show that we never actually need access to the feature vectors, which could be infinitely long (which would be rather impractical). What we need in practice is the predicted value for a new test point, $x$. This is computed by projecting it onto the solution $w$,

$$y = w^T\Phi(x) = y^T(\Phi^T\Phi + \lambda I_n)^{-1}\Phi^T\Phi(x) = y^T(K + \lambda I_n)^{-1}\kappa(x) \qquad (6)$$

where $K_{ij} = \Phi(x_i)^T\Phi(x_j)$ and $\kappa_i(x) = \Phi(x_i)^T\Phi(x) = K(x_i, x)$. The important message here is of course that we only need access to the kernel $K$.

We can now add a bias to the whole story by adding one more, constant feature to $\Phi$: $\Phi_0 = 1$. The value of $w_0$ then represents the bias, since

$$w^T\Phi_i = \sum_a w_a \Phi_{ai} + w_0 \qquad (7)$$

Hence, the story goes through unchanged.
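The procedure above is straightforward to put into code. The following NumPy sketch (not part of the original note; the RBF kernel and the values of the bandwidth $\gamma$ and regularizer $\lambda$ are illustrative choices) computes $\alpha = (K + \lambda I_n)^{-1} y$ as in equation (5) and predicts a new point via $\kappa(x)^T\alpha$, as in equation (6), so only kernel evaluations are ever needed.

import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # K[i, j] = exp(-gamma * ||A_i - B_j||^2)
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * sq)

def fit_kernel_ridge(X, y, lam, gamma=1.0):
    # alpha = (K + lambda * I_n)^{-1} y : an n x n solve, independent of the feature dimension
    K = rbf_kernel(X, X, gamma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def predict(X_train, alpha, X_new, gamma=1.0):
    # y(x) = kappa(x)^T alpha  with  kappa_i(x) = K(x_i, x)
    return rbf_kernel(X_new, X_train, gamma) @ alpha

# toy usage on a noisy sine curve
rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(50, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(50)
alpha = fit_kernel_ridge(X, y, lam=0.1, gamma=0.5)
print(predict(X, alpha, np.array([[0.0], [1.5]]), gamma=0.5))

Note that the fit only requires solving an $n \times n$ linear system in the number of data-cases, exactly as promised by the identity (4).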
3 An alternative derivation

Instead of optimizing the cost function above we can introduce Lagrange multipliers into the problem. This will have the effect that the derivation goes along similar lines as the SVM case. We introduce new variables, $\xi_i = y_i - w^T\Phi_i$, and rewrite the objective as the following constrained QP,

$$\text{minimize}_{w,\xi} \quad L_P = \sum_i \xi_i^2$$
$$\text{subject to} \quad y_i - w^T\Phi_i = \xi_i \;\; \forall i \qquad (8)$$
$$\qquad\qquad\quad \|w\| \le B \qquad (9)$$

This leads to the Lagrangian,

$$L_P = \sum_i \xi_i^2 + \sum_i \beta_i\,[y_i - w^T\Phi_i - \xi_i] + \lambda(\|w\|^2 - B^2) \qquad (10)$$

Two of the KKT conditions tell us that at the solution we have:

$$2\xi_i = \beta_i \;\; \forall i, \qquad 2\lambda w = \sum_i \beta_i \Phi_i \qquad (11)$$

Plugging these back into the Lagrangian, we obtain the dual Lagrangian,

$$L_D = \sum_i \Big(-\tfrac{1}{4}\beta_i^2 + \beta_i y_i\Big) - \frac{1}{4\lambda}\sum_{ij}\beta_i\beta_j K_{ij} - \lambda B^2 \qquad (12)$$

We now redefine $\alpha_i = \beta_i/(2\lambda)$ to arrive at the following dual optimization problem,

$$\text{maximize}_{\alpha,\lambda} \quad -\lambda^2\sum_i \alpha_i^2 + 2\lambda\sum_i \alpha_i y_i - \lambda\sum_{ij}\alpha_i\alpha_j K_{ij} - \lambda B^2 \qquad \text{s.t.} \;\; \lambda \ge 0 \qquad (13)$$

Taking derivatives w.r.t. $\alpha$ gives precisely the solution we had already found,

$$\alpha^* = (K + \lambda I)^{-1} y \qquad (14)$$

Formally we also need to maximize over $\lambda$. However, different choices of $\lambda$ correspond to different choices for $B$. Either $\lambda$ or $B$ should be chosen using cross-validation or some other measure, so we could as well vary $\lambda$ in this process.

One big disadvantage of ridge regression is that we don't have sparseness in the $\alpha$ vector, i.e. there is no concept of support vectors. Sparseness is useful because when we test a new example we only have to sum over the support vectors, which is much faster than summing over the entire training set. In the SVM the sparseness was born out of the inequality constraints, because the complementary slackness conditions told us that if a constraint was inactive, then its multiplier was zero. There is no such effect here.
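As a quick numerical sanity check (not taken from the note; the data, dimensions and value of $\lambda$ below are arbitrary), the following sketch verifies that for a linear kernel $K = \Phi^T\Phi$ the primal solution of equation (5) and the dual solution of equation (14) give the same weight vector, which indeed lies in the span of the data-cases.

import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((30, 5))        # n = 30 data-cases (rows), d = 5 features: X plays the role of Phi^T
y = X @ rng.standard_normal(5) + 0.05 * rng.standard_normal(30)
lam = 0.5

# primal solution, equation (5): w = (lambda I_d + Phi Phi^T)^{-1} Phi y
w_primal = np.linalg.solve(lam * np.eye(5) + X.T @ X, X.T @ y)

# dual solution, equation (14) with a linear kernel K = X X^T: alpha = (K + lambda I_n)^{-1} y
alpha = np.linalg.solve(X @ X.T + lam * np.eye(30), y)
w_dual = X.T @ alpha                    # w = sum_i alpha_i x_i lies in the span of the data-cases

print(np.allclose(w_primal, w_dual))    # True: both routes recover the same weight vector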