Stanford Machine Learning Lecture Notes (Complete)

CS229 Lecture notes
Andrew Ng

Supervised learning

Let's start by talking about a few examples of supervised learning problems. Suppose we have a dataset giving the living areas and prices of 47 houses from Portland, Oregon:

    Living area (feet^2)    Price (1000$s)
    2104                    400
    1600                    330
    2400                    369
    1416                    232
    3000                    540
    ...                     ...

We can plot this data:

[Figure: scatter plot of housing prices, price (in $1000) versus living area (square feet).]

Given data like this, how can we learn to predict the prices of other houses in Portland, as a function of the size of their living areas?

To establish notation for future use, we'll use $x^{(i)}$ to denote the "input" variables (living area in this example), also called input features, and $y^{(i)}$ to denote the "output" or target variable that we are trying to predict (price). A pair $(x^{(i)}, y^{(i)})$ is called a training example, and the dataset that we'll be using to learn, a list of $m$ training examples $\{(x^{(i)}, y^{(i)});\ i = 1, \ldots, m\}$, is called a training set. Note that the superscript "$(i)$" in the notation is simply an index into the training set, and has nothing to do with exponentiation. We will also use $\mathcal{X}$ to denote the space of input values, and $\mathcal{Y}$ the space of output values. In this example, $\mathcal{X} = \mathcal{Y} = \mathbb{R}$.

To describe the supervised learning problem slightly more formally, our goal is, given a training set, to learn a function $h : \mathcal{X} \mapsto \mathcal{Y}$ so that $h(x)$ is a "good" predictor for the corresponding value of $y$. For historical reasons, this function $h$ is called a hypothesis. Seen pictorially, the process is therefore like this:

[Figure: a training set is fed to a learning algorithm, which outputs $h$; a new $x$ (living area of a house) is fed to $h$, which outputs a predicted $y$ (predicted price of the house).]

When the target variable that we're trying to predict is continuous, such as in our housing example, we call the learning problem a regression problem. When $y$ can take on only a small number of discrete values (such as if, given the living area, we wanted to predict whether a dwelling is a house or an apartment, say), we call it a classification problem.

Part I: Linear Regression

To make our housing example more interesting, let's consider a slightly richer dataset in which we also know the number of bedrooms in each house:

    Living area (feet^2)    #bedrooms    Price (1000$s)
    2104                    3            400
    1600                    3            330
    2400                    3            369
    1416                    2            232
    3000                    4            540
    ...                     ...          ...

Here, the $x$'s are two-dimensional vectors in $\mathbb{R}^2$. For instance, $x_1^{(i)}$ is the living area of the $i$-th house in the training set, and $x_2^{(i)}$ is its number of bedrooms. (In general, when designing a learning problem, it will be up to you to decide what features to choose, so if you are out in Portland gathering housing data, you might also decide to include other features such as whether each house has a fireplace, the number of bathrooms, and so on. We'll say more about feature selection later, but for now let's take the features as given.)

To perform supervised learning, we must decide how we're going to represent functions/hypotheses $h$ in a computer. As an initial choice, let's say we decide to approximate $y$ as a linear function of $x$:

$$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2$$

Here, the $\theta_i$'s are the parameters (also called weights) parameterizing the space of linear functions mapping from $\mathcal{X}$ to $\mathcal{Y}$. When there is no risk of confusion, we will drop the $\theta$ subscript in $h_\theta(x)$ and write it more simply as $h(x)$. To simplify our notation, we also introduce the convention of letting $x_0 = 1$ (this is the intercept term), so that

$$h(x) = \sum_{i=0}^{n} \theta_i x_i = \theta^T x,$$

where on the right-hand side above we are viewing $\theta$ and $x$ both as vectors, and here $n$ is the number of input variables (not counting $x_0$).

Now, given a training set, how do we pick, or learn, the parameters $\theta$? One reasonable method seems to be to make $h(x)$ close to $y$, at least for the training examples we have. To formalize this, we will define a function that measures, for each value of the $\theta$'s, how close the $h(x^{(i)})$'s are to the corresponding $y^{(i)}$'s. We define the cost function:

$$J(\theta) = \frac{1}{2} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2.$$

If you've seen linear regression before, you may recognize this as the familiar least-squares cost function that gives rise to the ordinary least squares regression model. Whether or not you have seen it previously, let's keep going; we'll eventually show this to be a special case of a much broader family of algorithms.
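To make the hypothesis and cost function concrete, here is a minimal NumPy sketch (our own illustration, not part of the notes); the names `hypothesis`, `cost`, `theta`, `X`, and `y` are ours, and the small data matrix simply reuses the first few rows of the housing table above with an added intercept column.

```python
import numpy as np

def hypothesis(theta, x):
    """Linear hypothesis h_theta(x) = theta^T x.

    x is assumed to already include the intercept term x_0 = 1
    as its first component, following the convention in the notes.
    """
    return theta @ x

def cost(theta, X, y):
    """Least-squares cost J(theta) = 1/2 * sum_i (h_theta(x^(i)) - y^(i))^2.

    X is an (m, n+1) matrix whose rows are training inputs (with x_0 = 1),
    and y is the vector of the m target values.
    """
    residuals = X @ theta - y
    return 0.5 * (residuals @ residuals)

# First three rows of the housing table: intercept column, living area in feet^2;
# prices are in $1000s.
X = np.array([[1.0, 2104.0],
              [1.0, 1600.0],
              [1.0, 2400.0]])
y = np.array([400.0, 330.0, 369.0])

theta = np.zeros(2)
print(cost(theta, X, y))  # J(theta) at the all-zeros starting point
```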
1 LMS algorithm

We want to choose $\theta$ so as to minimize $J(\theta)$. To do so, let's use a search algorithm that starts with some "initial guess" for $\theta$, and that repeatedly changes $\theta$ to make $J(\theta)$ smaller, until hopefully we converge to a value of $\theta$ that minimizes $J(\theta)$. Specifically, let's consider the gradient descent algorithm, which starts with some initial $\theta$ and repeatedly performs the update:

$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta).$$

(This update is simultaneously performed for all values of $j = 0, \ldots, n$.) Here, $\alpha$ is called the learning rate. This is a very natural algorithm that repeatedly takes a step in the direction of steepest decrease of $J$.

In order to implement this algorithm, we have to work out what the partial derivative term on the right-hand side is. Let's first work it out for the case where we have only one training example $(x, y)$, so that we can neglect the sum in the definition of $J$. We have:

$$
\begin{aligned}
\frac{\partial}{\partial \theta_j} J(\theta)
  &= \frac{\partial}{\partial \theta_j} \frac{1}{2} \left( h_\theta(x) - y \right)^2 \\
  &= 2 \cdot \frac{1}{2} \left( h_\theta(x) - y \right) \cdot \frac{\partial}{\partial \theta_j} \left( h_\theta(x) - y \right) \\
  &= \left( h_\theta(x) - y \right) \cdot \frac{\partial}{\partial \theta_j} \left( \sum_{i=0}^{n} \theta_i x_i - y \right) \\
  &= \left( h_\theta(x) - y \right) x_j
\end{aligned}
$$

For a single training example, this gives the update rule:

$$\theta_j := \theta_j + \alpha \left( y^{(i)} - h_\theta(x^{(i)}) \right) x_j^{(i)}.$$

The rule is called the LMS update rule (LMS stands for "least mean squares"), and is also known as the Widrow-Hoff learning rule. This rule has several properties that seem natural and intuitive. For instance, the magnitude of the update is proportional to the error term $(y^{(i)} - h_\theta(x^{(i)}))$; thus, if we encounter a training example on which our prediction nearly matches the actual value of $y^{(i)}$, there is little need to change the parameters; in contrast, a larger change to the parameters will be made if our prediction $h_\theta(x^{(i)})$ has a large error (i.e., if it is very far from $y^{(i)}$).

We'd derived the LMS rule for when there was only a single training example.
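As a concrete illustration of this update, here is a small NumPy sketch (again our own, not from the notes) that applies the LMS rule one training example at a time to the housing data above; rescaling the living areas to thousands of square feet and the particular learning rate are illustrative choices, not something the notes prescribe.

```python
import numpy as np

def lms_step(theta, x_i, y_i, alpha):
    """One LMS (Widrow-Hoff) update on a single training example:
    theta_j := theta_j + alpha * (y^(i) - h_theta(x^(i))) * x_j^(i), for all j.
    """
    error = y_i - theta @ x_i           # (y^(i) - h_theta(x^(i)))
    return theta + alpha * error * x_i  # update all components simultaneously

# Housing data: intercept column plus living area in thousands of feet^2;
# prices in $1000s.
X = np.array([[1.0, 2.104],
              [1.0, 1.600],
              [1.0, 2.400],
              [1.0, 1.416],
              [1.0, 3.000]])
y = np.array([400.0, 330.0, 369.0, 232.0, 540.0])

theta = np.zeros(2)
alpha = 0.05
for _ in range(1000):                   # repeated sweeps over the training set
    for x_i, y_i in zip(X, y):
        theta = lms_step(theta, x_i, y_i, alpha)

print(theta)  # hovers near the least-squares fit: price ≈ theta_0 + theta_1 * area
```

With a fixed learning rate the parameters oscillate slightly around the least-squares solution rather than settling exactly, which is one reason the learning rate is often decreased over time in practice.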
