Machine Learning

Lecture Notes 1: Basic Theory

Professor: Zhihua Zhang
Scribe: Cheng Chen, Luo Luo, Cong Xie

1 Introduction

In machine learning, data is typically expressed in matrix form. Suppose we have $n$ samples and $p$ variables (or features). Then we have
$$X = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{np} \end{pmatrix} \in \mathbb{R}^{n \times p}.$$
The $i$-th sample can be denoted as $X_i = (x_{i1}, x_{i2}, \dots, x_{ip})^T$.

Machine learning mainly solves the following problems:

(1) Dimension Reduction: Dimension reduction is the process of reducing the number of random variables (or features) under consideration. Formally, given $X_i \in \mathbb{R}^p$, we want to find $Z_i \in \mathbb{R}^q$ ($q \ll p$) to represent $X_i$. If we use a linear transformation, then we need to find a matrix $A$ such that $Z_i = A X_i$; note that $A$ should have full row rank. If we use a nonlinear transformation, then we need to find a nonlinear function $f$ such that $Z_i = f(X_i)$.

(2) Clustering: Clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters). We can view the $n$ samples as $n$ points, and our objective is to cluster them into $k$ clusters.

(3) Classification: Classification is the problem of identifying to which of a set of categories a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known. Formally, in the training set we have a label $Y_i$ for each $X_i$, where $Y_i \in \mathcal{C}$ and $\mathcal{C}$ is a non-empty finite set. If $Y_i \in \{-1, 1\}$ or $\{0, 1\}$, it is a binary classification problem. If $Y_i \in \{1, 2, \dots, k\}$, it is a multi-class classification problem. There are also problems in which one observation belongs to more than one category; these are called multi-label or multi-output classification.

(4) Regression: Regression is the analogous problem in which the label is continuous, $Y_i \in \mathbb{R}$.

(5) Ranking: also called isotonic regression (IR). Isotonic regression involves finding a weighted least-squares fit $x \in \mathbb{R}^n$ to a vector $a \in \mathbb{R}^n$ with weight vector $w \in \mathbb{R}^n$, subject to a set of non-contradictory constraints of the kind $x_i \ge x_j$.

Note that (1) and (2) are unsupervised learning, while (3), (4), and (5) are supervised learning. Unsupervised learning tries to find hidden structure in unlabelled data. Supervised learning is the machine learning task of inferring a function from labelled training data.

For supervised learning, the data is usually split into two or three parts.

(1) Training data: a set of examples used for learning, that is, to fit the parameters (e.g., the weights of a neural network) of the model.

(2) Validation data: sometimes we also need a validation set to tune the model, for example to choose the number of hidden units in a neural network or for pruning a decision tree. It is usually used to prevent overfitting and enhance the generalization ability.

(3) Test data: this data set is used only for testing the final solution, in order to confirm the actual performance.

1.1 Frequentist's View vs. Bayesian View

1.1.1 Frequentist's View

The frequentist approach views the model parameters as unknown constants and estimates them by matching the model to the training data using an appropriate metric.

Example 1.1 Suppose we have $n$ pairs of samples $\{(x_i, y_i)\}_{i=1}^n$, $x_i \in \mathbb{R}^p$, $y_i \in \mathbb{R}$, and we want to fit a linear function $x_i^T a$ (more strictly, it should be $x_i^T a + b$, or we can include a constant variable 1 in $x_i$) to predict $y_i$.

Using least squares, we have the loss function $L = \sum_{i=1}^n (y_i - x_i^T a)^2$, where $a$ is an unknown fixed parameter. We can solve for $a$ by minimizing the loss function.

Using maximum likelihood estimation, let $y_i \sim N(x_i^T a, \sigma^2)$, namely,
$$p(y_i \mid x_i) = \frac{1}{(2\pi\sigma^2)^{1/2}} \, e^{-\frac{(y_i - x_i^T a)^2}{2\sigma^2}}.$$
So the log-likelihood is (assuming the samples are independent)
$$l = \log \prod_{i=1}^n p(y_i \mid x_i) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - x_i^T a)^2.$$
We can solve for $a$ by maximizing the joint likelihood. Under the above conditions, one can prove that maximum likelihood estimation is the same as least squares: with $\sigma^2$ held fixed, maximizing $l$ in $a$ is exactly minimizing $\sum_{i=1}^n (y_i - x_i^T a)^2$.
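To make the equivalence concrete, here is a minimal numerical sketch (not part of the original notes; the sizes $n$, $p$, the true parameter, and the noise level are made up for illustration). It fits $a$ once by the closed-form least-squares solution and once by numerically maximizing the Gaussian log-likelihood; the two estimates agree.

```python
import numpy as np
from scipy.optimize import minimize

# Synthetic data; n, p, a_true, and sigma are made up for this illustration.
rng = np.random.default_rng(0)
n, p = 100, 3
X = rng.normal(size=(n, p))
a_true = np.array([1.0, -2.0, 0.5])
sigma = 0.3
y = X @ a_true + sigma * rng.normal(size=n)

# Least squares: minimize sum_i (y_i - x_i^T a)^2; closed form a = (X^T X)^{-1} X^T y.
a_ls = np.linalg.solve(X.T @ X, X.T @ y)

# Maximum likelihood under y_i ~ N(x_i^T a, sigma^2), with sigma^2 held fixed:
# the negative log-likelihood equals the squared loss up to an additive constant
# and a positive scale factor.
def neg_log_lik(a):
    r = y - X @ a
    return 0.5 * n * np.log(2 * np.pi * sigma**2) + (r @ r) / (2 * sigma**2)

a_mle = minimize(neg_log_lik, x0=np.zeros(p)).x

print(np.allclose(a_ls, a_mle, atol=1e-5))  # True: MLE coincides with least squares
```

The agreement is no accident: since the log-likelihood differs from $-L$ only by a constant and a positive factor $1/(2\sigma^2)$, the two objectives have the same maximizer/minimizer in $a$.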
1.1.2 Bayesian View

The Bayesian approach views the model parameters as random variables and estimates them by using Bayes' theorem.

Example 1.2 Let's continue Example 1.1 and let $y_i \sim N(x_i^T a, \sigma^2)$ again. Here $a$ and $\sigma$ are random variables, not constants. Let $a \sim N(0, \sigma_0^2 I)$ and $\sigma^2 \sim \Gamma(\alpha, \beta)$. Our interest is the posterior probability $P(a \mid x_i, y_i) \propto P(x_i, y_i \mid a) P(a)$. We can use maximum a posteriori estimation or Bayesian estimation to solve for $a$.

1.2 Parametrics vs. Nonparametrics

In a parametric model, the number of parameters is fixed once and for all, independent of the number of training data. In a nonparametric model, the number of parameters can change with the number of training data.

Example 1.3 In the nearest-neighbor method, the number of parameters is the number of training samples, so this model is nonparametric. In logistic regression, the number of parameters is the dimension of the training samples, so this model is parametric.

2 Random Variables and Basic Properties

2.1 Random Variables

Definition 2.1 Let $X = (X_1, \dots, X_p)^T$ be a random vector. The cumulative distribution function (C.D.F.) associated with $X$ is
$$F_X(x) = \Pr(X \le x) = \Pr(X_1 \le x_1, \dots, X_p \le x_p).$$

There are two kinds of random variables: absolutely continuous and discrete.

2.1.1 Absolutely Continuous Random Variables

Definition 2.2 $X$ is an absolutely continuous random variable if there exists a probability density function (P.D.F.) $f_X(x)$ such that
$$F_X(x) = \int_{-\infty}^{x} f_X(u) \, du,$$
where $F_X(\infty) = \int_{-\infty}^{+\infty} f_X(u) \, du = 1$. More generally, $P(X \in D) = \int_D f_X(u) \, du$ for $D \subseteq \mathbb{R}^p$.

2.1.2 Discrete Random Variables

Definition 2.3 $X$ is a discrete random variable if there exists a probability mass function (P.M.F.) such that $X$ takes a countable (or finite) set of points $\{x_j,\ j = 1, \dots\}$, on which the P.M.F. is
$$f_X(x_j) = P(X = x_j), \qquad f_X(x) = 0 \ \text{otherwise},$$
where $P(X \in D) = \sum_{x_j \in D} f_X(x_j)$ for $D \subseteq \mathbb{R}^p$.

2.1.3 Support Set

Definition 2.4 The support set is $S = \{x \in \mathbb{R}^p : f_X(x) > 0\}$.

2.1.4 Marginal and Conditional Distribution

For $x = (x_1; x_2)$ where $x_1 \in \mathbb{R}^k$ and $x_2 \in \mathbb{R}^{p-k}$:

Definition 2.5 Marginal distribution: the marginal C.D.F. of $X_1$ is
$$\Pr(X_1 \le x_1) = F_X(x_1, \dots, x_k, \infty_{k+1}, \dots, \infty_p),$$
i.e., the joint C.D.F. evaluated with the remaining $p - k$ coordinates sent to $+\infty$.
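The following small sketch (not from the notes; the joint P.M.F. values are made up) illustrates Definitions 2.3–2.5 for a discrete random vector: the P.M.F. sums to one, the support set collects the points with positive mass, and the marginal of $X_1$ is obtained by summing out $X_2$, the discrete counterpart of sending the remaining coordinates to infinity in the joint C.D.F.

```python
import numpy as np

# Made-up joint P.M.F. of a discrete random vector X = (X1, X2),
# with X1 in {0, 1, 2} (rows) and X2 in {0, 1} (columns).
pmf = np.array([[0.10, 0.20],
                [0.30, 0.00],
                [0.15, 0.25]])
assert np.isclose(pmf.sum(), 1.0)  # a valid P.M.F. sums to 1 (Definition 2.3)

# Support set S = {x : f_X(x) > 0}  (Definition 2.4)
support = list(zip(*np.nonzero(pmf)))
print(support)  # [(0,0), (0,1), (1,0), (2,0), (2,1)] -- (1,1) excluded, its mass is 0

# Marginal P.M.F. of X1: sum out X2  (Definition 2.5, discrete analogue)
pmf_x1 = pmf.sum(axis=1)       # [0.30, 0.30, 0.40]

# Marginal C.D.F. of X1: Pr(X1 <= x1) = F_X(x1, infinity)
cdf_x1 = np.cumsum(pmf_x1)     # [0.30, 0.60, 1.00]
print(cdf_x1)
```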