机器学习十大算法:kNN概要

整理文档很辛苦,赏杯茶钱您下走!

免费阅读已结束,点击下载阅读编辑剩下 ...

阅读已结束,您可以下载文档离线阅读编辑

资源描述

Chapter8kNN:k-NearestNeighborsMichaelSteinbachandPang-NingTanContents8.1Introduction(1518.2DescriptionoftheAlgorithm(1528.2.1High-LevelDescription(1528.2.2Issues(1538.2.3SoftwareImplementations(1558.3Examples(1558.4AdvancedTopics(1578.5Exercises(158Acknowledgments(159References(1598.1IntroductionOneofthesimplestandrathertrivialclassifiersistheRoteclassifier,whichmemorizestheentiretrainingdataandperformsclassificationonlyiftheattributesofthetestobjectexactlymatchtheattributesofoneofthetrainingobjects.Anobviousproblemwiththisapproachisthatmanytestrecordswillnotbeclassifiedbecausetheydonotexactlymatchanyofthetrainingrecords.Anotherissueariseswhentwoormoretrainingrecordshavethesameattributesbutdifferentclasslabels.Amoresophisticatedapproach,k-nearestneighbor(kNNclassification[10,11,21],findsagroupofkobjectsinthetrainingsetthatareclosesttothetestobject,andbasestheassignmentofalabelonthepredominanceofaparticularclassinthisneighborhood.Thisaddressestheissuethat,inmanydatasets,itisunlikelythatoneobjectwillexactlymatchanother,aswellasthefactthatconflictinginformationabouttheclassofanobjectmaybeprovidedbytheobjectsclosesttoit.Thereareseveralkeyelementsofthisapproach:(ithesetoflabeledobjectstobeusedforevaluatingatestobject’sclass,1(iiadistanceorsimilaritymetricthatcanbeusedtocomputeThisneednotbetheentiretrainingset.151152kNN:k-NearestNeighborstheclosenessofobjects,(iiithevalueofk,thenumberofnearestneighbors,and(ivthemethodusedtodeterminetheclassofthetargetobjectbasedontheclassesanddistancesoftheknearestneighbors.Initssimplestform,kNNcaninvolveassigninganobjecttheclassofitsnearestneighbororofthemajorityofitsnearestneighbors,butavarietyofenhancementsarepossibleandarediscussedbelow.Moregenerally,kNNisaspecialcaseofinstance-basedlearning[1].Thisincludescase-basedreasoning[3],whichdealswithsymbolicdata.ThekNNapproachisalsoanexampleofalazylearningtechnique,thatis,atechniquethatwaitsuntilthequeryarrivestogeneralizebeyondthetrainingdata.AlthoughkNNclassificationisaclassificationtechniquethatiseasytounderstandandimplement,itperformswellinmanysituations.Inparticular,awell-knownresultbyCoverandHart[6]showsthattheclassificationerror2ofthenearestneighborruleisboundedabovebytwicetheoptimalBayeserror3undercertainreasonableassump-tions.Furthermore,theerrorofthegeneralkNNmethodasymptoticallyapproachesthatoftheBayeserrorandcanbeusedtoapproximateit.Also,becauseofitssimplicity,kNNiseasytomodifyformorecomplicatedclassifi-cationproblems.Forinstance,kNNisparticularlywell-suitedformultimodalclasses4aswellasapplicationsinwhichanobjectcanhavemanyclasslabels.Toillustratethelastpoint,fortheassignmentoffunctionstogenesbasedonmicroarrayexpressionprofiles,someresearchersfoundthatkNNoutperformedasupportvectormachine(SVMapproach,whichisamuchmoresophisticatedclassificationscheme[17].TheremainderofthischapterdescribesthebasickNNalgorithm,includingvari-ousissuesthataffectbothclassificationandcomputationalperformance.PointersaregiventoimplementationsofkNN,andexamplesofusingtheWekamachinelearn-ingpackagetoperformnearestneighborclassificationarealsoprovided.Advancedtechniquesarediscussedbrieflyandthischapterconcludeswithafewexercises.8.2DescriptionoftheAlgorithm8.2.1High-LevelDescriptionAlgorithm8.1providesahigh-levelsummaryofthenearest-neighborclassificationmethod.GivenatrainingsetDandatestobjectz,whichisavectorofattributevaluesandhasanunknownclasslabel,thealgorithmcomputesthedistance(orsimilarity2Theclassificationerrorofaclassifieristhepercentageofinstancesitincorrectlyclassifies.3TheBayeserroristheclassificationerrorofaBayesclassifier,thatis,aclassifierthatknowstheunderlyingprobabilitydistributionofthedatawithrespecttoclass,andassignseachdatapointtotheclasswiththehighestprobabilitydensityforthatpoint.Formoredetail,see[9].4Withmultimodalclasses,objectsofaparticularclasslabelareconcentratedinseveraldistinctareasofthedataspace,notjustone.Instatisticalterms,theprobabilitydensityfunctionfortheclassdoesnothaveasingle“bump”likeaGaussian,butrather,hasanumberofpeaks.8.2DescriptionoftheAlgorithm153Algorithm8.1BasickNNAlgorithmInput:D,thesetoftrainingobjects,thetestobject,z,whichisavectorofattributevalues,andL,thesetofclassesusedtolabeltheobjectsOutput:cz∈L,theclassofzforeachobjecty∈Ddo|Computed(z,y,thedistancebetweenzandy;endSelectN⊆D,theset(neighborhoodofkclosesttrainingobjectsforz;cz=argmaxv∈Ly∈NI(v=class(cy;whereI(·isanindicatorfunctionthatreturnsthevalue1ifitsargumentistrueand0otherwise.betweenzandallthetrainingobjectstodetermineitsnearest-neighborlist.Itthenassignsaclasstozbytakingtheclassofthemajorityofneighboringobjects.Tiesarebrokeninanunspecifiedmanner,forexample,randomlyorbytakingthemostfrequentclassinthetrainingset.ThestoragecomplexityofthealgorithmisO(n,wherenisthenumberoftrainingobjects.ThetimecomplexityisalsoO(n,sincethedistanceneedstobecomputedbetweenthetargetandeachtrainingobject.However,thereisnotimetakenfortheconstructionoftheclassificationmodel,forexample,adecisiontreeorsepa-ratinghyperplane.Thus,kNNisdifferentfrommostotherclassificationtechniqueswhichhavemoderatelytoqu

1 / 17
下载文档,编辑使用

©2015-2020 m.777doc.com 三七文档.

备案号:鲁ICP备2024069028号-1 客服联系 QQ:2149211541

×
保存成功