快速k均值聚类SCI文献:Adaptive-Sampling-for-k-Means-Cluster

整理文档很辛苦,赏杯茶钱您下走!

免费阅读已结束,点击下载阅读编辑剩下 ...

阅读已结束,您可以下载文档离线阅读编辑

资源描述

AdaptiveSamplingfork-MeansClusteringAnkitAggarwal1,AmitDeshpande2,andRaviKannan21IITDelhizenithankit@gmail.com2MicrosoftResearchIndia{amitdesh,kannan}@microsoft.eduAbstract.WeshowthatadaptivelysampledO(k)centersgiveacon-stantfactorbi-criteriaapproximationforthek-meansproblem,withaconstantprobability.Moreover,theseO(k)centerscontainasubsetofkcenterswhichgiveaconstantfactorapproximation,andcanbefoundus-ingLP-basedtechniquesofJainandVazirani[JV01]andCharikaretal.[CGTS02].BoththesealgorithmsrunineffectivelyO(nkd)timeandex-tendtheO(logk)-approximationachievedbythek-means++algorithmofArthurandVassilvitskii[AV07].1Introductionk-meansisapopularobjectivefunctionusedforclusteringproblemsincomputervision,machinelearningandcomputationalgeometry.Thek-meansclusteringproblemongivenndatapointsasksforasetofkcentersthatminimizesthesumofsquareddistancesbetweeneachpointanditsnearestcenter.Towriteitformally,thek-meansproblemasks:GivenasetX⊆Rdofndatapointsandanintegerk0,findasetC⊆Rdofkcentersthatminimizesthefollowingpotentialfunction.φ(C)=x∈Xminc∈Cx−c2WedenotebyφA(C)=x∈Aminc∈Cx−c2thecontributionofpointsinasubsetA⊆X.LetCOPTbethesetofoptimalkcenters.Intheoptimalsolution,eachpointofXisassignedtoitsnearestcenterinCOPT.ThisinducesanaturalpartitiononXasA1∪A2∪···∪Akintodisjointsubsets.Thereisavariantofthek-meansproblemknownasthediscretek-meansproblemwherethecentershavetobepointsfromXitself.Notethattheoptimaofthek-meansproblemanditsdiscretevariantarewithinconstantfactorsofeachother.Thereareothervariantswheretheobjectiveistominimizethesumofp-thpowersofdistancesinsteadofsquares(forp≥1),ortobemoreprecise,x∈Xminc∈Cx−cp1/p.Thep=1caseisknownasthek-medianproblemandthep=∞caseisknownasthek-centerproblem.Moreover,onecanalsoaskthediscretek-meansproblemoverarbitrarymetricspacesinsteadofRd.I.Dinuretal.(Eds.):APPROXandRANDOM2009,LNCS5687,pp.15–28,2009.cSpringer-VerlagBerlinHeidelberg200916A.Aggarwal,A.Deshpande,andR.Kannan1.1PreviousWorkItisNP-hardtosolvethek-meansproblemexactly,evenfork=2[ADHP09],[Das08,KNV08]andevenintheplane[MNV09].Constantfactorapproximationalgorithmsareknownbasedonlinearprogrammingtechniquesusedforfacilitylocationproblemsbuttheirrunningtimeissuper-linearinn[JV01].Kanugoetal.[KMN+04]givea(9+)-approximationvialocalsearchbutinrunningtimeO(n3−d)thathasexponentialdependenceond.Therearepolynomialtimeapproximationschemeswithrunningtimelinearinnanddbutexponentialorworseink[dlVKKR03,HPM04,KSS04,Mat00,Che09].Suchadependenceonkmaywellbeunavoidable,asshowninthecaseofthediscretek-medianproblem[GI03].Ontheotherhand,themostpopularalgorithmforthek-meansproblemisasimpleiterative-refinementheuristicduetoLloyd[Llo82]:startwithkarbitrary(orrandom)centers,computetheclustersdefinedbythem,definethemeansoftheseclustersasthenewcenters,re-computeclustersandrepeat.Lloyd’smethodisfastinpracticebutisguaranteedtoconvergeonlytoalocaloptimum.Intheory,theworst-caserunningtimeofLloyd’sheuristicisexponentialevenintheplane[Vat09];however,aplausibleexplanationforitspopularitycouldbeitspolynomialsmoothedcomplexity[AMR09].Inattemptstobridgethisgapbetweentheoryandpractice,severalrandom-izedalgorithmshavebeenproposedbasedontheideaofsamplingasubsetofpointsascenterstogetaconstantfactorapproximationintimeeffectivelyO(nkd).ThesecenterscouldthenbeusedtoinitializetheLloyd’smethod.MettuandPlaxton[MP02]andOstrovskyetal.[ORSS06]giveconstantfactorapprox-imationsbuttheirresultsdonotworkunconditionallyforalldatasets.Themostrelevanttoourpaperisarandomizedalgorithmcalledk-means++duetoArthurandVassilvitskii[AV07].Theyproposeasimpleadaptivesam-plingscheme(theycallitasD2sampling):ineachstep,pickapointwithprobabilityproportionaltoitscurrentcost(i.e,itssquareddistancetothenear-estcenterpickedsofar)andadditasanewcenter.Thisissimilartoagreedy2-approximationalgorithmforthek-centerproblemthatpicksapointwiththemaximumcostineachstep[Gon85].ArthurandVassilvitskiishowthatadaptivelysampledkcentersgive,inexpectation,anO(logk)-approximationforthek-meansproblem.Thisalsomeans,byMarkovinequality,thatwegetanO(logk)-approximationwithaconstantprobability.Similarsamplingschemeshaveappearedintheliteratureonclusteringofdatastreams[GMM+03,COP03]andonlinefacilitylocation[Mey01].However,thesesamplingschemesarenotassimpleandtheiranalysisisquitedifferent.ArthurandVassilvitskii’sanalysisoftheirO(logk)-approximationreliesheav-ilyonanon-trivialinductionargument(Lemma3.3of[AV07]).Reverseengi-neeringthesameargument,theyshowalowerboundexamplewhereadaptivelysampledkcentersgiveΩ(logk)-approximation,inexpectation.However,theirlowerboundismisleadinginthesensethateventhoughtheexpectederrorforadaptivesamplingonthisexampleishigh,itgivesanO(1)-approximationwithhighprobability.Thestartingpointforourworkwasthefollowingquestion:DoAdaptiveSamplingfork-MeansClustering17adaptivelysampledkcentersalwaysgiveaconstantfactorapproximation,withaconstantprobability?1.2OurResultsInSection2,weextendtheresultsofArthurandVassilvitskiitoshowthatadaptivelysampledO(k)centersgiveaconstantfactorbi-criteriaapproximationforthek-meansproblem,w

1 / 14
下载文档,编辑使用

©2015-2020 m.777doc.com 三七文档.

备案号:鲁ICP备2024069028号-1 客服联系 QQ:2149211541

×
保存成功