Introduction to data mining-lecture notes chap09_a

goanfighter
1 ℃
2020-06-25

整理文档很辛苦，赏杯茶钱您下走！

还剩 ... 页未读，继续阅读 >>

免费阅读已结束，点击下载阅读编辑剩下 ... 页

阅读已结束，您可以下载文档离线阅读编辑

资源描述

DataMiningClusterAnalysis:AdvancedConceptsandAlgorithmsLectureNotesforChapter9IntroductiontoDataMiningbyTan,Steinbach,Kumar©Tan,Steinbach,KumarIntroductiontoDataMining4/18/20041©Tan,Steinbach,KumarIntroductiontoDataMining4/18/20042HierarchicalClustering:RevisitedzCreatesnestedclusterszAgglomerativeclusteringalgorithmsvaryintermsofhowtheproximityoftwoclustersarecomputedMIN(singlelink):susceptibletonoise/outliersMAX/GROUPAVERAGE:maynotworkwellwithnon-globularclusters–CUREalgorithmtriestohandlebothproblemszOftenstartswithaproximitymatrix–Atypeofgraph-basedalgorithm©Tan,Steinbach,KumarIntroductiontoDataMining4/18/20043zUsesanumberofpointstorepresentaclusterzRepresentativepointsarefoundbyselectingaconstantnumberofpointsfromaclusterandthen“shrinking”themtowardthecenteroftheclusterzClustersimilarityisthesimilarityoftheclosestpairofrepresentativepointsfromdifferentclustersCURE:AnotherHierarchicalApproach××©Tan,Steinbach,KumarIntroductiontoDataMining4/18/20044CUREzShrinkingrepresentativepointstowardthecenterhelpsavoidproblemswithnoiseandoutlierszCUREisbetterabletohandleclustersofarbitraryshapesandsizes©Tan,Steinbach,KumarIntroductiontoDataMining4/18/20045ExperimentalResults:CUREPicturefromCURE,Guha,Rastogi,Shim.©Tan,Steinbach,KumarIntroductiontoDataMining4/18/20046ExperimentalResults:CUREPicturefromCURE,Guha,Rastogi,Shim.(centroid)(singlelink)©Tan,Steinbach,KumarIntroductiontoDataMining4/18/20047CURECannotHandleDifferingDensitiesOriginalPointsCURE©Tan,Steinbach,KumarIntroductiontoDataMining4/18/20048Graph-BasedClusteringzGraph-Basedclusteringusestheproximitygraph–Startwiththeproximitymatrix–Considereachpointasanodeinagraph–Eachedgebetweentwonodeshasaweightwhichistheproximitybetweenthetwopoints–Initiallytheproximitygraphisfullyconnected–MIN(single-link)andMAX(complete-link)canbeviewedasstartingwiththisgraphzInthesimplestcase,clustersareconnectedcomponentsinthegraph.©Tan,Steinbach,KumarIntroductiontoDataMining4/18/20049Graph-BasedClustering:SparsificationzTheamountofdatathatneedstobeprocessedisdrasticallyreduced–Sparsificationcaneliminatemorethan99%oftheentriesinaproximitymatrix–Theamountoftimerequiredtoclusterthedataisdrasticallyreduced–Thesizeoftheproblemsthatcanbehandledisincreased©Tan,Steinbach,KumarIntroductiontoDataMining4/18/200410Graph-BasedClustering:Sparsification…zClusteringmayworkbetter–Sparsificationtechniqueskeeptheconnectionstothemostsimilar(nearest)neighborsofapointwhilebreakingtheconnectionstolesssimilarpoints.–Thenearestneighborsofapointtendtobelongtothesameclassasthepointitself.–Thisreducestheimpactofnoiseandoutliersandsharpensthedistinctionbetweenclusters.zSparsificationfacilitatestheuseofgraphpartitioningalgorithms(oralgorithmsbasedongraphpartitioningalgorithms.–ChameleonandHypergraph-basedClustering©Tan,Steinbach,KumarIntroductiontoDataMining4/18/200411SparsificationintheClusteringProcess©Tan,Steinbach,KumarIntroductiontoDataMining4/18/200412LimitationsofCurrentMergingSchemeszExistingmergingschemesinhierarchicalclusteringalgorithmsarestaticinnature–MINorCURE:mergetwoclustersbasedontheircloseness(orminimumdistance)–GROUP-AVERAGE:mergetwoclustersbasedontheiraverageconnectivity©Tan,Steinbach,KumarIntroductiontoDataMining4/18/200413LimitationsofCurrentMergingSchemesClosenessschemeswillmerge(a)and(b)(a)(b)(c)(d)Averageconnectivityschemeswillmerge(c)and(d)©Tan,Steinbach,KumarIntroductiontoDataMining4/18/200414Chameleon:ClusteringUsingDynamicModelingzAdapttothecharacteristicsofthedatasettofindthenaturalclusterszUseadynamicmodeltomeasurethesimilaritybetweenclusters–Mainpropertyistherelativeclosenessandrelativeinter-connectivityofthecluster–Twoclustersarecombinediftheresultingclustersharescertainpropertieswiththeconstituentclusters–Themergingschemepreservesself-similarityzOneoftheareasofapplicationisspatialdata©Tan,Steinbach,KumarIntroductiontoDataMining4/18/200415CharacteristicsofSpatialDataSets•Clustersaredefinedasdenselypopulatedregionsofthespace•Clustershavearbitraryshapes,orientation,andnon-uniformsizes•Differenceindensitiesacrossclustersandvariationindensitywithinclusters•Existenceofspecialartifacts(streaks)andnoiseTheclusteringalgorithmmustaddresstheabovecharacteristicsandalsorequireminimalsupervision.©Tan,Steinbach,KumarIntroductiontoDataMining4/18/200416Chameleon:StepszPreprocessingStep:RepresenttheDatabyaGraph–Givenasetofpoints,constructthek-nearest-neighbor(k-NN)graphtocapturetherelationshipbetweenapointanditsknearestneighbors–Conceptofneighborhoodiscaptureddynamically(evenifregionissparse)zPhase1:Useamultilevelgraphpartitioningalgorithmonthegraphtofindalargenumberofclustersofwell-connectedvertices–Eachclustershouldcontainmostlypointsfromone“true”cluster,i.e.,isasub-clusterofa“real”cluster©Tan,Steinbach,KumarIntroductiontoDataMining4/18/200417Chameleon:Steps…zPhase2:UseHierarchicalAgglomerativeClusteringtomergesub-clusters–Twoclustersarecombinediftheresultingclustersharescertainpropertieswiththeconstituentclusters–Twokeypropertiesusedtomodelclustersimilarity:RelativeInterconnectivity:Absol