Introduction to data mining-lecture notes chap09_a

整理文档很辛苦,赏杯茶钱您下走!

免费阅读已结束,点击下载阅读编辑剩下 ...

阅读已结束,您可以下载文档离线阅读编辑

资源描述

DataMiningClusterAnalysis:AdvancedConceptsandAlgorithmsLectureNotesforChapter9IntroductiontoDataMiningbyTan,Steinbach,Kumar©Tan,Steinbach,KumarIntroductiontoDataMining4/18/20041©Tan,Steinbach,KumarIntroductiontoDataMining4/18/20042HierarchicalClustering:RevisitedzCreatesnestedclusterszAgglomerativeclusteringalgorithmsvaryintermsofhowtheproximityoftwoclustersarecomputed‹MIN(singlelink):susceptibletonoise/outliers‹MAX/GROUPAVERAGE:maynotworkwellwithnon-globularclusters–CUREalgorithmtriestohandlebothproblemszOftenstartswithaproximitymatrix–Atypeofgraph-basedalgorithm©Tan,Steinbach,KumarIntroductiontoDataMining4/18/20043zUsesanumberofpointstorepresentaclusterzRepresentativepointsarefoundbyselectingaconstantnumberofpointsfromaclusterandthen“shrinking”themtowardthecenteroftheclusterzClustersimilarityisthesimilarityoftheclosestpairofrepresentativepointsfromdifferentclustersCURE:AnotherHierarchicalApproach×שTan,Steinbach,KumarIntroductiontoDataMining4/18/20044CUREzShrinkingrepresentativepointstowardthecenterhelpsavoidproblemswithnoiseandoutlierszCUREisbetterabletohandleclustersofarbitraryshapesandsizes©Tan,Steinbach,KumarIntroductiontoDataMining4/18/20045ExperimentalResults:CUREPicturefromCURE,Guha,Rastogi,Shim.©Tan,Steinbach,KumarIntroductiontoDataMining4/18/20046ExperimentalResults:CUREPicturefromCURE,Guha,Rastogi,Shim.(centroid)(singlelink)©Tan,Steinbach,KumarIntroductiontoDataMining4/18/20047CURECannotHandleDifferingDensitiesOriginalPointsCURE©Tan,Steinbach,KumarIntroductiontoDataMining4/18/20048Graph-BasedClusteringzGraph-Basedclusteringusestheproximitygraph–Startwiththeproximitymatrix–Considereachpointasanodeinagraph–Eachedgebetweentwonodeshasaweightwhichistheproximitybetweenthetwopoints–Initiallytheproximitygraphisfullyconnected–MIN(single-link)andMAX(complete-link)canbeviewedasstartingwiththisgraphzInthesimplestcase,clustersareconnectedcomponentsinthegraph.©Tan,Steinbach,KumarIntroductiontoDataMining4/18/20049Graph-BasedClustering:SparsificationzTheamountofdatathatneedstobeprocessedisdrasticallyreduced–Sparsificationcaneliminatemorethan99%oftheentriesinaproximitymatrix–Theamountoftimerequiredtoclusterthedataisdrasticallyreduced–Thesizeoftheproblemsthatcanbehandledisincreased©Tan,Steinbach,KumarIntroductiontoDataMining4/18/200410Graph-BasedClustering:Sparsification…zClusteringmayworkbetter–Sparsificationtechniqueskeeptheconnectionstothemostsimilar(nearest)neighborsofapointwhilebreakingtheconnectionstolesssimilarpoints.–Thenearestneighborsofapointtendtobelongtothesameclassasthepointitself.–Thisreducestheimpactofnoiseandoutliersandsharpensthedistinctionbetweenclusters.zSparsificationfacilitatestheuseofgraphpartitioningalgorithms(oralgorithmsbasedongraphpartitioningalgorithms.–ChameleonandHypergraph-basedClustering©Tan,Steinbach,KumarIntroductiontoDataMining4/18/200411SparsificationintheClusteringProcess©Tan,Steinbach,KumarIntroductiontoDataMining4/18/200412LimitationsofCurrentMergingSchemeszExistingmergingschemesinhierarchicalclusteringalgorithmsarestaticinnature–MINorCURE:‹mergetwoclustersbasedontheircloseness(orminimumdistance)–GROUP-AVERAGE:‹mergetwoclustersbasedontheiraverageconnectivity©Tan,Steinbach,KumarIntroductiontoDataMining4/18/200413LimitationsofCurrentMergingSchemesClosenessschemeswillmerge(a)and(b)(a)(b)(c)(d)Averageconnectivityschemeswillmerge(c)and(d)©Tan,Steinbach,KumarIntroductiontoDataMining4/18/200414Chameleon:ClusteringUsingDynamicModelingzAdapttothecharacteristicsofthedatasettofindthenaturalclusterszUseadynamicmodeltomeasurethesimilaritybetweenclusters–Mainpropertyistherelativeclosenessandrelativeinter-connectivityofthecluster–Twoclustersarecombinediftheresultingclustersharescertainpropertieswiththeconstituentclusters–Themergingschemepreservesself-similarityzOneoftheareasofapplicationisspatialdata©Tan,Steinbach,KumarIntroductiontoDataMining4/18/200415CharacteristicsofSpatialDataSets•Clustersaredefinedasdenselypopulatedregionsofthespace•Clustershavearbitraryshapes,orientation,andnon-uniformsizes•Differenceindensitiesacrossclustersandvariationindensitywithinclusters•Existenceofspecialartifacts(streaks)andnoiseTheclusteringalgorithmmustaddresstheabovecharacteristicsandalsorequireminimalsupervision.©Tan,Steinbach,KumarIntroductiontoDataMining4/18/200416Chameleon:StepszPreprocessingStep:RepresenttheDatabyaGraph–Givenasetofpoints,constructthek-nearest-neighbor(k-NN)graphtocapturetherelationshipbetweenapointanditsknearestneighbors–Conceptofneighborhoodiscaptureddynamically(evenifregionissparse)zPhase1:Useamultilevelgraphpartitioningalgorithmonthegraphtofindalargenumberofclustersofwell-connectedvertices–Eachclustershouldcontainmostlypointsfromone“true”cluster,i.e.,isasub-clusterofa“real”cluster©Tan,Steinbach,KumarIntroductiontoDataMining4/18/200417Chameleon:Steps…zPhase2:UseHierarchicalAgglomerativeClusteringtomergesub-clusters–Twoclustersarecombinediftheresultingclustersharescertainpropertieswiththeconstituentclusters–Twokeypropertiesusedtomodelclustersimilarity:‹RelativeInterconnectivity:Absol

1 / 37
下载文档,编辑使用

©2015-2020 m.777doc.com 三七文档.

备案号:鲁ICP备2024069028号-1 客服联系 QQ:2149211541

×
保存成功