PatternRecognitionChapter3DataClustering(PatternRecognitionConcepts,MethodsandApplicationsJ.P.MarquesdeSaSpringer)Chapter3DataClustering•TeachingPurpose:Theconceptofclusteringisintroduced.Besides,somecommonclusteringmethodsandconditionsarerepresented.•Request:Grasptheconceptofclusteringandmastersomecommonclusteringmethods.•Focus:TheStandardizationIssue;Treeclustering;K-MeansClusteringContents•UnsupervisedClassification•TheStandardizationIssue•TreeClustering•DimensionalReduction•K-MeansClustering•ClusterValidation3.1UnsupervisedClassification•Unsupervisedclassification:Theclassificationsystemsdesigneddesignedusingasocalledunsupervisedapproach(classifyingthingswithoutanypreviousknowledge)todiscovertheinternalsimilaritystructureofthepatternsinausefulway.•Thepresentchapterisdedicatedtotheunsupervisedclassificationoffeaturevectors,alsocalleddataclustering.Regardingthetypeofamalgamationapproachtwocategoriesofalgorithmscanbeconsidered:•Hierarchicalalgorithms•Centroidadjustmentalgorithms3.2TheStandardizationIssueThereisanarbitrarilylargenumberofdataclusteringsolutionsthatcanberadicallydifferentfromeachother.Asanexample,considerthe+Crossdata,whichisrepresentedinFigure3.1.Figure3.1CrossdatawithEuclidianclusteringImaginethatwearesearchingfora2-clustersolutionthatminimizesthewithin-classaverageerror:iyxciiyxcedisnE,1),(tan1(3-1)withdifferentpairsofpatternsx,yincluster.smxyii/)())min()/(max())min((iiiiixxxxy))min()/(max(iiiixxxyaxyii/(3-2a)(3-2b)(3-2c)(3-2d)3.3TreeClusteringConsidertheGlobulardatashowninFigure3.2aandassumethatthesimilaritybetweentwoclustersisassessedbymeasuringthesimilarityofitsfurthestpairofpatterns.ThisisillustratedinthebinarytreeshownasaverticalicicleplotinFigure3.3a,forthecompletelinkagemergingprocessofthe18patterns.Figure3.2(a)Globular(b)filamentarydatasetsforcomparisonofclusteringmethods.Theverticalicicleplotrepresentsthehierarchicalclusteringtreeandmustbeinspectedbottom-up.Figure3.3bshowstheclusteringschedulegraph.Theseusuallycorrespondtoaplateaubeforeahighjumpinthedistancemeasure.Figure3.3(a)Verticalicicleplotfortheglobulardata;(b)Clusteringschedulegraph.Thesplittingversionofthehierarchicalclusteringalgorithm,previouslymentioned,operatesinatop-downfashion.Inthiswayadendrogramwithinversestructure,startingwithoneclusterandfinishingwithnclusters,isobtained.•SinglelinkageAlsocalledNN(nearestneighbor)rule,thesinglelinkageruleevaluatesthedissimilaritybetweentwoclusters,asthedissimilarityofthenearestpatterns,onefromeachcluster:yxdjiyxji,,min),((3-3a)•AveragelinkagebetweengroupsAlsoknownasUPGMA(un-weightedpair-groupmethodusingarithmeticaverages),thisruleassessesthedistancebetweentwoclustersastheaverageofthedistancesbetweenallpairsofpatterns,eachpatternfromadistinctcluster:•AveragelinkagewithingroupsThisrule,knownasUWGM(un-weightedwithin-groupmmethodusingarithmeticaverages),isavariantofthepreviousone,wherethedistancebetweentwoclustersistakentobetheijxyjijiyxnnd1),(ijxyjijiyxnnd1),(ijxyjijiyxnnd1),((3-3b)averageofthedistancesofallpossiblepairsofpatterns,asiftheyformedasinglecluster:•Ward'smethodInWard'smethodthesumofthesquaredwithin-clusterdistances,fortheresultingmergedclusteriscomputed:),(,)2,(1),(jiyxjijiyxnnCd(3-3c)2,1),(jixjijimxnnd(3-3d)wheremisthecentroidofthemergedclusters.3.4DimensionalReductionLetusconsidertheeigenvectorsofthecorkstoppersdata(c=2)andretainthefirsttwoprincipalcomponentsorfactors.Thecoefficientsneededforthetransformationinatwo-dimensionalspacewithnewfeatures(factors)Factor1andFactor2,asalinearcombinationoftheoriginalfeaturesareshowninFigure3.4a.TherepresentationofthepatternsinthisnewspaceisshowninFigure3.4b.Therelationbetweenthefactorsandtheoriginalfeaturescanbeappreciatedthroughtherespectivecorrelationvalues,alsocalledfactorloadings,showninFigure3.5a.SignificantvaluesappearFigure3.4.Dimensionalityreductionofthefirsttwoclassesofcorkstoppersusingtwoeigenvectors.(a)Eigenvectorcoefficients;(b)Eigenvectorscatterplotinblack.AplotofthefactorloadingsisshowninFigure3.5b.MultidimensionalscalingisanothermethodofrepresentingdatainasmallernumberofdimensionspreservingasmuchasFigure3.5.Factorloadingstable(a)andgraph(b)forthefirsttwoclassesofcorkstoppers.possiblethesimilaritystructureofthedata.Thefollowingquadraticerrormeasureknownasstressisiterativelyminimized:2,*))],((),([jijijixxdfxxdp(3-4)where:-xi,xjareanyarbitrarypairofdistinctpatterns(i≠j)-d(xi,xj)istheoriginaldissimilaritybetweenxiandxj;-d*(xi,xj)isthetransformeddissimilarityinthelowerdimensionalspace;-fisamonotonictransformationfunction.3.5K-MeansClusteringThecentralideaofthek-MeansorISODATAclusteringalgorithmistominimizeanoverallwithin-clusterdistancefromthepatternstothecentroids.Alocalminimumoftheoverallwithin-clusterdistanceissearchedbyiterativelyadjustingcclustercentroids,mi,andbyassigningeachpatterntotheclosestcentroid:21cjxjijimxE(3-5)Asfortheinitialchoiceofcentroids,itcanbedoneinseveralways.StatisticsandSPSSprovidethefollowingalternatives:•Specifywhic