自动孔子第三章

golem0907
2 ℃
2020-03-09

整理文档很辛苦，赏杯茶钱您下走！

还剩 ... 页未读，继续阅读 >>

免费阅读已结束，点击下载阅读编辑剩下 ... 页

阅读已结束，您可以下载文档离线阅读编辑

资源描述

PatternRecognitionChapter3DataClustering(PatternRecognitionConcepts,MethodsandApplicationsJ.P.MarquesdeSaSpringer)Chapter3DataClustering•TeachingPurpose:Theconceptofclusteringisintroduced.Besides,somecommonclusteringmethodsandconditionsarerepresented.•Request：Grasptheconceptofclusteringandmastersomecommonclusteringmethods.•Focus：TheStandardizationIssue;Treeclustering;K-MeansClusteringContents•UnsupervisedClassification•TheStandardizationIssue•TreeClustering•DimensionalReduction•K-MeansClustering•ClusterValidation3.1UnsupervisedClassification•Unsupervisedclassification:Theclassificationsystemsdesigneddesignedusingasocalledunsupervisedapproach(classifyingthingswithoutanypreviousknowledge)todiscovertheinternalsimilaritystructureofthepatternsinausefulway.•Thepresentchapterisdedicatedtotheunsupervisedclassificationoffeaturevectors,alsocalleddataclustering.Regardingthetypeofamalgamationapproachtwocategoriesofalgorithmscanbeconsidered:•Hierarchicalalgorithms•Centroidadjustmentalgorithms3.2TheStandardizationIssueThereisanarbitrarilylargenumberofdataclusteringsolutionsthatcanberadicallydifferentfromeachother.Asanexample,considerthe+Crossdata,whichisrepresentedinFigure3.1.Figure3.1CrossdatawithEuclidianclusteringImaginethatwearesearchingfora2-clustersolutionthatminimizesthewithin-classaverageerror:iyxciiyxcedisnE,1),(tan1(3-1)withdifferentpairsofpatternsx,yincluster.smxyii/)())min()/(max())min((iiiiixxxxy))min()/(max(iiiixxxyaxyii/(3-2a)(3-2b)(3-2c)(3-2d)3.3TreeClusteringConsidertheGlobulardatashowninFigure3.2aandassumethatthesimilaritybetweentwoclustersisassessedbymeasuringthesimilarityofitsfurthestpairofpatterns.ThisisillustratedinthebinarytreeshownasaverticalicicleplotinFigure3.3a,forthecompletelinkagemergingprocessofthe18patterns.Figure3.2(a)Globular(b)filamentarydatasetsforcomparisonofclusteringmethods.Theverticalicicleplotrepresentsthehierarchicalclusteringtreeandmustbeinspectedbottom-up.Figure3.3bshowstheclusteringschedulegraph.Theseusuallycorrespondtoaplateaubeforeahighjumpinthedistancemeasure.Figure3.3(a)Verticalicicleplotfortheglobulardata;(b)Clusteringschedulegraph.Thesplittingversionofthehierarchicalclusteringalgorithm,previouslymentioned,operatesinatop-downfashion.Inthiswayadendrogramwithinversestructure,startingwithoneclusterandfinishingwithnclusters,isobtained.•SinglelinkageAlsocalledNN(nearestneighbor)rule,thesinglelinkageruleevaluatesthedissimilaritybetweentwoclusters,asthedissimilarityofthenearestpatterns,onefromeachcluster:yxdjiyxji,,min),((3-3a)•AveragelinkagebetweengroupsAlsoknownasUPGMA(un-weightedpair-groupmethodusingarithmeticaverages),thisruleassessesthedistancebetweentwoclustersastheaverageofthedistancesbetweenallpairsofpatterns,eachpatternfromadistinctcluster:•AveragelinkagewithingroupsThisrule,knownasUWGM(un-weightedwithin-groupmmethodusingarithmeticaverages),isavariantofthepreviousone,wherethedistancebetweentwoclustersistakentobetheijxyjijiyxnnd1),(ijxyjijiyxnnd1),(ijxyjijiyxnnd1),((3-3b)averageofthedistancesofallpossiblepairsofpatterns,asiftheyformedasinglecluster:•Ward'smethodInWard'smethodthesumofthesquaredwithin-clusterdistances,fortheresultingmergedclusteriscomputed:),(,)2,(1),(jiyxjijiyxnnCd(3-3c)2,1),(jixjijimxnnd(3-3d)wheremisthecentroidofthemergedclusters.3.4DimensionalReductionLetusconsidertheeigenvectorsofthecorkstoppersdata(c=2)andretainthefirsttwoprincipalcomponentsorfactors.Thecoefficientsneededforthetransformationinatwo-dimensionalspacewithnewfeatures(factors)Factor1andFactor2,asalinearcombinationoftheoriginalfeaturesareshowninFigure3.4a.TherepresentationofthepatternsinthisnewspaceisshowninFigure3.4b.Therelationbetweenthefactorsandtheoriginalfeaturescanbeappreciatedthroughtherespectivecorrelationvalues,alsocalledfactorloadings,showninFigure3.5a.SignificantvaluesappearFigure3.4.Dimensionalityreductionofthefirsttwoclassesofcorkstoppersusingtwoeigenvectors.(a)Eigenvectorcoefficients;(b)Eigenvectorscatterplotinblack.AplotofthefactorloadingsisshowninFigure3.5b.MultidimensionalscalingisanothermethodofrepresentingdatainasmallernumberofdimensionspreservingasmuchasFigure3.5.Factorloadingstable(a)andgraph(b)forthefirsttwoclassesofcorkstoppers.possiblethesimilaritystructureofthedata.Thefollowingquadraticerrormeasureknownasstressisiterativelyminimized:2,*))],((),([jijijixxdfxxdp(3-4)where:-xi,xjareanyarbitrarypairofdistinctpatterns(i≠j)-d(xi,xj)istheoriginaldissimilaritybetweenxiandxj;-d*(xi,xj)isthetransformeddissimilarityinthelowerdimensionalspace;-fisamonotonictransformationfunction.3.5K-MeansClusteringThecentralideaofthek-MeansorISODATAclusteringalgorithmistominimizeanoverallwithin-clusterdistancefromthepatternstothecentroids.Alocalminimumoftheoverallwithin-clusterdistanceissearchedbyiterativelyadjustingcclustercentroids,mi,andbyassigningeachpatterntotheclosestcentroid:21cjxjijimxE(3-5)Asfortheinitialchoiceofcentroids,itcanbedoneinseveralways.StatisticsandSPSSprovidethefollowingalternatives:•Specifywhic

自动孔子 第三章

自动孔子第三章