Instructor:AssociateProf.YingLiuGraduateUniversityofChineseAcademyofSciencesFictitiousEconomyandDataScienceResearchCenterDataMining2011/10/312ReviewDatamining—coreofknowledgediscoveryprocessDataCleaningandIntegrationDatabasesDataWarehouseSelectionandTransformationDataMiningPatternEvaluationFlatfiles2011/10/313ClusterAnalysisWhatisClusterAnalysis?TypesofDatainClusterAnalysisACategorizationofMajorClusteringMethodsPartitioningMethodsHierarchicalMethodsDensity-BasedMethodsGrid-BasedMethodsOutlierAnalysisSummary2011/10/314WhatisClusterAnalysis?Cluster:acollectionofdataobjectsSimilartooneanotherwithinthesameclusterDissimilartotheobjectsinotherclustersClusteranalysisFindingsimilaritiesbetweendataaccordingtothecharacteristicsfoundinthedataandgroupingsimilardataobjectsintoclustersUnsupervisedlearning:nopredefinedclassesTypicalapplicationsAsastand-alonetooltogetinsightintodatadistributionAsapreprocessingstepforotheralgorithms2011/10/315Clustering:RichApplicationsandMultidisciplinaryEffortsPatternRecognitionGISCreatethematicmapsinGISbyclusteringfeaturespacesDetectspatialclustersorforotherspatialminingtasksImageProcessingEconomicScience(especiallymarketresearch)SoftwarepackageS-Plus,SPSS,SAS2011/10/316ExamplesofClusteringApplicationsMarketing:Helpmarketersdiscoverdistinctgroupsintheircustomerbases,andthenusethisknowledgetodeveloptargetedmarketingprogramsLanduse:IdentificationofareasofsimilarlanduseinanearthobservationdatabaseInsurance:IdentifyinggroupsofmotorinsurancepolicyholderswithahighaverageclaimcostCity-planning:Identifyinggroupsofhousesaccordingtotheirhousetype,value,andgeographicallocation2011/10/317ExamplesofClusteringApplicationsEarth-quakestudies:ObservedearthquakeepicentersshouldbeclusteredalongcontinentfaultsBiology:categorizegeneswithsimilarfunctionalityDocumentclassificationClusterWeblogdatatodiscovergroupsofsimilaraccessingpatterns2011/10/318WhatIsGoodClustering?Agoodclusteringmethodwillproducehighqualityclusterswithhighintra-classsimilaritylowinter-classsimilarityThequalityofaclusteringresultdependsonboththesimilaritymeasureusedbythemethodanditsimplementationThequalityofaclusteringmethodisalsomeasuredbyitsabilitytodiscoversomeorallofthehiddenpatterns2011/10/319RequirementsofClusteringScalabilityAbilitytodealwithdifferenttypesofattributesAbilitytohandledynamicdataDiscoveryofclusterswitharbitraryshapeMinimalrequirementsfordomainknowledgetodetermineinputparametersAbletodealwithnoiseandoutliersInsensitivetoorderofinputrecordsHighdimensionalityIncorporationofuser-specifiedconstraintsInterpretabilityandusability2011/10/3110MeasuretheQualityofClusteringDissimilarity/Similaritymetric:Similarityisexpressedintermsofadistancefunction,typicallymetric:d(i,j)Thedefinitionsofdistancefunctionsareusuallyverydifferentforinterval-scaled,boolean,categorical,ordinalratio,andvectorvariablesWeightsmaybeassignedtodifferentvariablesbasedonapplicationsanddatasemanticsItishardtodefine―similarenough‖or―goodenough‖Theansweristypicallyhighlysubjective2011/10/3111ClusterAnalysisWhatisClusterAnalysis?TypesofDatainClusterAnalysisACategorizationofMajorClusteringMethodsPartitioningMethodsHierarchicalMethodsDensity-BasedMethodsGrid-BasedMethodsOutlierAnalysisSummary2011/10/3112DataStructuresDatamatrix(twomodes)Dissimilaritymatrix(onemode)npx...nfx...n1x...............ipx...ifx...i1x...............1px...1fx...11x0...)2,()1,(:::)2,3()...ndnd0dd(3,10d(2,1)02011/10/3113TypeofDatainClusteringAnalysisInterval-scaledvariablesBinaryvariablesNominalvariablesOrdinalvariablesratiovariablesVariablesofmixedtypes2011/10/3114Interval-valuedVariablesStandardizedataCalculatethemeanabsolutedeviation:whereCalculatethestandardizedmeasurement(z-score)Usingmeanabsolutedeviationismorerobustthanusingstandarddeviation.)...211nffffxx(xnm|)|...|||(|121fnffffffmxmxmxnsffififsmxz2011/10/3115SimilarityandDissimilarityBetweenObjectsDistancesarenormallyusedtomeasurethesimilarityordissimilaritybetweentwodataobjectsSomepopularonesinclude:Minkowskidistance:wherei=(xi1,xi2,…,xip)andj=(xj1,xj2,…,xjp)aretwop-dimensionaldataobjects,andqisapositiveintegerIfq=1,disManhattandistanceqqppqqjxixjxixjxixjid)||...|||(|),(2211||...||||),(2211ppjxixjxixjxixjid2011/10/3116SimilarityandDissimilarityBetweenObjectsIfq=2,disEuclideandistance:Properties•d(i,j)0•d(i,i)=0•d(i,j)=d(j,i)•d(i,j)d(i,k)+d(k,j)Also,onecanuseweighteddistance)||...|||(|),(2222211ppjxixjxixjxixjid2011/10/3117BinaryVariablesAcontingencytableforbinarydataDistancemeasureforsymmetricbinaryvariables:Distancemeasureforasymmetricbinaryvariables:dcbacbjid),(cbacbjid),(pdbcasumdcdcbabasum0101ObjectiObjectj2011/10/3118DissimilaritybetweenBinaryVariablesExamplegenderisasymmetricattributetheremainingattributesareasymmetricbinaryletthevaluesYandPbesetto1,andthevalueNbesetto0NameGenderFeverCoughTest-1Test-2Test-3Test-4JackMYNPNNNMaryFYNPNPNJimMYPNNNN75.02112