A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise

Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu
Institute for Computer Science, University of Munich
Oettingenstr. 67, D-80538 München, Germany
{ester | kriegel | sander | xwxu}@informatik.uni-muenchen.de

Published in Proceedings of 2nd International Conference on Knowledge Discovery and Data Mining (KDD-96)

Abstract

Clustering algorithms are attractive for the task of class identification in spatial databases. However, the application to large spatial databases raises the following requirements for clustering algorithms: minimal requirements of domain knowledge to determine the input parameters, discovery of clusters with arbitrary shape and good efficiency on large databases. The well-known clustering algorithms offer no solution to the combination of these requirements. In this paper, we present the new clustering algorithm DBSCAN relying on a density-based notion of clusters which is designed to discover clusters of arbitrary shape. DBSCAN requires only one input parameter and supports the user in determining an appropriate value for it. We performed an experimental evaluation of the effectiveness and efficiency of DBSCAN using synthetic data and real data of the SEQUOIA 2000 benchmark. The results of our experiments demonstrate that (1) DBSCAN is significantly more effective in discovering clusters of arbitrary shape than the well-known algorithm CLARANS, and that (2) DBSCAN outperforms CLARANS by a factor of more than 100 in terms of efficiency.

Keywords: Clustering Algorithms, Arbitrary Shape of Clusters, Efficiency on Large Spatial Databases, Handling Noise.

1. Introduction

Numerous applications require the management of spatial data, i.e. data related to space. Spatial Database Systems (SDBS) (Gueting 1994) are database systems for the management of spatial data. Increasingly large amounts of data are obtained from satellite images, X-ray crystallography or other automatic equipment. Therefore, automated knowledge discovery becomes more and more important in spatial databases.

Several tasks of knowledge discovery in databases (KDD) have been defined in the literature (Matheus, Chan & Piatetsky-Shapiro 1993). The task considered in this paper is class identification, i.e. the grouping of the objects of a database into meaningful subclasses. In an earth observation database, e.g., we might want to discover classes of houses along some river.

Clustering algorithms are attractive for the task of class identification. However, the application to large spatial databases raises the following requirements for clustering algorithms:
(1) Minimal requirements of domain knowledge to determine the input parameters, because appropriate values are often not known in advance when dealing with large databases.
(2) Discovery of clusters with arbitrary shape, because the shape of clusters in spatial databases may be spherical, drawn-out, linear, elongated etc.
(3) Good efficiency on large databases, i.e. on databases of significantly more than just a few thousand objects.

The well-known clustering algorithms offer no solution to the combination of these requirements. In this paper, we present the new clustering algorithm DBSCAN. It requires only one input parameter and supports the user in determining an appropriate value for it. It discovers clusters of arbitrary shape. Finally, DBSCAN is efficient even for large spatial databases.

The rest of the paper is organized as follows. We discuss clustering algorithms in section 2, evaluating them according to the above requirements. In section 3, we present our notion of clusters which is based on the concept of density in the database. Section 4 introduces the algorithm DBSCAN which discovers such clusters in a spatial database. In section 5, we perform an experimental evaluation of the effectiveness and efficiency of DBSCAN using synthetic data and data of the SEQUOIA 2000 benchmark. Section 6 concludes with a summary and some directions for future research.

2. Clustering Algorithms

There are two basic types of clustering algorithms (Kaufman & Rousseeuw 1990): partitioning and hierarchical algorithms. Partitioning algorithms construct a partition of a database D of n objects into a set of k clusters. k is an input parameter for these algorithms, i.e. some domain knowledge is required which unfortunately is not available for many applications. The partitioning algorithm typically starts with an initial partition of D and then uses an iterative control strategy to optimize an objective function. Each cluster is represented by the gravity center of the cluster (k-means algorithms) or by one of the objects of the cluster located near its center (k-medoid algorithms). Consequently, partitioning algorithms use a two-step procedure. First, determine k representatives minimizing the objective function. Second, assign each object to the cluster with its representative "closest" to the considered object. The second step implies that a partition is equivalent to a Voronoi diagram and each cluster is contained in one of the Voronoi cells. Thus, the shape of all clusters found by a partitioning algorithm is convex, which is very restrictive.

Ng & Han (1994) explore partitioning algorithms for KDD in spatial databases. An algorithm called CLARANS (Clustering Large Applications based on RANdomized Search) is introduced which is an improved k-medoid method. Compared to former k-medoid algorithms, CLARANS is more effective and more efficient. An experimental evaluation indicates that CLARANS runs efficiently on databases of thousands of objects. Ng & Han (1994) also discuss methods to determine the "natural" number k_nat of clusters in a database. They propose to run CLARANS once for each k from 2 to n. For each of the di