a-density-based-algorithm-for-discovering-clusters

整理文档很辛苦,赏杯茶钱您下走!

免费阅读已结束,点击下载阅读编辑剩下 ...

阅读已结束,您可以下载文档离线阅读编辑

资源描述

AbstractClusteringalgorithmsareattractiveforthetaskofclassiden-tificationinspatialdatabases.However,theapplicationtolargespatialdatabasesrisesthefollowingrequirementsforclusteringalgorithms:minimalrequirementsofdomainknowledgetodeterminetheinputparameters,discoveryofclusterswitharbitraryshapeandgoodefficiencyonlargeda-tabases.Thewell-knownclusteringalgorithmsoffernosolu-tiontothecombinationoftheserequirements.Inthispaper,wepresentthenewclusteringalgorithmDBSCANrelyingonadensity-basednotionofclusterswhichisdesignedtodis-coverclustersofarbitraryshape.DBSCANrequiresonlyoneinputparameterandsupportstheuserindetermininganap-propriatevalueforit.Weperformedanexperimentalevalua-tionoftheeffectivenessandefficiencyofDBSCANusingsyntheticdataandrealdataoftheSEQUOIA2000bench-mark.Theresultsofourexperimentsdemonstratethat(1)DBSCANissignificantlymoreeffectiveindiscoveringclus-tersofarbitraryshapethanthewell-knownalgorithmCLAR-ANS,andthat(2)DBSCANoutperformsCLARANSbyafactorofmorethan100intermsofefficiency.Keywords:ClusteringAlgorithms,ArbitraryShapeofClus-ters,EfficiencyonLargeSpatialDatabases,HandlingNoise.1.IntroductionNumerousapplicationsrequirethemanagementofspatialdata,i.e.datarelatedtospace.SpatialDatabaseSystems(SDBS)(Gueting1994)aredatabasesystemsfortheman-agementofspatialdata.Increasinglylargeamountsofdataareobtainedfromsatelliteimages,X-raycrystallographyorotherautomaticequipment.Therefore,automatedknow-ledgediscoverybecomesmoreandmoreimportantinspatialdatabases.Severaltasksofknowledgediscoveryindatabases(KDD)havebeendefinedintheliterature(Matheus,Chan&Pi-atetsky-Shapiro1993).Thetaskconsideredinthispaperisclassidentification,i.e.thegroupingoftheobjectsofadata-baseintomeaningfulsubclasses.Inanearthobservationda-tabase,e.g.,wemightwanttodiscoverclassesofhousesalongsomeriver.Clusteringalgorithmsareattractiveforthetaskofclassidentification.However,theapplicationtolargespatialdata-basesrisesthefollowingrequirementsforclusteringalgo-rithms:(1)Minimalrequirementsofdomainknowledgetodeter-minetheinputparameters,becauseappropriatevaluesareoftennotknowninadvancewhendealingwithlargedatabases.(2)Discoveryofclusterswitharbitraryshape,becausetheshapeofclustersinspatialdatabasesmaybespherical,drawn-out,linear,elongatedetc.(3)Goodefficiencyonlargedatabases,i.e.ondatabasesofsignificantlymorethanjustafewthousandobjects.Thewell-knownclusteringalgorithmsoffernosolutiontothecombinationoftheserequirements.Inthispaper,wepresentthenewclusteringalgorithmDBSCAN.Itrequiresonlyoneinputparameterandsupportstheuserindetermin-inganappropriatevalueforit.Itdiscoversclustersofarbi-traryshape.Finally,DBSCANisefficientevenforlargespa-tialdatabases.Therestofthepaperisorganizedasfollows.Wediscussclusteringalgorithmsinsection2evaluatingthemaccordingtotheaboverequirements.Insection3,wepresentournotionofclusterswhichisbasedontheconceptofdensityinthedatabase.Section4introducesthealgo-rithmDBSCANwhichdiscoverssuchclustersinaspatialdatabase.Insection5,weperformedanexperimentalevalu-ationoftheeffectivenessandefficiencyofDBSCANusingsyntheticdataanddataoftheSEQUOIA2000benchmark.Section6concludeswithasummaryandsomedirectionsforfutureresearch.2.ClusteringAlgorithmsTherearetwobasictypesofclusteringalgorithms(Kaufman&Rousseeuw1990):partitioningandhierarchicalalgo-rithms.Partitioningalgorithmsconstructapartitionofada-tabaseDofnobjectsintoasetofkclusters.kisaninputpa-rameterforthesealgorithms,i.esomedomainknowledgeisrequiredwhichunfortunatelyisnotavailableformanyap-plications.ThepartitioningalgorithmtypicallystartswithaninitialpartitionofDandthenusesaniterativecontrolstrategytooptimizeanobjectivefunction.Eachclusterisrepresentedbythegravitycenterofthecluster(k-meansal-gorithms)orbyoneoftheobjectsoftheclusterlocatednearitscenter(k-medoidalgorithms).Consequently,partitioningalgorithmsuseatwo-stepprocedure.First,determinekrep-resentativesminimizingtheobjectivefunction.Second,as-signeachobjecttotheclusterwithitsrepresentative“clos-est”totheconsideredobject.Thesecondstepimpliesthatapartitionisequivalenttoavoronoidiagramandeachclusteriscontainedinoneofthevoronoicells.Thus,theshapeofallADensity-BasedAlgorithmforDiscoveringClustersinLargeSpatialDatabaseswithNoiseMartinEster,Hans-PeterKriegel,JörgSander,XiaoweiXuInstituteforComputerScience,UniversityofMunichOettingenstr.67,D-80538München,Germany{ester|kriegel|sander|xwxu}@informatik.uni-muenchen.dePublishedinProceedingsof2ndInternationalConferenceonKnowledgeDiscoveryandDataMining(KDD-96)clustersfoundbyapartitioningalgorithmisconvexwhichisveryrestrictive.Ng&Han(1994)explorepartitioningalgorithmsforKDDinspatialdatabases.AnalgorithmcalledCLARANS(ClusteringLargeApplicationsbasedonRANdomizedSearch)isintroducedwhichisanimprovedk-medoidmeth-od.Comparedtoformerk-medoidalgorithms,CLARANSismoreeffectiveandmoreefficient.Anexperimentalevalu-ationindicatesthatCLARANSrunsefficientlyondatabasesofthousandsofobjects.Ng&Han(1994)alsodiscussmeth-odstodeterminethe“natural”numberknatofclustersinadatabase.TheyproposetorunCLARANSonceforeachkfrom2ton.Foreachofthedi

1 / 6
下载文档,编辑使用

©2015-2020 m.777doc.com 三七文档.

备案号:鲁ICP备2024069028号-1 客服联系 QQ:2149211541

×
保存成功