电气信息工程学院外文翻译英文名称:Datamining-clustering译文名称:数据挖掘—聚类分析专业:自动化姓名:****班级学号:****指导教师:******译文出处:Datamining:IanH.Witten,EibeFrank著二○一○年四月二十六日Clustering5.1INTRODUCTIONClusteringissimilartoclassificationinthatdataaregrouped.However,unlikeclassification,thegroupsarenotpredefined.Instead,thegroupingisaccomplishedbyfindingsimilaritiesbetweendataaccordingtocharacteristicsfoundintheactualdata.Thegroupsarecalledclusters.Someauthorsviewclusteringasaspecialtypeofclassification.Inthistext,however,wefollowamoreconventionalviewinthatthetwoaredifferent.Manydefinitionsforclustershavebeenproposed:Setoflikeelements.Elementsfromdifferentclustersarenotalike.Thedistancebetweenpointsinaclusterislessthanthedistancebetweenapointintheclusterandanypointoutsideit.Atermsimilartoclusteringisdatabasesegmentation,whereliketuple(record)inadatabasearegroupedtogether.Thisisdonetopartitionorsegmentthedatabaseintocomponentsthatthengivetheuseramoregeneralviewofthedata.Inthiscasetext,wedonotdifferentiatebetweensegmentationandclustering.AsimpleexampleofclusteringisfoundinExample5.1.Thisexampleillustratesthefactthatthatdetermininghowtodotheclusteringisnotstraightforward.AsillustratedinFigure5.1,agivensetofdatamaybeclusteredondifferentattributes.Hereagroupofhomesinageographicareaisshown.Thefirstfloortypeofclusteringisbasedonthelocationofthehome.Homesthataregeographicallyclosetoeachotherareclusteredtogether.Inthesecondclustering,homesaregroupedbasedonthesizeofthehouse.Clusteringhasbeenusedinmanyapplicationdomains,includingbiology,medicine,anthropology,marketing,andeconomics.Clusteringapplicationsincludeplantandanimalclassification,diseaseclassification,imageprocessing,patternrecognition,anddocumentretrieval.Oneofthefirstdomainsinwhichclusteringwasusedwasbiologicaltaxonomy.RecentusesincludeexaminingWeblogdatatodetectusagepatterns.Whenclusteringisappliedtoareal-worlddatabase,manyinterestingproblemsoccur:Outlierhandlingisdifficult.Heretheelementsdonotnaturallyfallintoanycluster.Theycanbeviewedassolitaryclusters.However,ifaclusteringalgorithmattemptstofindlargerclusters,theseoutlierswillbeforcedtobeplacedinsomecluster.Thisprocessmayresultinthecreationofpoorclustersbycombiningtwoexistingclustersandleavingtheoutlierinitsowncluster.Dynamicdatainthedatabaseimpliesthatclustermembershipmaychangeovertime.Interpretingthesemanticmeaningofeachclustermaybedifficult.Withclassification,thelabelingoftheclassesisknownaheadoftime.However,withclustering,thismaynotbethecase.Thus,whentheclusteringprocessfinishescreatingasetofclusters,theexactmeaningofeachclustermaynotbeobvious.Hereiswhereadomainexpertisneededtoassignalabelorinterpretationforeachcluster.Thereisnoonecorrectanswertoaclusteringproblem.Infact,manyanswersmaybefound.Theexactnumberofclustersrequiredisnoteasytodetermine.Again,adomainexpertmayberequired.Forexample,supposewehaveasetofdataaboutplantsthathavebeencollectedduringafieldtrip.Withoutanypriorknowledgeofplantclassification,ifweattempttodividethissetofdataintosimilargroupings,itwouldnotbeclearhowmanygroupsshouldbecreated.Anotherrelatedissueiswhatdatashouldbeusedofclustering.Unlikelearningduringaclassificationprocess,wherethereissomeaprioriknowledgeconcerningwhattheattributesofeachclassificationshouldbe,inclusteringwehavenosupervisedlearningtoaidtheprocess.Indeed,clusteringcanbeviewedassimilartounsupervisedlearning.Wecanthensummarizesomebasicfeaturesofclustering(asopposedtoclassification):The(best)numberofclustersisnotknown.Theremaynotbeanyaprioriknowledgeconcerningtheclusters.Clusterresultsaredynamic.TheclusteringproblemisstatedasshowninDefinition5.1.Hereweassumethatthenumberofclusterstobecreatedisaninputvalue,k.Theactualcontent(andinterpretation)ofeachcluster,jk,1jk,isdeterminedasaresultofthefunctiondefinition.Withoutlossofgenerality,wewillviewthattheresultofsolvingaclusteringproblemisthatasetofclustersiscreated:K={12,,...,kkkk}.DEFINITION5.1.GivenadatabaseD={12,,...,nttt}oftuplesandanintegervaluek,theclusteringproblemistodefineamappingf:{1,...,}DkwhereeachitisassignedtooneclusterjK,1jk.AclusterjK,containspreciselythosetuplesmappedtoit;thatis,jK={|(),1,iijtftKinanditD}.AclassificationofthedifferenttypesofclusteringalgorithmsisshowninFigure5.2.Clusteringalgorithmsthemselvesmaybeviewedashierarchicalorpartitional.Withhierarchicalclustering,anestedsetofclustersiscreated.Eachlevelinthehierarchyhasaseparatesetofclusters.Atthelowestlevel,eachitemisinitsownuniquecluster.Atthehighestlevel,allitemsbelongtothesamecluster.Withhierarchicalclustering,thedesirednumberofclustersisnotinput.Withpartitionalclustering,thealgorithmcreatesonlyonesetofclusters.Theseapproachesusethedesirednumberofclusterstodrivehowthefinalsetiscreated.Traditionalclusteringalgorithmstendtobetargetedtosmallnumericdatabasethatfitintomemory.Thereare,however,morerecentclusteringalgorithmsthatlookatcategoricaldataandaretargetedtolarger,perhapsdynamic,databases.Algorithmstargetedtolargerdatabasesmayadapttomemoryconstraintsbyeithersamplingthedatabaseorusingdatastructures,whichcanbecompressedo