Cluster Analysis Literature: English Translation

School of Electrical and Information Engineering
Foreign Literature Translation

English title: Data Mining - Clustering
Translated title: Data Mining: Cluster Analysis
Major: Automation
Name: ****
Class and student ID: ****
Advisor: ******
Source of translation: Data Mining, by Ian H. Witten and Eibe Frank
April 26, 2010

Clustering

5.1 INTRODUCTION

Clustering is similar to classification in that data are grouped. However, unlike classification, the groups are not predefined. Instead, the grouping is accomplished by finding similarities between data according to characteristics found in the actual data. The groups are called clusters. Some authors view clustering as a special type of classification. In this text, however, we follow a more conventional view in that the two are different. Many definitions for clusters have been proposed:

- Set of like elements. Elements from different clusters are not alike.
- The distance between points in a cluster is less than the distance between a point in the cluster and any point outside it.

A term similar to clustering is database segmentation, where like tuples (records) in a database are grouped together. This is done to partition or segment the database into components that then give the user a more general view of the data. In this text, we do not differentiate between segmentation and clustering. A simple example of clustering is found in Example 5.1. This example illustrates the fact that determining how to do the clustering is not straightforward.

As illustrated in Figure 5.1, a given set of data may be clustered on different attributes. Here a group of homes in a geographic area is shown. The first type of clustering is based on the location of the home. Homes that are geographically close to each other are clustered together. In the second clustering, homes are grouped based on the size of the house.

Clustering has been used in many application domains, including biology, medicine, anthropology, marketing, and economics. Clustering applications include plant and animal classification, disease classification, image processing, pattern recognition, and document retrieval. One of the first domains in which clustering was used was biological taxonomy. Recent uses include examining Web log data to detect usage patterns.

When clustering is applied to a real-world database, many interesting problems occur:

- Outlier handling is difficult. Here the elements do not naturally fall into any cluster. They can be viewed as solitary clusters. However, if a clustering algorithm attempts to find larger clusters, these outliers will be forced to be placed in some cluster. This process may result in the creation of poor clusters by combining two existing clusters and leaving the outlier in its own cluster.
- Dynamic data in the database implies that cluster membership may change over time.
- Interpreting the semantic meaning of each cluster may be difficult. With classification, the labeling of the classes is known ahead of time. However, with clustering, this may not be the case. Thus, when the clustering process finishes creating a set of clusters, the exact meaning of each cluster may not be obvious. Here is where a domain expert is needed to assign a label or interpretation for each cluster.
- There is no one correct answer to a clustering problem. In fact, many answers may be found. The exact number of clusters required is not easy to determine. Again, a domain expert may be required. For example, suppose we have a set of data about plants that have been collected during a field trip. Without any prior knowledge of plant classification, if we attempt to divide this set of data into similar groupings, it would not be clear how many groups should be created.
- Another related issue is what data should be used for clustering. Unlike learning during a classification process, where there is some a priori knowledge concerning what the attributes of each classification should be, in clustering we have no supervised learning to aid the process. Indeed, clustering can be viewed as similar to unsupervised learning.

We can then summarize some basic features of clustering (as opposed to classification):

- The (best) number of clusters is not known.
- There may not be any a priori knowledge concerning the clusters.
- Cluster results are dynamic.

The clustering problem is stated as shown in Definition 5.1. Here we assume that the number of clusters to be created is an input value, $k$. The actual content (and interpretation) of each cluster, $K_j$, $1 \le j \le k$, is determined as a result of the function definition. Without loss of generality, we will view that the result of solving a clustering problem is that a set of clusters is created: $K = \{K_1, K_2, \ldots, K_k\}$.

DEFINITION 5.1. Given a database $D = \{t_1, t_2, \ldots, t_n\}$ of tuples and an integer value $k$, the clustering problem is to define a mapping $f: D \to \{1, \ldots, k\}$ where each $t_i$ is assigned to one cluster $K_j$, $1 \le j \le k$. A cluster $K_j$ contains precisely those tuples mapped to it; that is, $K_j = \{t_i \mid f(t_i) = K_j,\ 1 \le i \le n,\ \text{and}\ t_i \in D\}$.

A classification of the different types of clustering algorithms is shown in Figure 5.2. Clustering algorithms themselves may be viewed as hierarchical or partitional. With hierarchical clustering, a nested set of clusters is created. Each level in the hierarchy has a separate set of clusters. At the lowest level, each item is in its own unique cluster. At the highest level, all items belong to the same cluster. With hierarchical clustering, the desired number of clusters is not input. With partitional clustering, the algorithm creates only one set of clusters. These approaches use the desired number of clusters to drive how the final set is created.

Traditional clustering algorithms tend to be targeted to small numeric databases that fit into memory. There are, however, more recent clustering algorithms that look at categorical data and are targeted to larger, perhaps dynamic, databases. Algorithms targeted to larger databases may adapt to memory constraints by either sampling the database or using data structures, which can be compressed o
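
The sketch below is one way to read Definition 5.1 in code. It reuses the homes example, in which the same data is clustered once by location and once by size. It is only an illustration: the home records, the two reference points, the 150 square metre threshold, and the names `build_clusters`, `by_location` and `by_size` are all assumptions made here and do not come from the text. Each mapping plays the role of f, assigning every tuple to a cluster index between 1 and k.

```python
from math import dist

# Hypothetical database D: home id -> (x_km, y_km, size_sq_m). Values are invented.
homes = {
    "h1": (0.0, 0.0, 90),
    "h2": (0.5, 0.2, 250),
    "h3": (9.0, 9.0, 95),
    "h4": (9.5, 8.8, 240),
}


def build_clusters(database, k, f):
    """Return {j: K_j} where K_j holds exactly the tuples that f maps to j."""
    clusters = {j: set() for j in range(1, k + 1)}
    for name, record in database.items():
        j = f(record)
        if not 1 <= j <= k:
            raise ValueError(f"f mapped {name} to {j}, outside 1..{k}")
        clusters[j].add(name)
    return clusters


def by_location(record):
    # Assumed mapping: index of the nearer of two made-up reference points.
    centers = [(0.0, 0.0), (10.0, 10.0)]
    distances = [dist(record[:2], c) for c in centers]
    return distances.index(min(distances)) + 1


def by_size(record):
    # Assumed mapping: small homes go to cluster 1, large homes to cluster 2.
    return 1 if record[2] < 150 else 2


location_clusters = build_clusters(homes, 2, by_location)
size_clusters = build_clusters(homes, 2, by_size)
print(location_clusters)
print(size_clusters)
```

With these invented records, the location mapping groups h1 with h2 and h3 with h4, while the size mapping groups h1 with h3 and h2 with h4. Both are valid answers under the definition, which is consistent with the point that a clustering problem has no single correct answer.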
