哈尔滨工业大学数据挖掘理论与算法实验报告(2015年度秋季学期)课程编码S1300019C授课教师邹兆年学生姓名谢浩哲学号15S103172学院计算机科学与技术学院哈尔滨工业大学Page1of10Designedby谢浩哲NOTE:本报告所涉及的全部代码均已在GitHub上开源:一、实验内容NOTE:各算法的实现思想将在下一节阐述.1.K-MeansK-meansclusteringisamethodofvectorquantization,originallyfromsignalprocessing,thatispopularforclusteranalysisindatamining.k-meansclusteringaimstopartitionnobservationsintokclustersinwhicheachobservationbelongstotheclusterwiththenearestmean,servingasaprototypeofthecluster.2.AGNES(层次聚类)AGNES,knownasAgglomerativeHierarchicalclustering.Thisalgorithmworksbygroupingthedataonebyoneonthebasisofthenearestdistancemeasureofallthepairwisedistancebetweenthedatapoint.Againdistancebetweenthedatapointisrecalculatedbutwhichdistancetoconsiderwhenthegroupshasbeenformed?Forthistherearemanyavailablemethods.Someofthemare:-Single-nearestdistanceorsinglelinkage-Complete-farthestdistanceorcompletelinkage-Average-averagedistanceoraveragelinkage-Centroiddistance-Ward'smethod-sumofsquaredEuclideandistanceisminimized3.DBSCANDensity-basedspatialclusteringofapplicationswithnoise(DBSCAN)isadataclusteringalgorithm.Itisadensity-basedclusteringalgorithm:givenasetofpointsinsomespace,itgroupstogetherpointsthatarecloselypackedtogether(pointswithmanynearbyneighbors),markingasoutliers'pointsthatliealoneinlow-densityregions(whosenearestneighborsaretoofaraway).DBSCANisoneofthemostcommonclusteringalgorithmsandalsomostcitedinscientificliterature.二、实验设计1.K-Means算法思想:任意选取点集中的k个点作为中心,对每一个点与k个中心进行对比,划分至以这k个中心为中心点的簇中.划分结束后,重新计算每一个簇的中心点.重复以上过程,直至这些中心点不再变化.哈尔滨工业大学Page2of10Designedby谢浩哲程序流程图:核心代码:1publicclassKMeans{2publicCluster[]getClusters(intk,Point[]points){3if(k=0||k=points.length){4returnnull;5}67Cluster[]clusters=getInitialClusters(k,points);8Cluster[]newClusters=null;9do{10newClusters=getClusters(k,points,clusters);1112if(isClustersTheSame(clusters,newClusters)){13break;哈尔滨工业大学Page3of10Designedby谢浩哲14}15clusters=newClusters;16}while(true);17returnclusters;18}1920privateCluster[]getClusters(intk,Point[]points,Cluster[]cluster){21for(inti=0;ipoints.length;++i){22PointcurrentPoint=points[i];23Clusterc=getClosestClusters(currentPoint,cluster);24c.points.add(currentPoint);25}2627Cluster[]newClusters=newCluster[k];28for(inti=0;ik;++i){29Clusterc=cluster[i];30intnumberOfPointsInCluster=c.points.size();3132if(numberOfPointsInCluster==0){33//Iftheclusterisempty34intrandomIndex=(int)(Math.random()*points.length);35newClusters[i]=newCluster(points[randomIndex]);36}else{37//Iftheclusterisnotempty38doublenewCentroidX=0;39doublenewCentroidY=0;40for(intj=0;jnumberOfPointsInCluster;++j){41Pointp=c.points.get(j);42newCentroidX+=p.x;43newCentroidY+=p.y;44}45newCentroidX/=numberOfPointsInCluster;46newCentroidY/=numberOfPointsInCluster;48ClusternewCluster=newCluster(newPoint(newCentroidX,newCentroidY));49newClusters[i]=newCluster;50}51}哈尔滨工业大学Page4of10Designedby谢浩哲52returnnewClusters;53}54}2.AGNES(层次聚类)算法思想:算法选用GroupAverage作为合并估量.第一次循环选取n个点中GroupAverage最小值进行合并,将合并后的簇加入列表中,移除之前的2个簇,并重新计算该簇中的点与其他n–2个簇的GroupAverage.重复执行之前的步骤,直至所有的簇都被合并.程序流程图:哈尔滨工业大学Page5of10Designedby谢浩哲核心代码:1publicclassAgnes{2publicClustergetCluster(ListClusterclusters){3while(clusters.size()1){4doubleminProximity=Double.MAX_VALUE;5intminProximityIndex1=0,minProximityIndex2=0;67for(inti=0;iclusters.size();++i){8for(intj=i+1;jclusters.size();++j){9doubleproximity=getProximity(clusters.get(i),clusters.get(j));1011if(proximityminProximity){12minProximity=proximity;13minProximityIndex1=i;14minProximityIndex2=j;15}16}17}18Clusterc=newCluster(clusters.get(minProximityIndex1),clusters.get(minProximityIndex2));19clusters.add(c);20clusters.remove(minProximityIndex2);21clusters.remove(minProximityIndex1);22}23returnclusters.size()==0?null:clusters.get(0);24}25}3.DBSCAN算法思想:首先在所有的点集中识别出CorePoint(对其ε邻域内点的个数进行计数),再在剩余的点集中识别出CorePoint(即该点在CorePoint的ε邻域内).接着,若两个CorePoint彼此相连,他们是一个Cluster中的点,将所有的CorePoint合并成若干的Cluster.再检查所有的BorderPoint,看该BorderPoint在哪一个CorePoint的ε邻域内,将其合并至该CorePoint所在的簇.哈尔滨工业大学Page6of10Designedby谢浩哲程序流程图:核心代码:以下为该算法核心代码的实现(仅包含识别CorePoint,并将CorePoint分类成簇)1publicclassDbscan{2publicListClustergetClusters(ListPointpoints,intminPoints,doubleeps){3ListPointcorePoints=getCorePoints(points,minPoints,eps);4MapPoint,Clusterclusters=getClustersOfCorePoints(corePoints,eps);56ListPointborderPoints=getBorderPoints(points,corePoints,minPoints,eps);7getClustersOfBorderPoints(corePoints,borderPoints,clusters,eps);8哈尔滨工业大学Page7of10Designedby谢浩哲9returnnewArrayListCluster(clusters.values());10}1112privateListPointgetCorePoints(ListPointpoints,intminPoints,doubleeps){13ListPointcorePoints=newArrayListPoint();1415for(inti=0;ipoints.size();++i){16PointcurrentPoint=points.get(i);17intnumberOfPointsInEps=0;18for(intj=0;jpoints.size();++j){19PointanotherPoint=points.get(j);20if(currentPoint.isPointsInEpsCircle(anotherPoint,eps)){21++numberOfPointsInEps;22}23}24if(numberOfPointsInEps=minPoints){25currentPoint.pointType=PointType.CorePoint;26corePoints.add(currentPoint);27}28}29returncorePoi