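Undergraduate Graduation Design Thesis

Thesis title: Analysis and Secondary Development of the Java-Based WEKA Data Mining Platform
Student name: Lin Lili (林莉莉)
Student ID: 20032311
Major: Computer Science and Technology
Supervisor: Chen Huiping (陈慧萍)
Degree applied for: Bachelor of Engineering
Degree-granting institution: Hohai University (河海大学)
Thesis submission date: June 2007
College of Computer and Information Engineering (Changzhou)

Hohai University Undergraduate Graduation Design (Thesis) Task Book (Science and Engineering)

Ⅰ. Graduation design (thesis) title: Analysis and Secondary Development of the Java-Based WEKA Data Mining Platform

Ⅱ. Graduation design (thesis) work content (described in detail in terms of integrated use of knowledge, design of the research plan, use of research methods and techniques, use of the literature, data analysis and processing, drawing quality, and technical or conceptual innovation):

Data mining is currently one of the active research areas in computer science. Data mining means using machine learning algorithms to extract knowledge from large volumes of data, so it is widely used in intelligent data analysis and processing. WEKA is a Java-based data mining platform that brings together a large number of machine learning algorithms for data mining tasks, covering data preprocessing, classification, clustering, association rules, attribute selection, and visualization in a new interactive interface. Because its source code is open, WEKA can be used both for routine data mining tasks and for secondary development in data mining.

This is a research-oriented project. The student is required to read a large body of material, study data mining independently, analyze the WEKA data mining platform, and write a comprehensive literature review. The student must also combine knowledge of data structures, algorithm design and analysis, and the Java language to carry out secondary development on the WEKA platform. The specific tasks are as follows:

① Read Chinese and international literature to understand the basic methods and applications of data mining technology; study one data mining method, such as classification or clustering algorithms, in greater depth.
② Analysis of the WEKA data mining platform: read the extensive documentation of the WEKA platform, analyze how it is implemented, and understand the basic process by which WEKA performs data mining. Combine ① and ② into a literature review of the WEKA data mining tool.
③ Data mining experiments on the WEKA platform: analyze WEKA's data mining process, the dataset format WEKA requires, and the functional modules of the WEKA Explorer; prepare typical datasets; run extensive data mining test experiments on the WEKA platform; and analyze its implementation and existing problems.
④ Study the open WEKA source code and use the classes it provides for secondary development, implementing one typical data mining algorithm.

Ⅲ. Schedule:
Weeks 1-2: Define the design task.
Weeks 3-4: Read the relevant literature; translate foreign-language material.
Weeks 5-7: Write the literature review of the WEKA data mining tool.
Weeks 8-12: Mining experiments on the WEKA data mining platform.
Weeks 13-15: Secondary development on the WEKA data mining platform (in Java).
Week 16: Write the graduation thesis.
Week 17: Organize materials, package the program, and prepare for the defense.

Ⅳ. Main references:
① Jiawei Han, Micheline Kamber. Data Mining: Concepts and Techniques [M]. Beijing: China Machine Press, 2001.
② Ian H. Witten. Data Mining: Practical Machine Learning Tools and Techniques (Second Edition) [M]. Beijing: China Machine Press, 2005.
③ Michalski. Machine Learning and Data Mining [M]. Beijing: Publishing House of Electronics Industry, 2004.
④ WEKA Tutorial. Machine Learning Algorithms in Java.

Supervisor: Chen Huiping, March 1, 2007
Student name: Lin Lili; major and year: Computer Science, class of 2003
Department head's review comments (reviewing whether the topic matches the program's training objectives, whether it is tied to research or engineering practice, the degree of comprehensive training, the difficulty of the content, and the workload):
Department head:            Date:

摘要 (Chinese Abstract)

Data mining is a new technology that emerged against a background of "information explosion but knowledge scarcity". Data mining is the process of extracting implicit, previously unknown, yet potentially useful information and knowledge from large volumes of incomplete, noisy, fuzzy, and random data. The technology has broad application prospects in data analysis across many sectors, including banking, marketing, retail, insurance, and telecommunications.

This thesis first gives a fairly comprehensive survey of data mining technology and analyzes clustering methods in depth. Second, it runs data mining tests on WEKA, a typical open data mining tool from academia, covering mainly preprocessing, classification, clustering, attribute selection, association rules, and visualization. It analyzes the mining results statistically and points out the shortcomings of the WEKA system and its development prospects. To remedy some of these shortcomings, the thesis also carries out secondary development on the WEKA platform: following the described algorithm flow of the k-medoids substitution method, the algorithm is embedded into WEKA using Eclipse and then optimized to improve its clustering quality.

Although WEKA, the data mining tool studied here, is still at the research stage, it brings together a diverse set of machine learning algorithms and is an ideal choice for data mining research. In addition, the k-medoids substitution algorithm studied here improves on the traditional k-medoids algorithm by avoiding entrapment in local optima, and with optimizations such as attribute normalization and missing-value handling, the clustering quality improves markedly.

Keywords: data mining,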
WEKA, cluster analysis, k-medoids substitution algorithm

Abstract

Data mining is a new technology put forward against a background of being data rich but knowledge poor. In general, data mining is the process of extracting implicit, previously unknown, but potentially useful information and knowledge from data that is plentiful, incomplete, noisy, fuzzy, and random. The technology has wide application prospects in data analysis across dozens of fields such as banking, marketing, retailing, insurance, and telecommunications.

First, the thesis gives a comprehensive summary of data mining technology and analyzes clustering methods in depth. Second, it runs data mining tests on WEKA, a typical open data mining tool from academia. The tests mainly cover preprocessing, classification, clustering, association, attribute selection, and visualization. The thesis then analyzes the test results statistically and points out the shortcomings of the WEKA system and its development prospects. Last, to fill a gap in the WEKA system, the thesis carries out secondary development on the WEKA platform, implementing the k-medoids substitution method from its flowchart using the Eclipse IDE, and then optimizes the algorithm to improve its clustering quality.

Although WEKA, the data mining tool investigated here, is still in its research phase, it integrates a variety of machine learning methods and is therefore an ideal choice for data mining research. In addition, the k-medoids substitution method improves on the traditional k-medoids method by keeping it from getting trapped in a locally optimal solution. The thesis also applies optimizations such as attribute normalization and missing-value handling, which improve the clustering quality considerably.

Key Words: Data Mining, WEKA, Cluster Analysis, K-medoids Substitution Method

Contents

1 Introduction
1.1 Project Background
1.2 Main Work of This Thesis
1.3 Structure of This Thesis
2 Overview of Data Mining Technology
2.1 Definition of Data Mining
2.2 Basic Functions of Data Mining
2.3 The Data Mining Process
2.4 Common Data Mining Methods and Techniques
2.5 Application Areas of Data Mining
2.6 Current State of Data Mining Tools in China and Abroad
2.7 Overview of Cluster Analysis