Abstract

Data mining (DM) is an active research field that draws on statistics, artificial intelligence, databases, and related disciplines. It extracts interesting, previously hidden, and usable knowledge from data and presents it in a form that users can understand. Classification is an important branch of data mining: it builds a model that describes data classes or concepts, so that the model can be used to predict the class of objects whose class label is unknown.

The earliest decision tree algorithm is CLS, proposed by Hunt et al. in 1966. The most influential decision tree algorithms today are ID3 and C4.5, proposed by Quinlan in 1986 and 1993 respectively. ID3 can handle only discrete descriptive attributes. It splits the training samples on the attribute with the largest information gain, with the aim of minimizing the entropy of the system at each branch and thereby improving the speed and accuracy of the algorithm. The main drawback of ID3 is that, when information gain is used as the criterion for choosing the branching attribute, the choice is biased toward attributes with many distinct values, and in some cases such attributes provide little useful information. C4.5 is an improvement on ID3: it handles continuous descriptive attributes as well as discrete ones, and it uses the information gain ratio as the criterion for choosing the branching attribute, which compensates for this weakness of ID3.

This thesis studies classification techniques based on decision trees. The C4.5 algorithm is applied to classify a data set and generate a decision tree: the data are first preprocessed, an induction algorithm then produces readable rules and a decision tree, and the resulting decision tree is finally used to analyze new data.

Keywords: data mining; classification; decision tree; C4.5

Northwest Normal University, School of Mathematics and Information Science, Undergraduate Thesis

Contents

Chapter 1 Introduction
  1.1 Research Background and Significance
  1.2 Research Status at Home and Abroad
    1.2.1 Research Status Abroad
    1.2.2 Research Status in China
Chapter 2 Literature Review
  2.1 A Brief History of Data Mining
  2.2 Data Mining Fundamentals
  2.3 Data Mining Functions
    2.3.1 Concept Description: Characterization and Comparison
    2.3.2 Association Analysis
    2.3.3 Classification and Prediction
    2.3.4 Cluster Analysis
    2.3.5 Outlier Analysis
    2.3.6 Evolution Analysis
Chapter 3 Decision Tree Algorithms
  3.1 Definition of a Decision Tree
  3.2 Advantages of Decision Trees
  3.3 Structure Diagram of a Decision Tree
  3.4 Building a Decision Tree
  3.5 Tree Pruning
  3.6 Generating Classification Rules
Chapter 4 Classifying the weather Data with the C4.5 Algorithm
  4.1 Shortcomings of the ID3 Algorithm
  4.2 Improvements Made by the C4.5 Algorithm
    4.2.1 Selecting Attributes by Information Gain Ratio
    4.2.2 Handling Continuous Numeric Attributes
    4.2.3 Using a Post-Pruning Method