2015/10/22DataMining:ConceptsandTechniques1分类技术(一)尤新革电子信息与通信学院国家防伪工程技术研究中心youxg@mail.hust.edu.cn2015/10/22DataMining:ConceptsandTechniques2分类的定义给定一个记录的集合(训练集)每个记录包含一系列属性,其中一个属性是类标号找到一个类标号属性的模型,它是其他属性的函数目标:尽可能精确的预测出一个未知记录所属的类别通常采用测试集合来检验模型的精度2015/10/22DataMining:ConceptsandTechniques3分类的定义分类任务就是通过学习得到一个目标函数f,把每个属性集x映射到一个预先定义的类标号y上Y=response,outputX=(x1,…,xp)=predictor,input回归Regression:y∈|R分类Classification:y∈{c1,…ck}2015/10/22DataMining:ConceptsandTechniques4分类的定义利用训练集,希望建立一个模型,当有一个输入x的时候,可以预测一个输出y预测的精确理解哪些属性影响到输出,如何影响ˆ()fx2015/10/22DataMining:ConceptsandTechniques5分类的定义ApplyModelLearnModelTidAttrib1Attrib2Attrib3Class1YesLarge125KNo2NoMedium100KNo3NoSmall70KNo4YesMedium120KNo5NoLarge95KYes6NoMedium60KNo7YesLarge220KNo8NoSmall85KYes9NoMedium75KNo10NoSmall90KYes10TidAttrib1Attrib2Attrib3Class11NoSmall55K?12YesMedium80K?13YesLarge110K?14NoSmall95K?15NoLarge67K?102015/10/22DataMining:ConceptsandTechniques6分类的定义混淆矩阵预测的类类=1类=0实际的类类=1f11f10类=0f01f00准确率=错误率=110011100100ffffff100111100100ffffff2015/10/22DataMining:ConceptsandTechniques7决策树根结点内部结点叶结点2015/10/22DataMining:ConceptsandTechniques8决策树的例子TidRefundMaritalStatusTaxableIncomeCheat1YesSingle125KNo2NoMarried100KNo3NoSingle70KNo4YesMarried120KNo5NoDivorced95KYes6NoMarried60KNo7YesDivorced220KNo8NoSingle85KYes9NoMarried75KNo10NoSingle90KYes10RefundMarStTaxIncYESNONONOYesNoMarriedSingle,Divorced80K80KSplittingAttributes训练数据模型:决策树2015/10/22DataMining:ConceptsandTechniques9决策树的例子TidRefundMaritalStatusTaxableIncomeCheat1YesSingle125KNo2NoMarried100KNo3NoSingle70KNo4YesMarried120KNo5NoDivorced95KYes6NoMarried60KNo7YesDivorced220KNo8NoSingle85KYes9NoMarried75KNo10NoSingle90KYes10MarStRefundTaxIncYESNONONOYesNoMarriedSingle,Divorced80K80KTherecouldbemorethanonetreethatfitsthesamedata!对于同一个数据集,可能会产生多个不同的树2015/10/22DataMining:ConceptsandTechniques10决策树ApplyModelLearnModelTidAttrib1Attrib2Attrib3Class1YesLarge125KNo2NoMedium100KNo3NoSmall70KNo4YesMedium120KNo5NoLarge95KYes6NoMedium60KNo7YesLarge220KNo8NoSmall85KYes9NoMedium75KNo10NoSmall90KYes10TidAttrib1Attrib2Attrib3Class11NoSmall55K?12YesMedium80K?13YesLarge110K?14NoSmall95K?15NoLarge67K?10DecisionTree2015/10/22DataMining:ConceptsandTechniques11决策树RefundMarStTaxIncYESNONONOYesNoMarriedSingle,Divorced80K80KRefundMaritalStatusTaxableIncomeCheatNoMarried80K?10TestDataStartfromtherootoftree.2015/10/22DataMining:ConceptsandTechniques12决策树RefundMarStTaxIncYESNONONOYesNoMarriedSingle,Divorced80K80KRefundMaritalStatusTaxableIncomeCheatNoMarried80K?10TestData2015/10/22DataMining:ConceptsandTechniques13决策树RefundMarStTaxIncYESNONONOYesNoMarriedSingle,Divorced80K80KRefundMaritalStatusTaxableIncomeCheatNoMarried80K?10TestData2015/10/22DataMining:ConceptsandTechniques14决策树RefundMarStTaxIncYESNONONOYesNoMarriedSingle,Divorced80K80KRefundMaritalStatusTaxableIncomeCheatNoMarried80K?10TestData2015/10/22DataMining:ConceptsandTechniques15决策树RefundMarStTaxIncYESNONONOYesNoMarriedSingle,Divorced80K80KRefundMaritalStatusTaxableIncomeCheatNoMarried80K?10TestData2015/10/22DataMining:ConceptsandTechniques16决策树RefundMarStTaxIncYESNONONOYesNoMarriedSingle,Divorced80K80KRefundMaritalStatusTaxableIncomeCheatNoMarried80K?10TestDataAssignCheatto“No”2015/10/22DataMining:ConceptsandTechniques17决策树ApplyModelLearnModelTidAttrib1Attrib2Attrib3Class1YesLarge125KNo2NoMedium100KNo3NoSmall70KNo4YesMedium120KNo5NoLarge95KYes6NoMedium60KNo7YesLarge220KNo8NoSmall85KYes9NoMedium75KNo10NoSmall90KYes10TidAttrib1Attrib2Attrib3Class11NoSmall55K?12YesMedium80K?13YesLarge110K?14NoSmall95K?15NoLarge67K?10DecisionTree2015/10/22DataMining:ConceptsandTechniques18Hunt算法LetDtbethesetoftrainingrecordsthatreachanodetGeneralProcedure:IfDtcontainsrecordsthatbelongthesameclassyt,thentisaleafnodelabeledasytIfDtisanemptyset,thentisaleafnodelabeledbythedefaultclass,ydIfDtcontainsrecordsthatbelongtomorethanoneclass,useanattributetesttosplitthedataintosmallersubsets.Recursivelyapplytheproceduretoeachsubset.TidRefundMaritalStatusTaxableIncomeCheat1YesSingle125KNo2NoMarried100KNo3NoSingle70KNo4YesMarried120KNo5NoDivorced95KYes6NoMarried60KNo7YesDivorced220KNo8NoSingle85KYes9NoMarried75KNo10NoSingle90KYes10Dt?2015/10/22DataMining:ConceptsandTechniques19Hunt算法Don’tCheatRefundDon’tCheatDon’tCheatYesNoRefundDon’tCheatYesNoMaritalStatusDon’tCheatCheatSingle,DivorcedMarriedTaxableIncomeDon’tCheat80K=80KRefundDon’tCheatYesNoMaritalStatusDon’tCheatCheatSingle,DivorcedMarriedTidRefundMaritalStatusTaxableIncomeCheat1YesSingle125KNo2NoMarried100KNo3NoSingle70KNo4YesMarried120KNo5NoDivorced95KYes6NoMarried60KNo7YesDivorced220KNo8NoSingle85KYes9NoMarried75KNo10NoSingle90KYes102015/10/22DataMining:ConceptsandTechniques20Hunt算法特殊情况处理:1.若第二步创建的子女结点为空,即不存在与这些结点相关联的记录。这时该结点成为叶结点,类标号为其父结点上训练记录中的多数类2.第二步中,若所有记录都具有相同的属性值(除目标属性外),则该结点为叶结点,标号为与该结点相关联的记录中多数类2015/10/22DataMining:ConceptsandTechniques21决策树归纳贪心策略选择昀优化当前结点所包含记录集的属性测试条件分割记录如何分裂训练记录?如何表示属性测试条件如何确定昀佳分割如何停止分裂过程?2015/10/22DataMining:ConceptsandTechniques22决策树归纳贪心策略选择昀优化当前结点所包含记录集的属性测试条件分割记录如