数据挖掘导论Pang-ningTan,MichaelStieinbach,andVipinKumar著PearsonEducationLTD.范明等译人民邮电出版社第4章分类:基本概念、决策树与模型评估引言:预备知识,解决分类问题的一般方法决策树归纳模型的过分拟合评估分类器的性能引言2020年4月2日星期四数据挖掘导论4分类:定义Givenacollectionofrecords(trainingset)Eachrecordcontainsasetofattributes,oneoftheattributesistheclass.Findamodelforclassattributeasafunctionofthevaluesofotherattributes.Goal:previouslyunseenrecordsshouldbeassignedaclassasaccuratelyaspossible.Atestsetisusedtodeterminetheaccuracyofthemodel.Usually,thegivendatasetisdividedintotrainingandtestsets,withtrainingsetusedtobuildthemodelandtestsetusedtovalidateit2020年4月2日星期四数据挖掘导论5分类:解释ApplyModelInductionDeductionLearnModelModelTidAttrib1Attrib2Attrib3Class1YesLarge125KNo2NoMedium100KNo3NoSmall70KNo4YesMedium120KNo5NoLarge95KYes6NoMedium60KNo7YesLarge220KNo8NoSmall85KYes9NoMedium75KNo10NoSmall90KYes10TidAttrib1Attrib2Attrib3Class11NoSmall55K?12YesMedium80K?13YesLarge110K?14NoSmall95K?15NoLarge67K?10TestSetLearningalgorithmTrainingSet2020年4月2日星期四数据挖掘导论6分类任务的例子肿瘤:Predictingtumorcellsasbenignormalignant信用卡交易:Classifyingcreditcardtransactionsaslegitimateorfraudulent蛋白质结构:Classifyingsecondarystructuresofproteinasalpha-helix,beta-sheet,orrandomcoil新闻:Categorizingnewsstoriesasfinance,weather,entertainment,sports,etc2020年4月2日星期四数据挖掘导论7分类:技术DecisionTreebasedMethodsRule-basedMethodsMemorybasedreasoningNeuralNetworksNaïveBayesandBayesianBeliefNetworksSupportVectorMachines4.3决策树归纳2020年4月2日星期四数据挖掘导论92020年4月2日星期四数据挖掘导论10决策树:例子TidRefundMaritalStatusTaxableIncomeCheat1YesSingle125KNo2NoMarried100KNo3NoSingle70KNo4YesMarried120KNo5NoDivorced95KYes6NoMarried60KNo7YesDivorced220KNo8NoSingle85KYes9NoMarried75KNo10NoSingle90KYes10SplittingAttributesTrainingDataModel:DecisionTreeYESNONONOYesNoMarriedSingle,Divorced80K80KRefundMarStTaxInc2020年4月2日星期四数据挖掘导论11决策树树中包含三种结点根结点(rootnode):没有入边,有零条或多条出边内部结点(internalnode):恰有一条入边和两条或多条出边叶结点(leafnode)或终端结点(terminalnode):恰有一条入边,但没有出边2020年4月2日星期四数据挖掘导论12决策树分类任务:应用模型ApplyModelInductionDeductionLearnModelModelTidAttrib1Attrib2Attrib3Class1YesLarge125KNo2NoMedium100KNo3NoSmall70KNo4YesMedium120KNo5NoLarge95KYes6NoMedium60KNo7YesLarge220KNo8NoSmall85KYes9NoMedium75KNo10NoSmall90KYes10TidAttrib1Attrib2Attrib3Class11NoSmall55K?12YesMedium80K?13YesLarge110K?14NoSmall95K?15NoLarge67K?10TestSetTreeInductionalgorithmTrainingSetDecisionTree2020年4月2日星期四数据挖掘导论13决策树:使用模型RefundMaritalStatusTaxableIncomeCheatNoMarried80K?10TestDataStartfromtherootoftree.YESNONONOYesNoMarriedSingle,Divorced80K80KRefundMarStTaxInc2020年4月2日星期四数据挖掘导论14决策树:使用模型RefundMaritalStatusTaxableIncomeCheatNoMarried80K?10TestDataYESNONONOYesNoMarriedSingle,Divorced80K80KRefundMarStTaxInc2020年4月2日星期四数据挖掘导论15决策树:使用模型TestDataYESNONONOYesNoMarriedSingle,Divorced80K80KRefundMarStTaxIncRefundMaritalStatusTaxableIncomeCheatNoMarried80K?102020年4月2日星期四数据挖掘导论16决策树:使用模型TestDataYESNONONOYesNoMarriedSingle,Divorced80K80KRefundMarStTaxIncRefundMaritalStatusTaxableIncomeCheatNoMarried80K?102020年4月2日星期四数据挖掘导论17决策树:使用模型TestDataYESNONONOYesNoMarriedSingle,Divorced80K80KRefundMarStTaxIncRefundMaritalStatusTaxableIncomeCheatNoMarried80K?102020年4月2日星期四数据挖掘导论18决策树:使用模型TestDataYESNONONOYesNoMarriedSingle,Divorced80K80KRefundMarStTaxIncRefundMaritalStatusTaxableIncomeCheatNoMarried80K?10AssignCheatto“No”2020年4月2日星期四数据挖掘导论19决策树分类任务:学习模型ApplyModelInductionDeductionLearnModelModelTidAttrib1Attrib2Attrib3Class1YesLarge125KNo2NoMedium100KNo3NoSmall70KNo4YesMedium120KNo5NoLarge95KYes6NoMedium60KNo7YesLarge220KNo8NoSmall85KYes9NoMedium75KNo10NoSmall90KYes10TidAttrib1Attrib2Attrib3Class11NoSmall55K?12YesMedium80K?13YesLarge110K?14NoSmall95K?15NoLarge67K?10TestSetTreeInductionalgorithmTrainingSetDecisionTree2020年4月2日星期四数据挖掘导论20决策树归纳ManyAlgorithms:Hunt’sAlgorithm(oneoftheearliest)CARTID3,C4.5SLIQ,SPRINT2020年4月2日星期四数据挖掘导论21Hunt算法的一般结构LetDtbethesetoftrainingrecordsthatreachanodetGeneralProcedure:IfDtcontainsrecordsthatbelongthesameclassyt,thentisaleafnodelabeledasytIfDtisanemptyset,thentisaleafnodelabeledbythedefaultclass,ydIfDtcontainsrecordsthatbelongtomorethanoneclass,useanattributetesttosplitthedataintosmallersubsets.Recursivelyapplytheproceduretoeachsubset.2020年4月2日星期四数据挖掘导论22Hunt算法:例Don’tCheatRefundDon’tCheatDon’tCheatYesNoRefundDon’tCheatYesNoMaritalStatusDon’tCheatCheatSingle,DivorcedMarriedTaxableIncomeDon’tCheat80K=80KRefundDon’tCheatYesNoMaritalStatusDon’tCheatCheatSingle,DivorcedMarriedTidRefundMaritalStatusTaxableIncomeCheat1YesSingle125KNo2NoMarried100KNo3NoSingle70KNo4YesMarried120KNo5NoDivorced95KYes6NoMarried60KNo7YesDivorced220KNo8NoSingle85KYes9NoMarried75KNo10NoSingle90KYes102020年4月2日星期四数据挖掘导论23决策树归纳Greedystrategy.Splittherecordsbasedonanattributetestthatoptimizescertaincriterion.IssuesDeterminehowtosplittherecordsHowtospecifytheattributetestcondition?Howtodeterminethebestsplit?DeterminewhentostopsplittingDiscussabovethreeissuesindetails2020年4月2日星期四数据挖掘导论24如何指定属性测试条件DependsonattributetypesNominalOrdinalContinuousDependsonnumberofwaystosplit2-waysplitMulti-waysplit2020年4月2日星期四数据挖掘导论25划分:标称属性Mu