厦门大学软件学院1第八讲分类与预测厦门大学软件学院2ClassificationandPredictionWhatisclassification?Whatisprediction?IssuesregardingclassificationandpredictionClassificationbydecisiontreeinductionBayesianclassificationRule-basedclassification(自学)ClassificationbybackpropagationSupportVectorMachines(SVM)(自学)Associativeclassification(自学)Lazylearners(orlearningfromyourneighbors)(自学)Otherclassificationmethods(自学)PredictionAccuracyanderrormeasures(自学)Ensemblemethods(自学)Modelselection(自学)Summary厦门大学软件学院3分类(Classification):预测分类标号(离散值或名词性词)建立一个模型,基于训练集的数据和分类属性的值(类标识)来分类,并在新数据中使用该模型。预测(Prediction):连续值函数上的建模,例如,预测未知或缺失的值典型应用信用度目标市场医疗诊断分类vs.预测厦门大学软件学院4分类的两个步骤模型创建:描述预定的数据类集或概念集。假定每个元组或样本属于一个预定义的类,由一个称为类标号属性(classlabelattribute)的属性确定。用于建模的元组集称为训练集(trainingset)模型表示为:分类规则、判定树或属性公式模型应用:用于分类未来或未知对象评估模型的准确率测试样本的已知标号与根据模型获得的分类结果作比较。准确率定义为正确被模型分类的测试样本的百分比测试集独立于训练集,否则,学习模型倾向于过分适合(over-fitting)数据如果准确率可被接受,使用模型用于分类类标号未知的数据。厦门大学软件学院5ClassificationProcess(1):ModelConstructionTrainingDataNAMERANKYEARSTENUREDMikeAssistantProf3noMaryAssistantProf7yesBillProfessor2yesJimAssociateProf7yesDaveAssistantProf6noAnneAssociateProf3noClassificationAlgorithmsIFrank=‘professor’ORyears6THENtenured=‘yes’Classifier(Model)厦门大学软件学院6ClassificationProcess(2):UsetheModelinPredictionClassifierTestingDataNAMERANKYEARSTENUREDTomAssistantProf2noMerlisaAssociateProf7noGeorgeProfessor5yesJosephAssistantProf7yesUnseenData(Jeff,Professor,4)Tenured?厦门大学软件学院7决策树算法基本算法(贪心算法),树的构建方式:自顶向下递归的各个击破方式树以代表训练样本的单个节点开始如果样本都在同一类中,则节点为叶子节点,并用该类标记否则,选择能够最好地将样本分类的属性(称为测试属性,必须是离散值的)对测试属性的每个已知值,创建一个分支,并据此划分样本递归形成每个划分上的样本决策树厦门大学软件学院8递归划分步骤仅当下列条件之一成立时立即停止给定节点的所有样本属于同一个类没有剩余属性可作进一步划分样本——majorityvotingisemployedforclassifyingtheleaf没有剩余的样本age?student?creditrating?=3040noyesyesyes31..40fairexcellentyesno厦门大学软件学院9AttributeSelectionMeasure:InformationGain(ID3/C4.5)选择具有最高信息增益(informationgain)的属性。pi是D中任意元组属于类Ci的概率,用估计|Ci,D|/|D|Expectedinformation(熵:entropy)对D中元组分类所需的期望信息:Informationneeded(afterusingAtosplitDintovpartitions)toclassifyD:InformationgainedbybranchingonattributeA)(log)(21imiippDInfo)(||||)(1jvjjADIDDDInfo(D)InfoInfo(D)Gain(A)A厦门大学软件学院10AttributeSelection:InformationGainClassP:buys_computer=“yes”ClassN:buys_computer=“no”means“age=30”has5outof14samples,with2yes’esand3no’s.HenceSimilarly,agepiniI(pi,ni)=30230.97131…4040040320.971694.0)2,3(145)0,4(144)3,2(145)(IIIDInfoage048.0)_(151.0)(029.0)(ratingcreditGainstudentGainincomeGain246.0)()()(DInfoDInfoageGainageageincomestudentcredit_ratingbuys_computer=30highnofairno=30highnoexcellentno31…40highnofairyes40mediumnofairyes40lowyesfairyes40lowyesexcellentno31…40lowyesexcellentyes=30mediumnofairno=30lowyesfairyes40mediumyesfairyes=30mediumyesexcellentyes31…40mediumnoexcellentyes31…40highyesfairyes40mediumnoexcellentno)3,2(145I940.0)145(log145)149(log149)5,9()(22IDInfo厦门大学软件学院11ComputingInformation-GainforContinuous-ValueAttributesLetattributeAbeacontinuous-valuedattributeMustdeterminethebestsplitpointforASortthevalueAinincreasingorderTypically,themidpointbetweeneachpairofadjacentvaluesisconsideredasapossiblesplitpoint(ai+ai+1)/2isthemidpointbetweenthevaluesofaiandai+1ThepointwiththeminimumexpectedinformationrequirementforAisselectedasthesplit-pointforASplit:D1isthesetoftuplesinDsatisfyingA≤split-point,andD2isthesetoftuplesinDsatisfyingAsplit-point厦门大学软件学院12GainRatioforAttributeSelection(C4.5)信息增益度量偏向具有许多输出的测试C4.5(asuccessorofID3)usesgainratiotoovercometheproblem(normalizationtoinformationgain)GainRatio(A)=Gain(A)/SplitInfo(A)Ex.gain_ratio(income)=0.029/0.926=0.031Theattributewiththemaximumgainratioisselectedasthesplittingattribute)||||(log||||)(21DDDDDSplitInfojvjjA926.0)144(log144)146(log146)144(log144)(222DSplitInfoA厦门大学软件学院13Gini指标(CART,IBMIntelligentMiner)IfadatasetDcontainsexamplesfromnclasses,giniindex,gini(D)isdefinedaswherepjistherelativefrequencyofclassjinDIfadatasetDissplitonAintotwosubsetsD1andD2,theginiindexgini(D)isdefinedas不纯度降低:Theattributeprovidesthesmallestginisplit(D)(orthelargestreductioninimpurity)ischosentosplitthenode(needtoenumerateallthepossiblesplittingpointsforeachattribute)njpjDgini11)(2)(||||)(||||)(2211DginiDDDginiDDDginiA)()()(DginiDginiAginiA厦门大学软件学院14Giniindex(CART,IBMIntelligentMiner)Ex.Dhas9tuplesinbuys_computer=“yes”and5in“no”SupposetheattributeincomepartitionsDinto10inD1:{low,medium}and4inD2butgini{medium,high}is0.30andthusthebestsinceitisthelowestAllattributesareassumedcontinuous-valuedMayneedothertools,e.g.,clustering,togetthepossiblesplitvaluesCanbemodifiedforcategoricalattributes459.01451491)(22Dgini)(144)(1410)(11},{DGiniDGiniDginimediumlowincome厦门大学软件学院15ComparingAttributeSelectionMeasuresThethreemeasures,ingeneral,returngoodresultsbutInformationgain:biasedtowardsmultivaluedattributesGainratio:tendstopreferunbalancedsplitsinwhichonepartitionismuchsmallerthantheothersGiniindex:biasedtomultivaluedattributeshasdifficultywhen#ofclassesislargetendst