第九讲分类与预测.

win12
1 ℃
2019-12-19

整理文档很辛苦，赏杯茶钱您下走！

还剩 ... 页未读，继续阅读 >>

免费阅读已结束，点击下载阅读编辑剩下 ... 页

阅读已结束，您可以下载文档离线阅读编辑

资源描述

厦门大学软件学院1第八讲分类与预测厦门大学软件学院2ClassificationandPredictionWhatisclassification?Whatisprediction?IssuesregardingclassificationandpredictionClassificationbydecisiontreeinductionBayesianclassificationRule-basedclassification（自学）ClassificationbybackpropagationSupportVectorMachines(SVM)（自学）Associativeclassification（自学）Lazylearners(orlearningfromyourneighbors)（自学）Otherclassificationmethods（自学）PredictionAccuracyanderrormeasures（自学）Ensemblemethods（自学）Modelselection（自学）Summary厦门大学软件学院3分类(Classification):预测分类标号(离散值或名词性词)建立一个模型，基于训练集的数据和分类属性的值（类标识）来分类，并在新数据中使用该模型。预测(Prediction):连续值函数上的建模,例如，预测未知或缺失的值典型应用信用度目标市场医疗诊断分类vs.预测厦门大学软件学院4分类的两个步骤模型创建:描述预定的数据类集或概念集。假定每个元组或样本属于一个预定义的类，由一个称为类标号属性（classlabelattribute）的属性确定。用于建模的元组集称为训练集(trainingset)模型表示为：分类规则、判定树或属性公式模型应用:用于分类未来或未知对象评估模型的准确率测试样本的已知标号与根据模型获得的分类结果作比较。准确率定义为正确被模型分类的测试样本的百分比测试集独立于训练集，否则，学习模型倾向于过分适合(over-fitting)数据如果准确率可被接受，使用模型用于分类类标号未知的数据。厦门大学软件学院5ClassificationProcess(1):ModelConstructionTrainingDataNAMERANKYEARSTENUREDMikeAssistantProf3noMaryAssistantProf7yesBillProfessor2yesJimAssociateProf7yesDaveAssistantProf6noAnneAssociateProf3noClassificationAlgorithmsIFrank=‘professor’ORyears6THENtenured=‘yes’Classifier(Model)厦门大学软件学院6ClassificationProcess(2):UsetheModelinPredictionClassifierTestingDataNAMERANKYEARSTENUREDTomAssistantProf2noMerlisaAssociateProf7noGeorgeProfessor5yesJosephAssistantProf7yesUnseenData(Jeff,Professor,4)Tenured?厦门大学软件学院7决策树算法基本算法(贪心算法)，树的构建方式：自顶向下递归的各个击破方式树以代表训练样本的单个节点开始如果样本都在同一类中，则节点为叶子节点，并用该类标记否则，选择能够最好地将样本分类的属性(称为测试属性，必须是离散值的)对测试属性的每个已知值，创建一个分支，并据此划分样本递归形成每个划分上的样本决策树厦门大学软件学院8递归划分步骤仅当下列条件之一成立时立即停止给定节点的所有样本属于同一个类没有剩余属性可作进一步划分样本——majorityvotingisemployedforclassifyingtheleaf没有剩余的样本age?student?creditrating?=3040noyesyesyes31..40fairexcellentyesno厦门大学软件学院9AttributeSelectionMeasure:InformationGain(ID3/C4.5)选择具有最高信息增益（informationgain）的属性。pi是D中任意元组属于类Ci的概率，用估计|Ci,D|/|D|Expectedinformation(熵：entropy)对D中元组分类所需的期望信息:Informationneeded(afterusingAtosplitDintovpartitions)toclassifyD:InformationgainedbybranchingonattributeA)(log)(21imiippDInfo)(||||)(1jvjjADIDDDInfo(D)InfoInfo(D)Gain(A)A厦门大学软件学院10AttributeSelection:InformationGainClassP:buys_computer=“yes”ClassN:buys_computer=“no”means“age=30”has5outof14samples,with2yes’esand3no’s.HenceSimilarly,agepiniI(pi,ni)=30230.97131…4040040320.971694.0)2,3(145)0,4(144)3,2(145)(IIIDInfoage048.0)_(151.0)(029.0)(ratingcreditGainstudentGainincomeGain246.0)()()(DInfoDInfoageGainageageincomestudentcredit_ratingbuys_computer=30highnofairno=30highnoexcellentno31…40highnofairyes40mediumnofairyes40lowyesfairyes40lowyesexcellentno31…40lowyesexcellentyes=30mediumnofairno=30lowyesfairyes40mediumyesfairyes=30mediumyesexcellentyes31…40mediumnoexcellentyes31…40highyesfairyes40mediumnoexcellentno)3,2(145I940.0)145(log145)149(log149)5,9()(22IDInfo厦门大学软件学院11ComputingInformation-GainforContinuous-ValueAttributesLetattributeAbeacontinuous-valuedattributeMustdeterminethebestsplitpointforASortthevalueAinincreasingorderTypically,themidpointbetweeneachpairofadjacentvaluesisconsideredasapossiblesplitpoint(ai+ai+1)/2isthemidpointbetweenthevaluesofaiandai+1ThepointwiththeminimumexpectedinformationrequirementforAisselectedasthesplit-pointforASplit:D1isthesetoftuplesinDsatisfyingA≤split-point,andD2isthesetoftuplesinDsatisfyingAsplit-point厦门大学软件学院12GainRatioforAttributeSelection(C4.5)信息增益度量偏向具有许多输出的测试C4.5(asuccessorofID3)usesgainratiotoovercometheproblem(normalizationtoinformationgain)GainRatio(A)=Gain(A)/SplitInfo(A)Ex.gain_ratio(income)=0.029/0.926=0.031Theattributewiththemaximumgainratioisselectedasthesplittingattribute)||||(log||||)(21DDDDDSplitInfojvjjA926.0)144(log144)146(log146)144(log144)(222DSplitInfoA厦门大学软件学院13Gini指标(CART,IBMIntelligentMiner)IfadatasetDcontainsexamplesfromnclasses,giniindex,gini(D)isdefinedaswherepjistherelativefrequencyofclassjinDIfadatasetDissplitonAintotwosubsetsD1andD2,theginiindexgini(D)isdefinedas不纯度降低:Theattributeprovidesthesmallestginisplit(D)(orthelargestreductioninimpurity)ischosentosplitthenode(needtoenumerateallthepossiblesplittingpointsforeachattribute)njpjDgini11)(2)(||||)(||||)(2211DginiDDDginiDDDginiA)()()(DginiDginiAginiA厦门大学软件学院14Giniindex(CART,IBMIntelligentMiner)Ex.Dhas9tuplesinbuys_computer=“yes”and5in“no”SupposetheattributeincomepartitionsDinto10inD1:{low,medium}and4inD2butgini{medium,high}is0.30andthusthebestsinceitisthelowestAllattributesareassumedcontinuous-valuedMayneedothertools,e.g.,clustering,togetthepossiblesplitvaluesCanbemodifiedforcategoricalattributes459.01451491)(22Dgini)(144)(1410)(11},{DGiniDGiniDginimediumlowincome厦门大学软件学院15ComparingAttributeSelectionMeasuresThethreemeasures,ingeneral,returngoodresultsbutInformationgain:biasedtowardsmultivaluedattributesGainratio:tendstopreferunbalancedsplitsinwhichonepartitionismuchsmallerthantheothersGiniindex:biasedtomultivaluedattributeshasdifficultywhen#ofclassesislargetendst