Thursday, August 1, 2019    Data Mining: Concepts and Techniques

Chapter 7: Classification and Prediction

- What is classification? What is prediction?
- Issues regarding classification and prediction
- Classification by decision tree induction
- Bayesian Classification
- Classification by Neural Networks
- Classification by Support Vector Machines (SVM)
- Classification based on concepts from association rule mining
- Other Classification Methods
- Prediction
- Classification accuracy
- Summary

Classification vs. Prediction

- Classification:
  - predicts categorical class labels (discrete or nominal)
  - constructs a model from the training set and the values (class labels) of a classifying attribute, then uses that model to classify new data
- Prediction:
  - models continuous-valued functions, i.e., predicts unknown or missing values
- Typical applications:
  - credit approval
  - target marketing
  - medical diagnosis
  - treatment effectiveness analysis

Classification: A Two-Step Process

- Model construction: describing a set of predetermined classes
  - Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
  - The set of tuples used for model construction is the training set
  - The model is represented as classification rules, decision trees, or mathematical formulae
- Model usage: classifying future or unknown objects
  - Estimate the accuracy of the model
    - The known label of each test sample is compared with the model's classification of that sample
    - The accuracy rate is the percentage of test set samples that the model classifies correctly
    - The test set must be independent of the training set; otherwise the accuracy estimate is overly optimistic and over-fitting goes undetected
  - If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known

Classification Process (1): Model Construction

Training Data:

NAME    RANK             YEARS   TENURED
Mike    Assistant Prof   3       no
Mary    Assistant Prof   7       yes
Bill    Professor        2       yes
Jim     Associate Prof   7       yes
Dave    Assistant Prof   6       no
Anne    Associate Prof   3       no

Training Data -> Classification Algorithms -> Classifier (Model):

IF rank = 'professor' OR years > 6 THEN tenured = 'yes'

Classification Process (2): Use the Model in Prediction

Testing Data (input to the Classifier):

NAME      RANK             YEARS   TENURED
Tom       Assistant Prof   2       no
Merlisa   Associate Prof   7       no
George    Professor        5       yes
Joseph    Assistant Prof   7       yes

Unseen Data: (Jeff, Professor, 4). Tenured?

Supervised vs. Unsupervised Learning

- Supervised learning (classification)
  - Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
  - New data is classified based on the training set
- Unsupervised learning (clustering)
  - The class labels of the training data are unknown
  - Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data

Issues Regarding Classification and Prediction (1): Data Preparation

- Data cleaning
  - Preprocess data in order to reduce noise and handle missing values
- Relevance analysis (feature selection)
  - Remove irrelevant or redundant attributes
- Data transformation
  - Generalize and/or normalize data

Issues Regarding Classification and Prediction (2): Evaluating Classification Methods

- Predictive accuracy
- Speed
  - time to construct the model
  - time to use the model
- Robustness
  - handling noise and missing values
- Scalability
  - efficiency in disk-resident databases
- Interpretability
  - understanding and insight provided by the model
- Goodness of rules
  - decision tree size
  - compactness of classification rules

Training Dataset

age      income   student   credit_rating   buys_computer
<=30     high     no        fair            no
<=30     high     no        excellent       no
31…40    high     no        fair            yes
>40      medium   no        fair            yes
>40      low      yes       fair            yes
>40      low      yes       excellent       no
31…40    low      yes       excellent       yes
<=30     medium   no        fair            no
<=30     low      yes       fair            yes
>40      medium   yes       fair            yes
<=30     medium   yes       excellent       yes
31…40    medium   no        excellent       yes
31…40    high     yes       fair            yes
>40      medium   no        excellent       no

This follows an example from Quinlan's ID3.

Output: A Decision Tree for "buys_computer"

age?
  <=30:   student?
            no:  no
            yes: yes
  31…40:  yes
  >40:    credit rating?
            excellent: no
            fair:      yes

Algorithm for Decision Tree Induction

- Basic algorithm (a greedy algorithm)
  - The tree is constructed in a top-down, recursive, divide-and-conquer manner
  - At the start, all the training examples are at the root
  - Attributes are categorical (if continuous-valued, they are discretized in advance)
  - Examples are partitioned rec