Big Data and Data Mining Training Lecture Notes 3: Concepts, Attributes, and Instances

Input: Concepts, Attributes, Instances

Module Outline
  Terminology
  What’s a concept?
    Classification, association, clustering, numeric prediction
  What’s in an example?
    Relations, flat files, recursion
  What’s in an attribute?
    Nominal, ordinal, interval, ratio
  Preparing the input
    ARFF, attributes, missing values, getting to know data

witten&eibe

Terminology
  Components of the input:
    Concepts: kinds of things that can be learned
      Aim: intelligible and operational concept description
    Instances: the individual, independent examples of a concept
      Note: more complicated forms of input are possible
    Attributes: measuring aspects of an instance
      We will focus on nominal and numeric ones

What’s a concept?
  Data Mining Tasks (Styles of learning):
    Classification learning: predicting a discrete class
    Association learning: detecting associations between features
    Clustering: grouping similar instances into clusters
    Numeric prediction: predicting a numeric quantity
  Concept: thing to be learned
  Concept description: output of learning scheme

Classification learning
  Example problems: attrition prediction, using DNA data for diagnosis, weather data to predict play/not play
  Classification learning is supervised
    Scheme is being provided with actual outcome
  Outcome is called the class of the example
  Success can be measured on fresh data for which class labels are known (test data)
  In practice success is often measured subjectively

Association learning
  Examples: supermarket basket analysis - what items are bought together (e.g. milk + cereal, chips + salsa)
  Can be applied if no class is specified and any kind of structure is considered “interesting”
  Difference with classification learning:
    Can predict any attribute’s value, not just the class, and more than one attribute’s value at a time
    Hence: far more association rules than classification rules
    Thus: constraints are necessary
      Minimum coverage and minimum accuracy

Clustering
  Examples: customer grouping
  Finding groups of items that are similar
  Clustering is unsupervised
    The class of an example is not known
  Success often measured subjectively

       Sepal length  Sepal width  Petal length  Petal width  Type
  1    5.1           3.5          1.4           0.2          Iris setosa
  2    4.9           3.0          1.4           0.2          Iris setosa
  …
  51   7.0           3.2          4.7           1.4          Iris versicolor
  52   6.4           3.2          4.5           1.5          Iris versicolor
  …
  101  6.3           3.3          6.0           2.5          Iris virginica
  102  5.8           2.7          5.1           1.9          Iris virginica
  …

Numeric prediction
  Classification learning, but “class” is numeric
  Learning is supervised
    Scheme is being provided with target value
  Measure success on test data

  Outlook   Temperature  Humidity  Windy  Play-time
  Sunny     Hot          High      False  5
  Sunny     Hot          High      True   0
  Overcast  Hot          High      False  55
  Rainy     Mild         Normal    False  40
  …         …            …         …      …

What’s in an example?
  Instance: specific type of example
    Thing to be classified, associated, or clustered
    Individual, independent example of target concept
    Characterized by a predetermined set of attributes
  Input to learning scheme: set of instances/dataset
    Represented as a single relation/flat file
  Rather restricted form of input
    No relationships between objects
  Most common form in practical data mining

A family tree
  Peter (M) = Peggy (F): children Steven (M), Graham (M), Pam (F)
  Grace (F) = Ray (M): children Ian (M), Pippa (F), Brian (M)
  Pam (F) = Ian (M): children Anna (F), Nikki (F)

Family tree represented as a table

  Name    Gender  Parent1  Parent2
  Peter   Male    ?        ?
  Peggy   Female  ?        ?
  Steven  Male    Peter    Peggy
  Graham  Male    Peter    Peggy
  Pam     Female  Peter    Peggy
  Ian     Male    Grace    Ray
  Pippa   Female  Grace    Ray
  Brian   Male    Grace    Ray
  Anna    Female  Pam      Ian
  Nikki   Female  Pam      Ian

The “sister-of” relation

  First person  Second person  Sister of?
  Peter         Peggy          No
  Peter         Steven         No
  …             …              …
  Steven        Peter          No
  Steven        Graham         No
  Steven        Pam            Yes
  …             …              …
  Ian           Pippa          Yes
  …             …              …
  Anna          Nikki          Yes
  …             …              …
  Nikki         Anna           Yes

  First person  Second person  Sister of?
  Steven        Pam            Yes
  Graham        Pam            Yes
  Ian           Pippa          Yes
  Brian         Pippa          Yes
  Anna          Nikki          Yes
  Nikki         Anna           Yes
  All the rest                 No

  Closed-world assumption

A full representation in one table

  First person                      Second person                     Sister of?
  Name    Gender  Parent1  Parent2  Name    Gender  Parent1  Parent2
  Steven  Male    Peter    Peggy    Pam     Female  Peter    Peggy    Yes
  Graham  Male    Peter    Peggy    Pam     Female  Peter    Peggy    Yes
  Ian     Male    Grace    Ray      Pippa   Female  Grace    Ray      Yes
  Brian   Male    Grace    Ray      Pippa   Female  Grace    Ray      Yes
  Anna    Female  Pam      Ian      Nikki   Female  Pam      Ian      Yes
  Nikki   Female  Pam      Ian      Anna    Female  Pam      Ian      Yes
  All the rest                                                        No

  If second person’s gender = female and first person’s parent = second person’s parent then sister-of = yes

Generating a flat file
  Process of flattening a file is called “denormalization”
    Several relations are joined together to make one
  Possible with any finite set of finite relations
  Problematic: relationships without pre-specified number of objects
    Example: concept of nuclear-family
  Denormalization may produce spurious regularities that reflect structure of database
    Example: “supplier” predicts “supplier address”

What’s in an attribute?
  Each instance is described by a fixed predefined set of features, its “attributes”
  But: number of attributes may vary in practice
    Possible solution: “irrelevant value” flag
  Related problem: existence of an attribute may depend on value of another one
  Possible attribute types (“levels of measurement”):
    Nominal, ordinal, interval and ratio

Nominal quantities
  Values are distinct symbols
    Values themselves serve only as labels or names
    Nominal comes from the Latin word for name
  Example: attribute “outlook” from weather data
    Values: “sunny”, “overcast”, and “rainy”
  No relation is implied among nominal values (no ordering or distance measure)
  Only equality tests can be performed

Ordinal quantities
  Impose order on values
  But: no distance between values defined
  Example: attribute “temperature” in weather data
    Values: “hot” “mild” “co
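
The flattening of the family tree into a single "sister-of" table, covered earlier in these notes, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names PEOPLE, is_sister_of, and denormalize are hypothetical and do not come from the slides. The exclusion of self-pairs and of people with unknown parents is also an added assumption, needed so that the rule from the notes reproduces the table (for example, Peter and Peggy must come out as "No").

```python
# Family tree as one relation: name -> (gender, parent1, parent2).
# None stands for an unknown parent ("?" in the table). Hypothetical layout.
PEOPLE = {
    "Peter": ("Male", None, None),
    "Peggy": ("Female", None, None),
    "Steven": ("Male", "Peter", "Peggy"),
    "Graham": ("Male", "Peter", "Peggy"),
    "Pam": ("Female", "Peter", "Peggy"),
    "Ian": ("Male", "Grace", "Ray"),
    "Pippa": ("Female", "Grace", "Ray"),
    "Brian": ("Male", "Grace", "Ray"),
    "Anna": ("Female", "Pam", "Ian"),
    "Nikki": ("Female", "Pam", "Ian"),
}


def is_sister_of(first, second):
    """Rule from the notes: second person is female and shares the parents.

    Assumption added here: a person is not their own sister, and unknown
    parents never count as a match.
    """
    _, f_p1, f_p2 = PEOPLE[first]
    s_gender, s_p1, s_p2 = PEOPLE[second]
    if first == second or f_p1 is None or s_p1 is None:
        return False
    return s_gender == "Female" and (f_p1, f_p2) == (s_p1, s_p2)


def denormalize():
    """Self-join the relation and keep one flat row per positive pair.

    Every ordered pair that is not returned is implicitly "No", which is
    the closed-world assumption.
    """
    rows = []
    for first in PEOPLE:
        for second in PEOPLE:
            if is_sister_of(first, second):
                rows.append((first, *PEOPLE[first], second, *PEOPLE[second], "Yes"))
    return rows


for row in denormalize():
    print(row)
```

Each printed row carries both people's attributes side by side, so a learning scheme that accepts only a single flat file can still see the information needed to express the sister-of rule.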
