Chapter 8: Standards and Specifications, Tools, and Development Trends (2)


Data Mining, Chapter 8: Standards and Specifications, Tools, and Development Trends

Chapter contents
8.1 Data mining standards and specifications
8.2 Data mining tools
8.3 Research trends in data mining

Basic requirement: understand the standards and specifications relevant to applied data mining, and the directions of future research.

8.1 Data mining standards and specifications

A data mining process model is key to keeping a data mining project on track. Typical process models include:
• The SPSS 5A model: Assess, Access, Analyze, Act, Automate.
• The SAS SEMMA model: Sample, Explore, Modify, Model, Assess.
• CRISP-DM (Cross Industry Standard Process for Data Mining), a cross-industry process standard.
• The Two Crows process model, which has much in common with CRISP-DM (still being established at the time).

Data mining standards

CRISP-DM (Cross Industry Standard Process for Data Mining). SPSS, NCR, and DaimlerChrysler, three companies with extensive data mining experience, founded a consortium to establish a standard for data mining methods and processes.

[Figure: the CRISP-DM cycle. Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, and Deployment are arranged around the data.]

CRISP-DM deliverables by phase
• Project Objectives (Business Understanding): background; requirements, assumptions, constraints; terminology; data mining goals and success criteria; project plan.
• Data Understanding: initial data collection report; data description report; data exploration report; data quality report.
• Data Preparation: data description report; data pre-processing steps.
• Modeling: modeling assumptions; test design; model description; model assessment (including validation).
• Evaluation: assessment of data mining results with respect to objectives.
• Reporting (Deployment): final report, consisting of a summary (objectives, data mining process, data mining results, data mining assessment), conclusions, and future work.

CRISP-DM in brief
• Widely accepted PROCESS MODEL for data mining
• Provides a framework for describing the modeling process in detail
• "BEST PRACTICE"

Business Understanding Phase
Understand the business objectives
• What is the status quo?
  • Understand business processes
  • Associated costs/pain
• Define the success criteria
• Develop a glossary of terms: speak the language
• Cost/benefit analysis
Current Systems Assessment
• Identify the key actors
  • Minimum: the Sponsor and the Key User
• What forms should the output take?
• Integration of output with existing technology landscape
• Understand market norms and standards
Task Decomposition
• Break down the objective into sub-tasks
• Map sub-tasks to data mining problem definitions
Identify Constraints
• Resources
• Law, e.g. data protection
Build a project plan
• List assumptions and risk factors (technical/financial/business/organisational)
8.1数据挖掘标准与规范DataUnderstandingPhaseCollectDataWhatarethedatasources?InternalandExternalSources(e.g.Axiom,Experian)Documentreasonsforinclusion/exclusionsDependonadomainexpertAccessibilityissuesArethereissuesregardingdatadistributionacrossdifferentdatabases/legacysystemsWherearethedisconnects?8.1数据挖掘标准与规范DataUnderstandingPhaseDataDescriptionDocumentdataqualityissuesComputebasicstatisticsDataExplorationSimpleunivariatedataplots/distributionsInvestigateattributeinteractionsDataQualityIssuesMissingValues:UnderstanditssourceStrangeDistributions8.1数据挖掘标准与规范DataPreparationPhaseIntegrateDataJoiningmultipledatatablesSummarisation/aggregationofdataSelectDataAttributesubsetselectionRationaleforInclusion/ExclusionDatasamplingTraining/ValidationandTestsets8.1数据挖掘标准与规范DataPreparationPhaseDataTransformationUsingfunctionssuchaslogFactor/PrincipalComponentsanalysisNormalization/Discretisation/BinarisationCleanDataHandlingmissingvalues/OutliersDataConstructionDerivedAttributes8.1数据挖掘标准与规范TheModelingPhaseBuildModelChooseinitialparametersettingsStudymodelbehaviour:SensitivityanalysisAssessthemodelBewareofover-fittingInvestigatetheerrordistribution:IdentifysegmentsofthestatespacewherethemodelislesseffectiveIterativelyadjustparametersettings8.1数据挖掘标准与规范TheEvaluationPhaseValidateModelHumanevaluationofresultsbydomainexpertsEvaluateusefulnessofresultsfrombusinessperspectiveDefinecontrolgroupsCalculateliftcurvesExpectedReturnonInvestmentReviewProcessDeterminenextstepsPotentialfordeploymentDeploymentarchitectureMetricsforsuccessofdeployment8.1数据挖掘标准与规范PMML(预测模型标记语言,PredictiveModelMarkupLanguage)。数据挖掘应用往往需要多种类型的数据挖掘软件、算法协同运行,这就要求对挖掘出的模型能够很好地继承、复用与集成。DMG(TheDataMiningGroup,DMG)提出PMML语言。PMML最新版本为4.1,支持16种数据挖掘模型,包括:AssociationModel(关联规则)、BaselineModel(基准模型)、ClusteringModel(聚类模型)、GeneralRegressionModel(回归模型)、MiningModel(组合模型)、NaiveBayesModel(朴素贝叶斯)、NearestNeighborModel(最近邻模型)NeuralNetwork(神经网络)、RegressionModel(线性、多项式、对数三种回归模型)、RuleSetModel(规
则集)、SequenceModel(序列模式)、Scorecard、TimeSeriesModel、SupportVectorMachineModel(支持向量机)、TextModel(文本模型)、TreeModel(决策树)8.1数据挖掘标准与规范PMML的模型定义由以下几部分组成:8.1数据挖掘标准与规范TheheaderelementcontainsgeneralinformationaboutthePMMLdocument,suchascopyrightformationforthemodel,itsdescription,andinformationabouttheapplicationusedtogeneratethemodelsuchasnameandversion.8.1数据挖掘标准与规范PMMLversion=3.2...Headercopyright=Copyright(c)2009Togawaredescription=RPartDecisionTreeExtensionname=timestampvalue=2009-02-1506:51:50extender=Rattle/Extensionname=descriptionvalue=iristreeextender=Rattle/Applicationname=Rattle/PMMLversion=1.2.7//HeaderThedatadictionaryrecordsinformationaboutthedatafieldsfromwhichthemodelwasbuilt.8.1数据挖掘标准与规范DataDictionarynumberOfFields=5DataFieldname=Species...Valuevalue=setosa/Valuevalue=versicolor/Valuevalue=virginica/DataFieldname=Sepal.Lengthoptype=continuousdataType=double//DataFieldDataTransformations:transformationsallowforthemappingofuserdataintoamoredesirableformtob
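The header and data dictionary described above are ordinary XML, so they can be read with any XML parser. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`. The `SAMPLE_PMML` string is a trimmed stand-in for the fragment in the text: attributes that the text elides are simply left out, only two data fields are listed, and `numberOfFields` is set to match. The helper names `local_name`, `find_child`, and `summarize_pmml` are illustrative and not part of any PMML library.

```python
import xml.etree.ElementTree as ET

# Trimmed stand-in for the PMML fragment in the text (an assumption for
# illustration): elided attributes are omitted and only two fields are kept.
SAMPLE_PMML = """\
<PMML version="3.2">
  <Header copyright="Copyright (c) 2009 Togaware" description="RPart Decision Tree">
    <Extension name="timestamp" value="2009-02-15 06:51:50" extender="Rattle"/>
    <Extension name="description" value="iris tree" extender="Rattle"/>
    <Application name="Rattle/PMML" version="1.2.7"/>
  </Header>
  <DataDictionary numberOfFields="2">
    <DataField name="Species">
      <Value value="setosa"/>
      <Value value="versicolor"/>
      <Value value="virginica"/>
    </DataField>
    <DataField name="Sepal.Length" optype="continuous" dataType="double"/>
  </DataDictionary>
</PMML>
"""


def local_name(tag):
    """Drop an XML namespace prefix, e.g. '{uri}Header' -> 'Header'."""
    return tag.rsplit("}", 1)[-1]


def find_child(parent, name):
    """Return the first direct child with the given local name, or None."""
    for child in parent:
        if local_name(child.tag) == name:
            return child
    return None


def summarize_pmml(xml_text):
    """Collect header metadata and the data dictionary from a PMML string."""
    root = ET.fromstring(xml_text)
    header = find_child(root, "Header")
    dictionary = find_child(root, "DataDictionary")
    application = find_child(header, "Application") if header is not None else None

    fields = []
    if dictionary is not None:
        for field in dictionary:
            if local_name(field.tag) != "DataField":
                continue
            fields.append({
                "name": field.get("name"),
                "optype": field.get("optype"),      # None when the attribute is absent
                "dataType": field.get("dataType"),  # None when the attribute is absent
                "values": [v.get("value") for v in field
                           if local_name(v.tag) == "Value"],
            })

    return {
        "pmml_version": root.get("version"),
        "description": header.get("description") if header is not None else None,
        "application": application.get("name") if application is not None else None,
        "fields": fields,
    }


summary = summarize_pmml(SAMPLE_PMML)
print(summary["pmml_version"], summary["application"])
for f in summary["fields"]:
    print(f["name"], f["optype"], f["values"])
```

Matching on local tag names rather than fully qualified ones is a deliberate simplification: PMML files produced by real tools usually declare an XML namespace, and stripping it lets the same sketch read both namespaced and bare documents.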
