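Data Mining

[Diagram: data mining process: business understanding, data understanding, data preparation, modeling, evaluation, deployment]

[Diagram: the data mining engine trains a mining model from training data, then applies the mining model to a prediction request to produce prediction results]

[Diagram labels: Cube, Historical Dataset, New Dataset, Data Transform (ETL), Mining Models, Reporting, Model Browsing, Prediction, LOB Application]

[Diagram: platform stack: Analysis Services (OLAP & Data Mining), Integration Services, SQL Server Relational Engine, Reporting Services, Management Tools, Dev Tools (Visual Studio .Net), Excel, OWC, MapPoint, Data Analyzer, Balanced ScoreCard, SharePoint Portal Server, Windows Server, Windows Client]

    CREATE MINING MODEL CreditRisk
    (
        CustID      LONG  KEY,
        Gender      TEXT  DISCRETE,
        Income      LONG  CONTINUOUS,
        Profession  TEXT  DISCRETE,
        Risk        TEXT  DISCRETE PREDICT
    )
    USING Microsoft_Decision_Trees

    // Slide shorthand: a real DMX statement wraps the source query
    // in OPENQUERY or OPENROWSET.
    INSERT INTO CreditRisk (CustID, Gender, Income, Profession, Risk)
    SELECT CustomerID, Gender, Income, Profession, Risk
    FROM Customers

    // Slide shorthand: NewCustomers stands for the source rowset to score.
    SELECT NewCustomers.CustomerID,
           CreditRisk.Risk,
           PredictProbability(CreditRisk.Risk)
    FROM CreditRisk
    PREDICTION JOIN NewCustomers
        ON  CreditRisk.Gender     = NewCustomers.Gender
        AND CreditRisk.Income     = NewCustomers.Income
        AND CreditRisk.Profession = NewCustomers.Profession

Decision Trees, Clustering, Time Series, Sequence Clustering, Association, Naïve Bayes, Neural Network, Logistic Regression, Linear Regression, Text Mining

• Known
  – Gender
  – Age
  – Commute distance
  – Income
  – Number of cars
  – Number of children
  – Customer type ("good", "bad")
• Predict
  – Potential customers

• Naive Bayes
• Decision Trees
• Neural Networks
• Clustering
• ……

Decision tree principle: who are our good customers?

[Figure: scatter plot of customers by age and monthly salary (thousand yuan). Root node: good customer 55% Y / 45% N.]

[Figure: the same plot after a split on age (branch labels 35+, 35-). Nodes: root 55% Y / 45% N; children 73% Y / 27% N and 33% Y / 67% N.]

[Figure: the same plot after further splits on monthly salary (branch labels 5+, 5-, 2+, 2-). Nodes: root 55% Y / 45% N; age-level nodes 73% Y / 27% N and 33% Y / 67% N; lower-level nodes 87% Y / 13% N, 33% Y / 67% N, 17% Y / 83% N, 67% Y / 33% N.]

• Naive Bayes, Neural Networks, Clustering ……
• More parameters can be set ……
• Challenge: how do we judge which algorithm fits best?
• Lift Chart
• Profit Chart
• Classification Matrix

    SELECT FLATTENED
        t.[CustomerKey],
        [TMDecisionTree].[BikeBuyer],
        (PredictProbability([TMDecisionTree].[BikeBuyer])) AS [Prob]
    FROM [TMDecisionTree]
    PREDICTION JOIN @InputRowset AS t
        ON  [TMDecisionTree].[MaritalStatus]        = t.[MaritalStatus]
        AND [TMDecisionTree].[Gender]               = t.[Gender]
        AND [TMDecisionTree].[YearlyIncome]         = t.[YearlyIncome]
        AND [TMDecisionTree].[TotalChildren]        = t.[TotalChildren]
        AND [TMDecisionTree].[NumberChildrenAtHome] = t.[NumberChildrenAtHome]
        AND [TMDecisionTree].[HouseOwnerFlag]       = t.[HouseOwnerFlag]
        AND [TMDecisionTree].[NumberCarsOwned]      = t.[NumberCarsOwned]
        AND [TMDecisionTree].[CommuteDistance]      = t.[CommuteDistance]
        AND [TMDecisionTree].[Region]               = t.[Region]
        AND [TMDecisionTree].[Age]                  = t.[Age]

• Predict the future from the past
• Business scenarios with some time periodicity
• The Microsoft Time Series algorithm provides regression algorithms optimized for forecasting continuous values, such as product sales over a period of time.

• Video store case
• Membership-based video store
• Member survey
• "Who" bought "which movie"

• Quickly find association rules between products in historical data
• Can handle very large data volumes
• Rules include
  – One-to-one (probability of A → B)
  – Many-to-one (probability of A, B → C)
• Find itemsets that frequently occur together
• Draw the association network

CustID   Gender   MaritalStatus   Education      HomeOwnership
980001   Male     Married         Bachelors      Rent
980002   Male     Married         Bachelors      Own
980003   Female   Single          Masters        Own
980004   Male     Single          Some College   Own
980005   Female   Married         Bachelors      Rent
980006   Female   Married         Masters        Rent

CustID   Movie
980001   Lord of the Rings
980001   Matrix
980002   Star Trek
980002   Terminator
980002   Star Wars
980003   E.T
980004   Star Wars
980004   Sixth Sense
980004   A Beautiful Mind
980005   Hours
980005   Signs
980006   Moulin Rouge
980006   Die Hard
980006   Apocalypse Now

CustID   Gender   MaritalStatus   Education      HomeOwnership   Movies
980001   Male     Married         Bachelors      Rent            Lord of the Rings, Matrix
980002   Male     Married         Bachelors      Own             Star Trek, Terminator, Star Wars
980003   Female   Single          Masters        Own             E.T
980004   Male     Single          Some College   Own             Star Wars, Sixth Sense, A Beautiful Mind
980005   Female   Married         Bachelors      Rent            Hours, Signs
980006   Female   Married         Masters        Rent            Moulin Rouge, Die Hard, Apocalypse Now

    using Microsoft.AnalysisServices.AdomdClient;

    // Connect to the Analysis Services instance that hosts the MovieSample database.
    AdomdConnection conn = new AdomdConnection(
        "Data Source=localhost\\sql2005;Catalog=MovieSample;Integrated Security=SSPI");
    conn.Open();

    // generateDMX() returns the text of the DMX query to run.
    AdomdCommand cmd = conn.CreateCommand();
    cmd.CommandText = generateDMX();

    // Add the first column of every result row to the suggestion list box.
    AdomdDataReader dr = cmd.ExecuteReader();
    while (dr.Read())
    {
        suggestListBox.Items.Add(dr.GetString(0));
    }
    dr.Close();
    conn.Close();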
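
The one-to-one rules from the video store case (the probability of A → B) can be illustrated with a small Python sketch over the (CustID, Movie) rows from the slides. This is a brute-force count for illustration only, not the Microsoft Association algorithm; the names `PURCHASES`, `build_baskets`, and `one_to_one_rules` are hypothetical and do not come from the deck.

```python
from collections import defaultdict
from itertools import permutations

# (CustID, Movie) rows copied from the Movies table in the slides.
PURCHASES = [
    (980001, "Lord of the Rings"), (980001, "Matrix"),
    (980002, "Star Trek"), (980002, "Terminator"), (980002, "Star Wars"),
    (980003, "E.T"),
    (980004, "Star Wars"), (980004, "Sixth Sense"), (980004, "A Beautiful Mind"),
    (980005, "Hours"), (980005, "Signs"),
    (980006, "Moulin Rouge"), (980006, "Die Hard"), (980006, "Apocalypse Now"),
]


def build_baskets(rows):
    """Group movies by customer: one basket (set of titles) per CustID."""
    baskets = defaultdict(set)
    for cust_id, movie in rows:
        baskets[cust_id].add(movie)
    return dict(baskets)


def one_to_one_rules(baskets):
    """Return {(A, B): (support, confidence)} for every observed rule A -> B.

    support    = share of baskets that contain both A and B
    confidence = P(B | A) = count(A and B) / count(A)
    """
    item_count = defaultdict(int)
    pair_count = defaultdict(int)
    for items in baskets.values():
        for item in items:
            item_count[item] += 1
        # Ordered pairs, because A -> B and B -> A are different rules.
        for a, b in permutations(sorted(items), 2):
            pair_count[(a, b)] += 1
    total = len(baskets)
    return {
        (a, b): (n / total, n / item_count[a])
        for (a, b), n in pair_count.items()
    }


rules = one_to_one_rules(build_baskets(PURCHASES))
support, confidence = rules[("Star Wars", "Star Trek")]
print(f"Star Wars -> Star Trek: support={support:.2f}, confidence={confidence:.2f}")
```

On this six-customer sample the rule direction matters: Star Trek → Star Wars has confidence 1.0, because the only Star Trek buyer also bought Star Wars, while Star Wars → Star Trek has confidence 0.5. Many-to-one rules (A, B → C) follow the same counting idea with itemsets on the left-hand side.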