Data Mining Final Exam: Online Test Answers

dragonwzh
2020-03-07

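1. The weekly transaction records of a food chain store are shown in the table below. Each transaction lists the items sold in one checkout. Assume supmin = 20% and confmin = 40%. Use the Apriori algorithm to compute the association rules, and show the candidate sets and large (frequent) itemsets for each database scan.

Transaction   Items
T1            bread, jelly, peanut butter
T2            bread, peanut butter
T3            bread, milk, peanut butter
T4            beer, bread
T5            beer, milk

Solution:

1) Scan the database and compute the support of each candidate.

C1 itemset          Support
{bread}             4/5
{peanut butter}     3/5
{milk}              2/5
{beer}              2/5
{jelly}             1/5

2) Compare each candidate's support with the minimum support to obtain the frequent itemsets L1. Every 1-itemset reaches 20%, so L1 equals C1.

L1 itemset          Support
{bread}             4/5
{peanut butter}     3/5
{milk}              2/5
{beer}              2/5
{jelly}             1/5

3) Generate the candidates C2 from L1, and 4) scan the database to compute the support of each candidate.

C2 itemset                 Support
{bread, peanut butter}     3/5
{bread, milk}              1/5
{bread, beer}              1/5
{bread, jelly}             1/5
{peanut butter, milk}      1/5
{peanut butter, beer}      0
{peanut butter, jelly}     1/5
{milk, beer}               1/5
{milk, jelly}              0
{beer, jelly}              0

5) Compare each candidate's support with the minimum support to obtain the frequent itemsets L2.

L2 itemset                 Support
{bread, peanut butter}     3/5
{bread, milk}              1/5
{bread, beer}              1/5
{bread, jelly}             1/5
{peanut butter, milk}      1/5
{peanut butter, jelly}     1/5
{milk, beer}               1/5

6) Generate the candidates C3 from L2, and 7) scan the database to compute the support of each candidate. The candidates are listed here without the Apriori prune step; discarding every candidate that has an infrequent 2-item subset would leave only {bread, peanut butter, milk}, {bread, peanut butter, jelly} and {bread, milk, beer}.

C3 itemset                        Support
{bread, peanut butter, milk}      1/5
{bread, peanut butter, beer}      0
{bread, peanut butter, jelly}     1/5
{bread, milk, beer}               0
{bread, milk, jelly}              0
{bread, beer, jelly}              0
{peanut butter, milk, jelly}      0
{peanut butter, milk, beer}       0

8) Compare each candidate's support with the minimum support to obtain the frequent itemsets L3.

L3 itemset                        Support
{bread, peanut butter, milk}      1/5
{bread, peanut butter, jelly}     1/5

Next, compute the association rules. The confidence of a rule A => B is support(A ∪ B) / support(A).

(1) The non-empty proper subsets of {bread, peanut butter, milk} are {bread, peanut butter}, {bread, milk}, {peanut butter, milk}, {bread}, {peanut butter}, {milk}.

{bread, peanut butter} => {milk}       confidence = (1/5)/(3/5) = 33.3%
{bread, milk} => {peanut butter}       confidence = (1/5)/(1/5) = 100%
{peanut butter, milk} => {bread}       confidence = (1/5)/(1/5) = 100%
{bread} => {peanut butter, milk}       confidence = (1/5)/(4/5) = 25%
{peanut butter} => {bread, milk}       confidence = (1/5)/(3/5) = 33.3%
{milk} => {bread, peanut butter}       confidence = (1/5)/(2/5) = 50%

The strong association rules are therefore {bread, milk} => {peanut butter}, {peanut butter, milk} => {bread}, and {milk} => {bread, peanut butter}.

(2) The non-empty proper subsets of {bread, peanut butter, jelly} are {bread, peanut butter}, {bread, jelly}, {peanut butter, jelly}, {bread}, {peanut butter}, {jelly}.

{bread, peanut butter} => {jelly}      confidence = (1/5)/(3/5) = 33.3%
{bread, jelly} => {peanut butter}      confidence = (1/5)/(1/5) = 100%
{peanut butter, jelly} => {bread}      confidence = (1/5)/(1/5) = 100%
{bread} => {peanut butter, jelly}      confidence = (1/5)/(4/5) = 25%
{peanut butter} => {bread, jelly}      confidence = (1/5)/(3/5) = 33.3%
{jelly} => {bread, peanut butter}      confidence = (1/5)/(1/5) = 100%

The strong association rules are therefore {bread, jelly} => {peanut butter}, {peanut butter, jelly} => {bread}, and {jelly} => {bread, peanut butter}.

2. The following shows a history of customers with their incomes, ages and an attribute called “Have_iPhone” indicating whether they have an iPhone. We also indicate whether they will buy an iPad or not in the last column.

No.   Income   Age     Have_iPhone   Buy_iPad
1     high     young   yes           yes
2     high     old     yes           yes
3     medium   young   no            yes
4     high     old     no            yes
5     medium   young   no            no
6     medium   young   no            no
7     medium   old     no            no
8     medium   old     no            no

(a) We want to train a CART decision tree classifier to predict whether a new customer will buy an iPad or not. We define the value of attribute Buy_iPad as the label of a record.
(i) Please find a CART decision tree according to the above example. In the decision tree, whenever we process a node containing at most 3 records, we stop splitting this node.
(ii) Consider a new young customer whose income is medium and who has an iPhone. Please predict whether this new customer will buy an iPad or not.
(b) What is the difference between the C4.5 decision tree and the ID3 decision tree? Why is there a difference?

Solution:

(a)(i) The sample has 4 "yes" and 4 "no" records, so its expected information is

$\mathrm{Info}(D) = -\frac{4}{8}\log_2\frac{4}{8} - \frac{4}{8}\log_2\frac{4}{8} = 1$

For the attribute Income (taking $0\log_2 0 = 0$):

$\mathrm{Info}(\text{high}) = -\frac{3}{3}\log_2\frac{3}{3} - \frac{0}{3}\log_2\frac{0}{3} = 0$

$\mathrm{Info}(\text{medium}) = -\frac{1}{5}\log_2\frac{1}{5} - \frac{4}{5}\log_2\frac{4}{5} = 0.72193$

The expected information after splitting on Income is

$E(\text{Income}) = \frac{3}{8}\times 0 + \frac{5}{8}\times 0.72193 = 0.45121$

and the information gain is

$\mathrm{Gain}(\text{Income}) = 1 - E(\text{Income}) = 0.54879$

The same calculation gives

$\mathrm{Gain}(\text{Age}) = 0$ (each age group contains 2 "yes" and 2 "no" records)

$\mathrm{Gain}(\text{Have\_iPhone}) = 0.311$

Income has the largest gain of the three attributes, so it is chosen as the best feature. The root node produces two children. The "high" child is a leaf (all three records are "yes"). The "medium" child holds five records, so the same method is applied again to choose between the remaining attributes A2 (Age) and A3 (Have_iPhone); the result is Age, because Have_iPhone is "no" for all five records and cannot split them. Both children of the Age node hold at most 3 records, so splitting stops and each takes its majority label. The resulting tree is:

Income
  high   -> Yes
  medium -> Age
              young -> No
              old   -> No

(ii) The new customer who is young, has medium income and owns an iPhone will not buy an iPad.

(b) The C4.5 decision tree algorithm is similar to ID3 but improves on it. While building the tree, ID3 selects features by information gain and picks the feature with the largest information gain; C4.5 selects features by information gain ratio and picks the feature with the largest gain ratio. The reason is that the magnitude of information gain is relative to the training data set and has no absolute meaning: when classification is hard, that is, when the empirical entropy of the training data set is large, information gain tends to be large, and otherwise it tends to be small. Using the information gain ratio corrects for this.

3. Consider the following eight two-dimensional data points: x1: (23, 12), x2: (6, 6), x3: (15, 0), x4: (15, 28), x5: (20, 9), x6: (8, 9), x7: (20, 11), x8: (8, 13). Consider algorithm k-means. Please answer the following questions. You are required to show the information about each final cluster (including the mean of the cluster and all data points in this cluster). You can consider writing a program for this part but you are not required to submit the program.
(a) If k = 2 and the initial means are (20, 9) and (8, 9), what is the output of the algorithm?
(b) If k = 2 and the initial means are (15, 0) and (15, 29), what is the output of the algorithm?

Solution:

(a) With K = 2 and initial centroids (20, 9) and (8, 9):

Iteration 1: M1 = (20, 9), M2 = (8, 9)
  K1 = {(20, 9), (23, 12), (15, 0), (15, 28), (20, 11)}
  K2 = {(8, 9), (6, 6), (8, 13)}
Iteration 2: M1 = (18.6, 12), M2 = (7.3, 9.3)
  K1 = {(23, 12), (15, 28), (20, 9), (20, 11)}
  K2 = {(15, 0), (6, 6), (8, 9), (8, 13)}
Iteration 3: M1 = (19.5, 15), M2 = (9.25, 7)
  K1 = {(23, 12), (15, 28), (20, 9), (20, 11)}
  K2 = {(15, 0), (6, 6), (8, 9), (8, 13)}

The assignment no longer changes, so the algorithm outputs two clusters: K1 = {x1, x4, x5, x7} with mean (19.5, 15) and K2 = {x2, x3, x6, x8} with mean (9.25, 7).

(b) With K = 2 and initial centroids (15, 0) and (15, 29):

Iteration 1: M1 = (15, 0), M2 = (15, 29)
  K1 = {(23, 12), (6, 6), (15, 0), (20, 9), (8, 9), (20, 11), (8, 13)}
  K2 = {(15, 28)}
Iteration 2: M1 = (14.3, 8.6), M2 = (15, 28)
  K1 = {(23, 12), (6, 6), (15, 0), (20, 9), (8, 9), (20, 11), (8, 13)}
  K2 = {(15, 28)}

The algorithm therefore outputs two clusters: K1 = {x1, x2, x3, x5, x6, x7, x8} with mean (14.3, 8.6) and K2 = {x4} with mean (15, 28).

4. Consider eight data points. The following matrix shows the pairwise distances between any two points.

      1    2    3    4    5    6    7    8
1     0
2     11   0
3     5    13   0
4     12   2    14   0
5     7    17   1    18   0
6     13   4    15   5    20   0
7     9    15   12   16   15   19   0
8     11   20   12   21   17   22   30   0

Please use the agglomeration approach to cluster these eight points into two groups/clusters by using distance complete linkage. Please write down all data points for each cluster and write down the distance between the two clusters.

Solution: under complete linkage, the distance between two clusters is the largest pairwise distance between their members. At every step the two closest clusters are merged and the matrix is updated.

Step 1: points 3 and 5 are at distance 1, the smallest entry, and are merged into cluster (3, 5).

        1    2    3,5  4    6    7    8
1       0
2       11   0
3,5     7    17   0
4       12   2    18   0
6       13   4    20   5    0
7       9    15   15   16   19   0
8       11   20   17   21   22   30   0

Step 2: points 2 and 4 are at distance 2 and are merged into cluster (2, 4).

        1    2,4  3,5  6    7    8
1       0
2,4     12   0
3,5     7    18   0
6       13   5    20   0
7       9    16   15   19   0
8       11   21   17   22   30   0

Step 3: cluster (2, 4) and point 6 are at distance 5 and are merged into cluster (2, 4, 6).

        1    2,4,6  3,5  7    8
1       0
2,4,6   13   0
3,5     7    20     0
7       9    19     15   0
8       11   22     17   30   0

Step 4: point 1 and cluster (3, 5) are at distance 7 and are merged into cluster (1, 3, 5).

        1,3,5  2,4,6  7    8
1,3,5   0
2,4,6   20     0
7       15     19     0
8       17     22     30   0

Step 5: cluster (1, 3, 5) and point 7 are at distance 15 and are merged into cluster (1, 3, 5, 7).

          1,3,5,7  2,4,6  8
1,3,5,7   0
2,4,6     20       0
8         30       22     0

Step 6: clusters (1, 3, 5, 7) and (2, 4, 6) are at distance 20 and are merged into cluster (1, 2, 3, 4, 5, 6, 7).

                1,2,3,4,5,6,7  8
1,2,3,4,5,6,7   0
8               30             0

The two clusters are {1, 2, 3, 4, 5, 6, 7} and {8}, and the complete-linkage distance between them is 30. Single linkage (smallest pairwise distance) would instead merge at distances 1, 2, 4, 5 and 9 and then meet a tie at 11; merging point 8 into (1, 3, 5, 7) at that tie gives {1, 3, 5, 7, 8} and {2, 4, 6} at distance 11, but that is not the linkage the question asks for.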
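The Apriori exercise on the five food-store transactions can be checked with a short Python sketch. It is only an illustration under stated assumptions: the function names `support`, `frequent_itemsets` and `strong_rules` are made up for this example, item names are written in English, and the candidate step applies the standard prune (dropping candidates that have an infrequent subset), so fewer 3-item candidates are counted than in a listing without pruning. Supports are kept as exact fractions so that 1/5 compares equal to the 20% threshold.

```python
from fractions import Fraction
from itertools import combinations

# The five transactions of the food-store exercise (assumed English item names).
TRANSACTIONS = [
    {"bread", "jelly", "peanut butter"},   # T1
    {"bread", "peanut butter"},            # T2
    {"bread", "milk", "peanut butter"},    # T3
    {"beer", "bread"},                     # T4
    {"beer", "milk"},                      # T5
]


def support(itemset, transactions):
    """Exact fraction of transactions that contain every item of `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return Fraction(hits, len(transactions))


def frequent_itemsets(transactions, min_sup):
    """Level-wise Apriori search; returns {frozenset: support}."""
    level = {frozenset([item]) for t in transactions for item in t}
    found = {}
    k = 1
    while level:
        counted = {c: support(c, transactions) for c in level}
        frequent = {c: s for c, s in counted.items() if s >= min_sup}
        found.update(frequent)
        k += 1
        # Join: unions of two frequent (k-1)-itemsets that have exactly k items.
        joined = {a | b for a in frequent for b in frequent if len(a | b) == k}
        # Prune: keep a candidate only if all its (k-1)-subsets are frequent.
        level = {
            c for c in joined
            if all(frozenset(s) in frequent for s in combinations(c, k - 1))
        }
    return found


def strong_rules(itemset, freq, min_conf):
    """Rules A => (itemset - A) whose confidence reaches `min_conf`."""
    rules = {}
    for size in range(1, len(itemset)):
        for lhs in map(frozenset, combinations(itemset, size)):
            conf = freq[itemset] / freq[lhs]
            if conf >= min_conf:
                rules[(lhs, itemset - lhs)] = conf
    return rules


freq = frequent_itemsets(TRANSACTIONS, Fraction(1, 5))
target = frozenset({"bread", "peanut butter", "milk"})
for (lhs, rhs), conf in strong_rules(target, freq, Fraction(2, 5)).items():
    print(sorted(lhs), "=>", sorted(rhs), float(conf))
```

Calling `strong_rules` on the other frequent 3-itemset, {bread, peanut butter, jelly}, reproduces the second group of rules in the same way.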
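The attribute-selection arithmetic in the iPad decision-tree exercise is easy to get wrong by hand, so here is a minimal Python sketch that recomputes it from the eight customer records. The names `ROWS`, `ATTRS`, `entropy`, `gini` and `gain` are assumptions introduced for this example. The worked answer ranks attributes by entropy-based information gain; CART is more commonly described with the Gini index, so the sketch accepts either impurity measure, and on this data both rank Income first.

```python
from collections import Counter
from math import log2

# (Income, Age, Have_iPhone, Buy_iPad) for customers 1..8.
ROWS = [
    ("high", "young", "yes", "yes"),
    ("high", "old", "yes", "yes"),
    ("medium", "young", "no", "yes"),
    ("high", "old", "no", "yes"),
    ("medium", "young", "no", "no"),
    ("medium", "young", "no", "no"),
    ("medium", "old", "no", "no"),
    ("medium", "old", "no", "no"),
]
ATTRS = {"Income": 0, "Age": 1, "Have_iPhone": 2}


def entropy(rows):
    """Shannon entropy (base 2) of the Buy_iPad label in `rows`."""
    n = len(rows)
    counts = Counter(r[-1] for r in rows)
    return sum(-(c / n) * log2(c / n) for c in counts.values())


def gini(rows):
    """Gini impurity of the Buy_iPad label in `rows`."""
    n = len(rows)
    counts = Counter(r[-1] for r in rows)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def gain(rows, attr, impurity=entropy):
    """Impurity of `rows` minus the weighted impurity after splitting on `attr`."""
    col = ATTRS[attr]
    n = len(rows)
    remainder = 0.0
    for value in {r[col] for r in rows}:
        subset = [r for r in rows if r[col] == value]
        remainder += len(subset) / n * impurity(subset)
    return impurity(rows) - remainder


for name in ATTRS:
    print(name, round(gain(ROWS, name), 5), round(gain(ROWS, name, gini), 5))

# Second level: only the five medium-income records remain.
medium = [r for r in ROWS if r[0] == "medium"]
print({name: round(gain(medium, name), 5) for name in ("Age", "Have_iPhone")})
```

On the medium-income subset, Have_iPhone takes a single value and therefore has zero gain, which is why Age is the only attribute that can split that node.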
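The k-means question explicitly allows writing a program, so the iterations for both sets of initial means can be reproduced with the sketch below. It is a plain implementation of the usual assign-then-update loop with Euclidean distance, not a reference solution: the names `POINTS` and `kmeans` are assumptions for this example, and the code assumes no cluster ever becomes empty, which holds for both starting configurations here.

```python
from math import dist  # Euclidean distance, available from Python 3.8

POINTS = {
    "x1": (23, 12), "x2": (6, 6), "x3": (15, 0), "x4": (15, 28),
    "x5": (20, 9), "x6": (8, 9), "x7": (20, 11), "x8": (8, 13),
}


def kmeans(points, initial_means, max_iter=100):
    """Run k-means until the means stop changing; returns (clusters, means)."""
    means = list(initial_means)
    clusters = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest current mean.
        clusters = [[] for _ in means]
        for name, p in points.items():
            nearest = min(range(len(means)), key=lambda i: dist(p, means[i]))
            clusters[nearest].append(name)
        # Update step: each mean becomes the centroid of its cluster
        # (assumes every cluster is non-empty).
        new_means = [
            tuple(sum(points[n][axis] for n in c) / len(c) for axis in range(2))
            for c in clusters
        ]
        if new_means == means:
            break
        means = new_means
    return clusters, means


clusters_a, means_a = kmeans(POINTS, [(20, 9), (8, 9)])
clusters_b, means_b = kmeans(POINTS, [(15, 0), (15, 29)])
print(clusters_a, means_a)
print(clusters_b, means_b)
```

The two runs end in different partitions, which illustrates how sensitive k-means is to the choice of initial means.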
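For the eight-point agglomerative clustering exercise, the merge sequence depends entirely on which linkage is used, so it helps to recompute it mechanically. The following Python sketch is a naive implementation under stated assumptions: `LOWER`, `D`, `linkage_distance` and `agglomerate` are names invented for this example, the lower-triangular entries are read from the distance matrix in the question, and passing `link=max` gives complete linkage while `link=min` would give single linkage (where ties are broken by iteration order).

```python
from itertools import combinations

# Row i holds the distances from point i to points 1 .. i-1.
LOWER = {
    2: [11],
    3: [5, 13],
    4: [12, 2, 14],
    5: [7, 17, 1, 18],
    6: [13, 4, 15, 5, 20],
    7: [9, 15, 12, 16, 15, 19],
    8: [11, 20, 12, 21, 17, 22, 30],
}
D = {
    frozenset((i, j)): d
    for i, row in LOWER.items()
    for j, d in enumerate(row, start=1)
}


def linkage_distance(a, b, link=max):
    """Distance between clusters a and b: max pair = complete, min pair = single."""
    return link(D[frozenset((i, j))] for i in a for j in b)


def agglomerate(n_points=8, n_clusters=2, link=max):
    """Merge the closest pair of clusters until `n_clusters` remain."""
    clusters = [frozenset([i]) for i in range(1, n_points + 1)]
    merges = []
    while len(clusters) > n_clusters:
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: linkage_distance(pair[0], pair[1], link),
        )
        merges.append((sorted(a), sorted(b), linkage_distance(a, b, link)))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return clusters, merges


clusters, merges = agglomerate(link=max)
for left, right, d in merges:
    print(left, "+", right, "at distance", d)
print([sorted(c) for c in clusters], linkage_distance(*clusters, link=max))
```

Recomputing every cluster distance from the original point distances is quadratic per step, which is fine for eight points; larger inputs would normally use an incremental matrix update instead.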