1. A food chain records its weekly transactions in the table below; each transaction lists the items sold in one cash-register sale. Assuming supmin = 20% and confmin = 40%, use the Apriori algorithm to generate the association rules, indicating the candidate sets and the large (frequent) itemsets for each scan of the database.

    Transaction   Items
    T1            bread, jelly, peanut butter
    T2            bread, peanut butter
    T3            bread, milk, peanut butter
    T4            beer, bread
    T5            beer, milk

Solution:

1) Scan the database and count the support of each candidate in C1:

    C1                 support
    {bread}            4/5
    {peanut butter}    3/5
    {milk}             2/5
    {beer}             2/5
    {jelly}            1/5

2) Compare the candidate supports with the minimum support. Every candidate reaches supmin = 20% (= 1/5), so the frequent 1-itemsets are:

    L1                 support
    {bread}            4/5
    {peanut butter}    3/5
    {milk}             2/5
    {beer}             2/5
    {jelly}            1/5

3) Generate the candidate set C2 from L1:

    C2
    {bread, peanut butter}
    {bread, milk}
    {bread, beer}
    {bread, jelly}
    {peanut butter, milk}
    {peanut butter, beer}
    {peanut butter, jelly}
    {milk, beer}
    {milk, jelly}
    {beer, jelly}

4) Scan the database and count the support of each candidate in C2:

    C2                        support
    {bread, peanut butter}    3/5
    {bread, milk}             1/5
    {bread, beer}             1/5
    {bread, jelly}            1/5
    {peanut butter, milk}     1/5
    {peanut butter, beer}     0
    {peanut butter, jelly}    1/5
    {milk, beer}              1/5
    {milk, jelly}             0
    {beer, jelly}             0

5) Compare the candidate supports with the minimum support to obtain the frequent 2-itemsets L2:

    L2                        support
    {bread, peanut butter}    3/5
    {bread, milk}             1/5
    {bread, beer}             1/5
    {bread, jelly}            1/5
    {peanut butter, milk}     1/5
    {peanut butter, jelly}    1/5
    {milk, beer}              1/5

6) Generate the candidate set C3 from L2:

    C3
    {bread, peanut butter, milk}
    {bread, peanut butter, beer}
    {bread, peanut butter, jelly}
    {bread, milk, beer}
    {bread, milk, jelly}
    {bread, beer, jelly}
    {peanut butter, milk, jelly}
    {peanut butter, milk, beer}

7) Scan the database and count the support of each candidate in C3:

    C3                               support
    {bread, peanut butter, milk}     1/5
    {bread, peanut butter, beer}     0
    {bread, peanut butter, jelly}    1/5
    {bread, milk, beer}              0
    {bread, milk, jelly}             0
    {bread, beer, jelly}             0
    {peanut butter, milk, jelly}     0
    {peanut butter, milk, beer}      0

8) Compare the candidate supports with the minimum support to obtain the frequent 3-itemsets L3:

    L3                               support
    {bread, peanut butter, milk}     1/5
    {bread, peanut butter, jelly}    1/5

Joining the two itemsets in L3 gives only {bread, peanut butter, milk, jelly}, which occurs in no transaction, so there are no frequent 4-itemsets and the scans stop here.

Next, derive the association rules.

(1) The proper non-empty subsets of {bread, peanut butter, milk} are {bread, peanut butter}, {bread, milk}, {peanut butter, milk}, {bread}, {peanut butter} and {milk}:

    {bread, peanut butter} -> {milk}    confidence = (1/5)/(3/5) = 33.3%
    {bread, milk} -> {peanut butter}    confidence = (1/5)/(1/5) = 100%
    {peanut butter, milk} -> {bread}    confidence = (1/5)/(1/5) = 100%
    {bread} -> {peanut butter, milk}    confidence = (1/5)/(4/5) = 25%
    {peanut butter} -> {bread, milk}    confidence = (1/5)/(3/5) = 33.3%
    {milk} -> {bread, peanut butter}    confidence = (1/5)/(2/5) = 50%

With confmin = 40%, the strong rules are {bread, milk} -> {peanut butter}, {peanut butter, milk} -> {bread} and {milk} -> {bread, peanut butter}.

(2) The proper non-empty subsets of {bread, peanut butter, jelly} are {bread, peanut butter}, {bread, jelly}, {peanut butter, jelly}, {bread}, {peanut butter} and {jelly}:

    {bread, peanut butter} -> {jelly}    confidence = (1/5)/(3/5) = 33.3%
    {bread, jelly} -> {peanut butter}    confidence = (1/5)/(1/5) = 100%
    {peanut butter, jelly} -> {bread}    confidence = (1/5)/(1/5) = 100%
    {bread} -> {peanut butter, jelly}    confidence = (1/5)/(4/5) = 25%
    {peanut butter} -> {bread, jelly}    confidence = (1/5)/(3/5) = 33.3%
    {jelly} -> {bread, peanut butter}    confidence = (1/5)/(1/5) = 100%

The strong rules are {bread, jelly} -> {peanut butter}, {peanut butter, jelly} -> {bread} and {jelly} -> {bread, peanut butter}.

2. The following shows a history of customers with their incomes, ages and an attribute called "Have_iPhone" indicating whether they have an iPhone. We also indicate whether they will buy an iPad or not in the last column.

    No.  Income  Age    Have_iPhone  Buy_iPad
    1    high    young  yes          yes
    2    high    old    yes          yes
    3    medium  young  no           yes
    4    high    old    no           yes
    5    medium  young  no           no
    6    medium  young  no           no
    7    medium  old    no           no
    8    medium  old    no           no

(a) We want to train a CART decision tree classifier to predict whether a new customer will buy an iPad or not. We define the value of attribute Buy_iPad as the label of a record.

(i) Please find a CART decision tree according to the above example. In the decision tree, whenever we process a node containing at most 3 records, we stop processing this node for splitting.

(ii) Consider a new young customer whose income is medium and who has an iPhone. Please predict whether this new customer will buy an iPad or not.

(b) What is the difference between the C4.5 decision tree and the ID3 decision tree? Why is there a difference?

Solution:

(a)(i) The sample has 4 "yes" and 4 "no" labels, so its expected information is

    Info(D) = -(4/8)·log2(4/8) - (4/8)·log2(4/8) = 1

For the attribute Income:

    Info(high)   = -(3/3)·log2(3/3) - 0 = 0
    Info(medium) = -(1/5)·log2(1/5) - (4/5)·log2(4/5) = 0.72193

The expected information of the split is

    E(Income) = (3/8)×0 + (5/8)×0.72193 = 0.45121

and the information gain is

    Gain(Income) = 1 - E(Income) = 0.54879

The same computation gives Gain(Age) = 0 and Gain(Have_iPhone) = 0.31128. Income has the largest gain of the three attributes, so it is chosen as the splitting attribute at the root, which produces two child nodes. The high branch contains 3 records (all "yes"), so by the stopping rule it becomes a leaf labelled YES. For the other node (the five medium records) we repeat the procedure over the remaining attributes Age and Have_iPhone; the best split is Age. Both Age branches then contain at most 3 records, so splitting stops and each becomes a leaf labelled with its majority class, NO. The resulting tree is:

    Income = high:    YES
    Income = medium:
        Age = young:  NO
        Age = old:    NO

(ii) The new customer has medium income and is young, so we follow the medium branch and then the young branch: the prediction is NO, i.e. this young, medium-income, iPhone-owning customer will not buy an iPad.

(b) C4.5 is similar to ID3 but improves on it. When growing the tree, ID3 performs feature selection with information gain and picks the feature whose gain is largest, whereas C4.5 uses the information gain ratio and picks the feature whose gain ratio is largest. The difference exists because the magnitude of information gain is relative to the training set and has no absolute meaning: when classification is difficult, i.e. when the empirical entropy of the training data is large, information gains come out large, and conversely they come out small; information gain also tends to favour attributes with many distinct values. Using the gain ratio corrects this bias.
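As a cross-check on the counts in Problem 1, here is a minimal Apriori sketch in Python. The transactions use the item names above; the helper name `support` and the unpruned join step (every size-(k+1) union of two frequent k-itemsets is kept as a candidate, which is exactly how C2 and C3 were formed above) are my own choices.

    from itertools import combinations

    transactions = [
        {"bread", "jelly", "peanut butter"},
        {"bread", "peanut butter"},
        {"bread", "milk", "peanut butter"},
        {"beer", "bread"},
        {"beer", "milk"},
    ]
    supmin, confmin = 0.2, 0.4

    def support(itemset):
        # Fraction of transactions containing every item of the itemset.
        return sum(itemset <= t for t in transactions) / len(transactions)

    # Level-wise search: join frequent k-itemsets into (k+1)-candidates.
    Lk = [frozenset([i]) for i in sorted(set().union(*transactions))
          if support(frozenset([i])) >= supmin]
    frequent = list(Lk)
    while Lk:
        k = len(Lk[0]) + 1
        Ck = {a | b for a in Lk for b in Lk if len(a | b) == k}
        Lk = [c for c in Ck if support(c) >= supmin]
        frequent.extend(Lk)

    # Rule generation: split each frequent itemset into antecedent -> rest.
    for itemset in frequent:
        for r in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(itemset, r)):
                conf = support(itemset) / support(lhs)
                if conf >= confmin:
                    print(sorted(lhs), "->", sorted(itemset - lhs),
                          f"confidence = {conf:.1%}")

Note that this enumerates strong rules from every frequent itemset, including the 2-itemsets, whereas the worked solution above lists only the rules derived from L3; the six strong rules found there all appear in the output.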
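The gain figures in Problem 2(a)(i) are easy to get wrong by hand, so the short sketch below recomputes them. The tuple encoding of the records and the `entropy`/`gain` helpers are my own naming; the criterion is the information gain actually used in the solution above (CART proper splits on the Gini index, but for this data both criteria pick the same root attribute).

    from math import log2

    # (Income, Age, Have_iPhone, Buy_iPad) for the eight customers.
    records = [
        ("high",   "young", "yes", "yes"),
        ("high",   "old",   "yes", "yes"),
        ("medium", "young", "no",  "yes"),
        ("high",   "old",   "no",  "yes"),
        ("medium", "young", "no",  "no"),
        ("medium", "young", "no",  "no"),
        ("medium", "old",   "no",  "no"),
        ("medium", "old",   "no",  "no"),
    ]
    ATTRS = {"Income": 0, "Age": 1, "Have_iPhone": 2}

    def entropy(rows):
        # Entropy of the Buy_iPad label (last field) over the given rows.
        ps = [sum(r[-1] == v for r in rows) / len(rows) for v in ("yes", "no")]
        return -sum(p * log2(p) for p in ps if p > 0)

    def gain(rows, attr):
        # Information gain of splitting the given rows on one attribute.
        i = ATTRS[attr]
        parts = [[r for r in rows if r[i] == v] for v in {r[i] for r in rows}]
        return entropy(rows) - sum(len(p) / len(rows) * entropy(p) for p in parts)

    for a in ATTRS:
        print(a, gain(records, a))   # Income ~0.549, Age 0.0, Have_iPhone ~0.311

Running `gain` again on the five medium-income records selects Age, reproducing the tree above.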
3. Consider the following eight two-dimensional data points:

    x1: (23, 12), x2: (6, 6), x3: (15, 0), x4: (15, 28),
    x5: (20, 9), x6: (8, 9), x7: (20, 11), x8: (8, 13)

Consider algorithm k-means. Please answer the following questions. You are required to show the information about each final cluster (including the mean of the cluster and all data points in this cluster). You can consider writing a program for this part but you are not required to submit the program.

(a) If k = 2 and the initial means are (20, 9) and (8, 9), what is the output of the algorithm?

(b) If k = 2 and the initial means are (15, 0) and (15, 29), what is the output of the algorithm?

Solution:

(a) With k = 2 and initial means (20, 9) and (8, 9), the iterations run as follows:

    Iteration 1: M1 = (20, 9), M2 = (8, 9)
        K1 = {(20,9), (23,12), (15,0), (15,28), (20,11)}
        K2 = {(8,9), (6,6), (8,13)}
    Iteration 2: M1 = (18.6, 12), M2 = (7.33, 9.33)
        K1 = {(23,12), (15,28), (20,9), (20,11)}
        K2 = {(15,0), (6,6), (8,9), (8,13)}
    Iteration 3: M1 = (19.5, 15), M2 = (9.25, 7)
        K1 and K2 are unchanged, so the algorithm stops.

The algorithm therefore outputs two clusters:

    K1 = {x1, x4, x5, x7}, mean (19.5, 15)
    K2 = {x2, x3, x6, x8}, mean (9.25, 7)

(b) With k = 2 and initial means (15, 0) and (15, 29):

    Iteration 1: M1 = (15, 0), M2 = (15, 29)
        K1 = {(23,12), (6,6), (15,0), (20,9), (8,9), (20,11), (8,13)}
        K2 = {(15,28)}
    Iteration 2: M1 = (14.3, 8.6), M2 = (15, 28)
        K1 and K2 are unchanged, so the algorithm stops.

The algorithm therefore outputs two clusters:

    K1 = {x1, x2, x3, x5, x6, x7, x8}, mean (14.3, 8.6)
    K2 = {x4}, mean (15, 28)

4. Consider eight data points. The following matrix shows the pairwise distances between any two points.

         1   2   3   4   5   6   7   8
    1    0
    2   11   0
    3    5  13   0
    4   12   2  14   0
    5    7  17   1  18   0
    6   13   4  15   5  20   0
    7    9  15  12  16  15  19   0
    8   11  20  12  21  17  22  30   0

Please use the agglomeration approach to cluster these eight points into two groups/clusters by using distance complete linkage. Please write down all data points for each cluster and write down the distance between the two clusters.

Solution: Under complete linkage, the distance between two clusters is the largest pairwise distance between a point of one and a point of the other. At each step we merge the pair of clusters with the smallest such distance.

Step 1: the smallest entry is d(3, 5) = 1, so merge 3 and 5 into cluster (3,5). Updated matrix:

            1    2  (3,5)   4    6    7    8
    1       0
    2      11    0
    (3,5)   7   17    0
    4      12    2   18    0
    6      13    4   20    5    0
    7       9   15   15   16   19    0
    8      11   20   17   21   22   30    0

Step 2: the smallest entry is d(2, 4) = 2, so merge into cluster (2,4):

            1  (2,4) (3,5)   6    7    8
    1       0
    (2,4)  12    0
    (3,5)   7   18    0
    6      13    5   20    0
    7       9   16   15   19    0
    8      11   21   17   22   30    0

Step 3: the smallest entry is d((2,4), 6) = 5, so merge into cluster (2,4,6):

              1  (2,4,6) (3,5)   7    8
    1         0
    (2,4,6)  13    0
    (3,5)     7   20      0
    7         9   19     15     0
    8        11   22     17    30    0

Step 4: the smallest entry is d(1, (3,5)) = 7, so merge into cluster (1,3,5):

             (1,3,5) (2,4,6)   7    8
    (1,3,5)    0
    (2,4,6)   20       0
    7         15      19       0
    8         17      22      30    0

Step 5: the smallest entry is d((1,3,5), 7) = 15, so merge into cluster (1,3,5,7):

               (1,3,5,7) (2,4,6)   8
    (1,3,5,7)     0
    (2,4,6)      20         0
    8            30        22      0

Step 6: the smallest remaining entry is d((1,3,5,7), (2,4,6)) = 20, so merge into cluster (1,2,3,4,5,6,7). Two clusters now remain, so the clustering stops.

The two clusters are {1, 2, 3, 4, 5, 6, 7} and {8}, and the complete-linkage distance between them is max(11, 20, 12, 21, 17, 22, 30) = 30.
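Problem 3 explicitly invites a program, so here is a minimal Lloyd's-iteration sketch; the function name `kmeans`, squared Euclidean distance, and the convergence test (stop when the assignments no longer change) are my own choices, matching the worked iterations above. It assumes no cluster ever becomes empty, which holds for both initializations here.

    points = [(23, 12), (6, 6), (15, 0), (15, 28),
              (20, 9), (8, 9), (20, 11), (8, 13)]

    def kmeans(points, means):
        clusters = None
        while True:
            # Assignment step: each point joins the cluster of its nearest mean.
            new = [[] for _ in means]
            for p in points:
                d2 = [(p[0] - m[0]) ** 2 + (p[1] - m[1]) ** 2 for m in means]
                new[d2.index(min(d2))].append(p)
            if new == clusters:          # assignments stable -> converged
                return means, clusters
            clusters = new
            # Update step: move each mean to the centroid of its cluster.
            means = [(sum(x for x, _ in c) / len(c),
                      sum(y for _, y in c) / len(c)) for c in clusters]

    for init in [[(20, 9), (8, 9)], [(15, 0), (15, 29)]]:
        means, clusters = kmeans(points, init)
        print(means)
        print(clusters)

For (a) this converges to means (19.5, 15) and (9.25, 7); for (b) to (100/7, 60/7) ≈ (14.3, 8.6) and (15, 28), matching the iterations above.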
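As a check on the merge sequence in Problem 4, here is a short agglomerative sketch; the variable names and printout are my own, and `D` transcribes the full symmetric version of the triangular distance matrix given in the problem.

    D = [
        [ 0, 11,  5, 12,  7, 13,  9, 11],
        [11,  0, 13,  2, 17,  4, 15, 20],
        [ 5, 13,  0, 14,  1, 15, 12, 12],
        [12,  2, 14,  0, 18,  5, 16, 21],
        [ 7, 17,  1, 18,  0, 20, 15, 17],
        [13,  4, 15,  5, 20,  0, 19, 22],
        [ 9, 15, 12, 16, 15, 19,  0, 30],
        [11, 20, 12, 21, 17, 22, 30,  0],
    ]
    clusters = [[i] for i in range(1, 9)]    # 1-based point labels

    def linkage(a, b):
        # Complete linkage: largest pairwise distance between two clusters.
        return max(D[i - 1][j - 1] for i in a for j in b)

    while len(clusters) > 2:
        # Merge the closest pair of clusters under complete linkage.
        d, a, b = min((linkage(a, b), a, b)
                      for k, a in enumerate(clusters) for b in clusters[k + 1:])
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(sorted(a + b))
        print("merged", a, "and", b, "at distance", d)

    print("clusters:", clusters, "distance:", linkage(*clusters))

The printed sequence reproduces the six merges above and ends with the clusters {1, 2, 3, 4, 5, 6, 7} and {8} at distance 30.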