Data Mining: Homework 2

Question 1:
1. a) Compute the Information Gain for Gender, CarType and ShirtSize.
   b) Construct a decision tree with Information Gain.

Answer:
a) The class attribute takes two values, C0 and C1, each with a frequency of 10 tuples, so the expected information needed to classify a tuple in D is

    Info(D) = -(10/20) log2(10/20) - (10/20) log2(10/20) = 1

For a candidate attribute A, Info_A(D) = sum_j (|D_j|/|D|) × Info(D_j) is the expected information after splitting on A, and Gain(A) = Info(D) - Info_A(D).

1. Splitting on Gender: Info_Gender(D) = 0.971, so Gain(Gender) = 1 - 0.971 = 0.029
2. Splitting on CarType: Info_CarType(D) = 0.314, so Gain(CarType) = 1 - 0.314 = 0.686
3. Splitting on ShirtSize: Info_ShirtSize(D) = 0.988, so Gain(ShirtSize) = 1 - 0.988 = 0.012

b) From the results in (a), splitting on CarType gives the largest information gain, so CarType becomes the root of the decision tree:

    CarType?
        family -> ShirtSize?
            small                      -> C0
            medium, large, extra large -> C1
        Sports -> C0
        luxury -> C1

Question 2:
2. (a) Design a multilayer feed-forward neural network (one hidden layer) for the dataset in Q1. Label the nodes in the input and output layers.
(b) Using the neural network obtained above, show the weight values after one iteration of the backpropagation algorithm, given the training instance "(M, Family, Small)". Indicate your initial weight values and biases and the learning rate used.

Answer:
a) [Figure: a feed-forward network with nine input units (nodes 1-9, labelled x11, x12 for Gender; x21, x22, x23 for CarType; x31, x32, x33, x34 for ShirtSize), a hidden layer of two units (nodes 10 and 11), and a single output unit (node 12).]

b) From (a), each input unit represents one attribute value; for the training instance (M, Family, Small) the inputs are:

    Unit      | X11 | X12 | X21    | X22    | X23    | X31   | X32    | X33   | X34
    Attribute | F   | M   | Family | Sports | Luxury | Small | Medium | Large | Extra Large
    Value     | 0   | 1   | 1      | 0      | 0      | 1     | 0      | 0     | 0

Since the initial weights and biases are generated randomly, they are defined here as:

    W1,10  W1,11  W2,10  W2,11  W3,10  W3,11  W4,10  W4,11  W5,10  W5,11
    0.2    0.2    -0.2   -0.1   0.4    0.3    -0.2   -0.1   0.1    -0.1

    W6,10  W6,11  W7,10  W7,11  W8,10  W8,11  W9,10  W9,11  W10,12  W11,12
    0.1    -0.2   -0.4   0.2    0.2    0.2    -0.1   0.3    -0.3    -0.1

    θ10   θ11  θ12
    -0.2  0.2  0.3

Net inputs and outputs, using I_j = sum_i W_i,j O_i + θ_j and O_j = 1 / (1 + e^(-I_j)):

    Unit j | Net input I_j | Output O_j
    10     | 0.1           | 0.52
    11     | 0.2           | 0.55
    12     | 0.089         | 0.48

Error of each unit, using Err_j = O_j (1 - O_j)(T_j - O_j) for the output unit and Err_j = O_j (1 - O_j) sum_k Err_k W_j,k for the hidden units:

    Unit j | Err_j
    10     | 0.0089
    11     | 0.0030
    12     | -0.12

Updated weights and biases:

    W1,10  W1,11  W2,10  W2,11  W3,10  W3,11  W4,10  W4,11  W5,10  W5,11
    0.201  0.198  -0.211 -0.099 0.4    0.308  -0.202 -0.098 0.101  -0.100

    W6,10  W6,11  W7,10  W7,11  W8,10  W8,11  W9,10  W9,11  W10,12  W11,12
    0.092  -0.211 -0.400 0.198  0.201  0.190  -0.110 0.300  -0.304  -0.099

    θ10    θ11    θ12
    -0.287 0.179  0.344

Question 3:
3. a) Suppose the fraction of undergraduate students who smoke is 15% and the fraction of graduate students who smoke is 23%. If one-fifth of the college students are graduate students and the rest are undergraduates, what is the probability that a student who smokes is a graduate student?
b) Given the information in part (a), is a randomly chosen college student more likely to be a graduate or undergraduate student?
c) Suppose 30% of the graduate students live in a dorm but only 10% of the undergraduate students live in a dorm. If a student smokes and lives in the dorm, is he or she more likely to be a graduate or undergraduate student? You can assume independence between students who live in a dorm and those who smoke.

Answer:
a) Define A = {A1, A2}, where A1 denotes undergraduate students and A2 denotes graduate students, and let B denote smoking. From the problem statement:

    P(B|A1) = 15%, P(B|A2) = 23%, P(A1) = 4/5, P(A2) = 1/5

The question asks for P(A2|B). By the law of total probability,

    P(B) = P(B|A1) P(A1) + P(B|A2) P(A2) = 0.15 × 0.8 + 0.23 × 0.2 = 0.166

and by Bayes' theorem,

    P(A2|B) = P(B|A2) P(A2) / P(B) = 0.23 × 0.2 / 0.166 ≈ 0.277

b) From (a), a randomly chosen student who smokes is a graduate student with probability 0.277 and an undergraduate with probability 0.723, so the student is far more likely to be an undergraduate.

c) Let C be the event of living in a dorm. Then P(C|A2) = 30% and P(C|A1) = 10%, so

    P(C) = P(C|A1) P(A1) + P(C|A2) P(A2) = 0.1 × 0.8 + 0.3 × 0.2 = 0.14

Assuming independence between smoking and living in a dorm,

    P(B ∩ C) = P(B) P(C) = 0.166 × 0.14 ≈ 0.023

    P(A2|B ∩ C) = P(B|A2) P(C|A2) P(A2) / P(B ∩ C) = 0.23 × 0.3 × 0.2 / 0.023 = 0.6

and P(A1|B ∩ C) = 1 - 0.6 = 0.4. Therefore such a student is more likely to be a graduate student.

Question 4:
4. Suppose that the data mining task is to cluster the following ten points (with (x, y, z) representing location) into three clusters: A1(4,2,5), A2(10,5,2), A3(5,8,7), B1(1,1,1), B2(2,3,2), B3(3,6,9), C1(11,9,2), C2(1,4,6), C3(9,1,7), C4(5,6,7). The distance function is Euclidean distance. Suppose initially we assign A1, B1, C1 as the center of each cluster, respectively. Use the K-Means algorithm to show only (a) the three cluster centers after the first round of execution and (b) the final three clusters.

Answer:
a) Squared Euclidean distance from each point to each center (squaring is monotone, so it does not change which center is nearest).

Round 1 (centers A1, B1, C1):

    Point | A1(4,2,5) | B1(1,1,1) | C1(11,9,2)
    A2    | 54        | 98        | 17
    A3    | 41        | 101       | 62
    B2    | 14        | 6         | 117
    B3    | 33        | 93        | 122
    C2    | 14        | 34        | 141
    C3    | 30        | 100       | 93
    C4    | 21        | 77        | 70

This yields the three clusters {A1, A3, B3, C2, C3, C4}, {B1, B2} and {C1, A2}, so the three new cluster centers are (4.5, 4.5, 6.83), (1.5, 2, 1.5) and (10.5, 7, 2).

b) Round 2, with the new cluster means (4.5, 4.5, 6.83), (1.5, 2, 1.5), (10.5, 7, 2):

    Point | (4.5,4.5,6.83) | (1.5,2,1.5) | (10.5,7,2)
    A1    | 9.86           | 18.5        | 76.25
    A2    | 53.86          | 81.5        | 4.25
    A3    | 12.53          | 78.5        | 56.25
    B1    | 58.53          | 1.5         | 127.25
    B2    | 31.86          | 1.5         | 88.25
    B3    | 9.19           | 74.5        | 106.25
    C1    | 85.86          | 139.5       | 4.25
    C2    | 13.19          | 24.5        | 115.25
    C3    | 32.53          | 87.5        | 63.25
    C4    | 2.53           | 58.5        | 56.25

The new clusters are again {A1, A3, B3, C2, C3, C4}, {B1, B2}, {C1, A2}. Since they are identical to the clusters after round 1, the assignment no longer changes, and these clusters are the final result.
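The numerical answers in Part I can be re-checked with short Python sketches. First, the entropy arithmetic of Question 1. Since the training table itself is not reproduced above, the per-value class counts passed to split_info below are an assumption, chosen to be consistent with the reported Info_Gender(D) = 0.971:

    from math import log2

    def entropy(counts):
        """Shannon entropy of a class distribution given as raw counts."""
        total = sum(counts)
        return -sum(c / total * log2(c / total) for c in counts if c > 0)

    def split_info(partitions):
        """Info_A(D): weighted entropy of a split; each partition holds
        the (C0, C1) counts for one value of attribute A."""
        total = sum(sum(p) for p in partitions)
        return sum(sum(p) / total * entropy(p) for p in partitions)

    print(entropy([10, 10]))             # Info(D) = 1.0 (10 tuples each of C0, C1)
    # Hypothetical Gender counts M:(6,4), F:(4,6), consistent with 0.971:
    print(split_info([(6, 4), (4, 6)]))  # ~0.971 -> Gain(Gender) = 0.029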
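The single backpropagation iteration of Question 2b can likewise be replayed in code. This is a sketch under stated assumptions: the learning rate is not legible in the answer above, so η = 0.9 is assumed, and the target T = 0 is inferred from the sign of Err12. It reproduces the tabulated net inputs and the hidden-unit outputs, and lets the remaining entries be re-derived:

    import math

    def sigmoid(x):
        return 1 / (1 + math.exp(-x))

    # Encoding of (M, Family, Small): units 1..9 = F, M, Family, Sports,
    # Luxury, Small, Medium, Large, Extra Large.
    x = [0, 1, 1, 0, 0, 1, 0, 0, 0]

    # Initial weights (W_i,10, W_i,11) for input units 1..9, then the
    # hidden-to-output weights W_10,12, W_11,12 and the biases.
    w_hid = [(0.2, 0.2), (-0.2, -0.1), (0.4, 0.3), (-0.2, -0.1), (0.1, -0.1),
             (0.1, -0.2), (-0.4, 0.2), (0.2, 0.2), (-0.1, 0.3)]
    w_out = [-0.3, -0.1]
    theta = {10: -0.2, 11: 0.2, 12: 0.3}
    eta, target = 0.9, 0    # ASSUMPTIONS: rate not stated; T inferred from Err12

    # Forward pass.
    i10 = sum(xi * w[0] for xi, w in zip(x, w_hid)) + theta[10]  # = 0.1
    i11 = sum(xi * w[1] for xi, w in zip(x, w_hid)) + theta[11]  # = 0.2
    o10, o11 = sigmoid(i10), sigmoid(i11)                        # ~0.52, ~0.55
    i12 = w_out[0] * o10 + w_out[1] * o11 + theta[12]            # ~0.089
    o12 = sigmoid(i12)

    # Backward pass: output error, then hidden errors.
    err12 = o12 * (1 - o12) * (target - o12)
    err10 = o10 * (1 - o10) * err12 * w_out[0]
    err11 = o11 * (1 - o11) * err12 * w_out[1]

    # Example updates: W_2,10 <- W_2,10 + eta * Err_10 * O_2, and a bias.
    w2_10 = w_hid[1][0] + eta * err10 * x[1]
    theta12 = theta[12] + eta * err12
    print(i10, i11, i12, err12, err10, err11, w2_10, theta12)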
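The Bayes computations in Question 3 reduce to a few lines. In the sketch below, part (c) compares unnormalized per-class scores (one common reading of the independence assumption, with smoking and dorm residence independent within each group); the shared denominator cancels, and the comparison reproduces the conclusion that a smoking dorm resident is more likely a graduate student:

    p_grad, p_ugrad = 0.2, 0.8
    p_smoke = {"grad": 0.23, "ugrad": 0.15}
    p_dorm = {"grad": 0.30, "ugrad": 0.10}

    # (a) P(grad | smoke) by Bayes' theorem.
    p_b = p_smoke["grad"] * p_grad + p_smoke["ugrad"] * p_ugrad  # = 0.166
    print(p_smoke["grad"] * p_grad / p_b)                        # ~0.277

    # (c) Unnormalized scores; P(smoke, dorm) cancels in the comparison.
    score_grad = p_smoke["grad"] * p_dorm["grad"] * p_grad       # = 0.0138
    score_ugrad = p_smoke["ugrad"] * p_dorm["ugrad"] * p_ugrad   # = 0.0120
    print("graduate" if score_grad > score_ugrad else "undergraduate")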
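Finally, the K-Means rounds of Question 4 can be verified with a short script. A minimal sketch using plain Euclidean distance via math.dist (the tables above list squared distances, which select the same nearest centers):

    from math import dist  # Euclidean distance (Python 3.8+)

    points = {"A1": (4, 2, 5), "A2": (10, 5, 2), "A3": (5, 8, 7),
              "B1": (1, 1, 1), "B2": (2, 3, 2), "B3": (3, 6, 9),
              "C1": (11, 9, 2), "C2": (1, 4, 6), "C3": (9, 1, 7),
              "C4": (5, 6, 7)}
    centers = [points["A1"], points["B1"], points["C1"]]  # initial centers

    def mean(pts):
        return tuple(sum(c) / len(pts) for c in zip(*pts))

    for rnd in range(1, 11):
        # Assignment step: each point joins its nearest center's cluster.
        clusters = [[], [], []]
        for name, p in points.items():
            k = min(range(3), key=lambda i: dist(p, centers[i]))
            clusters[k].append(name)
        # Update step: recompute each center as its cluster mean.
        new_centers = [mean([points[n] for n in c]) for c in clusters]
        print(f"round {rnd}: {clusters} -> {new_centers}")
        if new_centers == centers:  # assignments stable: K-Means converged
            break
        centers = new_centers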
Part II: Lab Question 1

Assume this supermarket would like to promote milk. Use the data in "transactions" as training data to build a decision tree (C5.0 algorithm) model to predict whether the customer would buy milk or not.

1. Build a decision tree using dataset "transactions" that predicts milk as a function of the other fields. Set the "type" of each field to "Flag", set the "direction" of "milk" as "out", set the "type" of COD as "Typeless", select "Expert" and set the "pruning severity" to 65, and set the "minimum records per child branch" to be 95. Hand-in: a figure showing your tree.

2. Use the model (the full tree generated by Clementine in step 1 above) to make a prediction for each of the 20 customers in the "rollout" data to determine whether the customer would buy milk. Hand-in: your prediction for each of the 20 customers.

3. Hand-in: rules for positive (yes) prediction of milk purchase identified from the decision tree (up to the fifth level; the root is considered as level 1). Compare with the rules generated by Apriori in Homework 1, and submit your brief comments on the rules (e.g., pruning effect).

Answer:
1. The generated decision tree (the textual form of the model produced by Clementine):

    juices = 1 [Mode: 1]
        water = 1 [Mode: 1] => 1
        water = 0 [Mode: 0]
            pasta = 1 [Mode: 1] => 1
            pasta = 0 [Mode: 0]
                tomato souce = 1 [Mode: 1] => 1
                tomato souce = 0 [Mode: 0]
                    biscuits = 1 [Mode: 1] => 1
                    biscuits = 0 [Mode: 0] => 0
    juices = 0 [Mode: 0]
        yoghurt = 1 [Mode: 1]
            water = 1 [Mode: 1] => 1
            water = 0 [Mode: 0]
                biscuits = 1 [Mode: 1] => 1
                biscuits = 0 [Mode: 0]
                    brioches = 1 [Mode: 1] => 1
                    brioches = 0 [Mode: 0]
                        beer = 1 [Mode: 1] => 1
                        beer = 0 [Mode: 0] => 0
        yoghurt = 0 [Mode: 0]
            beer = 1 [Mode: 0]
                biscuits = 1 [Mode: 1] => 1
                biscuits = 0 [Mode: 0]
                    rice = 1 [Mode: 1] => 1
                    rice = 0 [Mode: 0]
                        coffee = 1 [Mode: 1]
                            water = 1 [Mode: 1] => 1
                            water = 0 [Mode: 0]
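For step 2, the model is scored against the 20 rollout customers inside Clementine; as a cross-check, the legible branches of the tree above can be re-expressed as a plain Python predicate. This is a sketch, not Clementine output: the field names follow the tree, and any branch cut off in the transcript above (e.g. coffee = 1, water = 0) falls back to the enclosing node's mode, 0:

    def predicts_milk(basket: dict) -> int:
        """Apply the legible branches of the C5.0 tree; `basket` maps
        field names to 0/1 flags. Truncated branches fall back to 0."""
        g = lambda f: basket.get(f, 0)
        if g("juices"):
            if g("water") or g("pasta") or g("tomato souce") or g("biscuits"):
                return 1
            return 0
        if g("yoghurt"):
            if g("water") or g("biscuits") or g("brioches") or g("beer"):
                return 1
            return 0
        if g("beer"):
            if g("biscuits") or g("rice") or (g("coffee") and g("water")):
                return 1
        return 0

    # Example: juices with biscuits but no water/pasta/tomato souce -> yes.
    print(predicts_milk({"juices": 1, "biscuits": 1}))  # -> 1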