Data Mining: Homework 2

Question 1:
1. a) Compute the Information Gain for Gender, CarType and ShirtSize.
   b) Construct a decision tree with Information Gain.

Answer:
a) The class attribute takes two values, C0 and C1, each with a frequency of 10 tuples, so the expected information needed to classify a tuple in D is

    Info(D) = -(10/20) log2(10/20) - (10/20) log2(10/20) = 1

For a candidate attribute A, Info_A(D) = sum_j (|D_j|/|D|) × Info(D_j) is the expected information after splitting on A, and Gain(A) = Info(D) - Info_A(D).

1. Splitting on Gender: Info_Gender(D) = 0.971, so Gain(Gender) = 1 - 0.971 = 0.029
2. Splitting on CarType: Info_CarType(D) = 0.314, so Gain(CarType) = 1 - 0.314 = 0.686
3. Splitting on ShirtSize: Info_ShirtSize(D) = 0.988, so Gain(ShirtSize) = 1 - 0.988 = 0.012

b) From the results in (a), splitting on CarType gives the largest information gain, so CarType becomes the root of the decision tree:

    CarType?
        family -> ShirtSize?
            small                      -> C0
            medium, large, extra large -> C1
        Sports -> C0
        luxury -> C1

Question 2:
2. (a) Design a multilayer feed-forward neural network (one hidden layer) for the dataset in Q1. Label the nodes in the input and output layers.
(b) Using the neural network obtained above, show the weight values after one iteration of the backpropagation algorithm, given the training instance "(M, Family, Small)". Indicate your initial weight values and biases and the learning rate used.

Answer:
a) [Figure: a feed-forward network with nine input units (nodes 1-9, labelled x11, x12 for Gender; x21, x22, x23 for CarType; x31, x32, x33, x34 for ShirtSize), a hidden layer of two units (nodes 10 and 11), and a single output unit (node 12).]

b) From (a), each input unit represents one attribute value; for the training instance (M, Family, Small) the inputs are:

    Unit      | X11 | X12 | X21    | X22    | X23    | X31   | X32    | X33   | X34
    Attribute | F   | M   | Family | Sports | Luxury | Small | Medium | Large | Extra Large
    Value     | 0   | 1   | 1      | 0      | 0      | 1     | 0      | 0     | 0

Since the initial weights and biases are generated randomly, they are defined here as:

    W1,10  W1,11  W2,10  W2,11  W3,10  W3,11  W4,10  W4,11  W5,10  W5,11
    0.2    0.2    -0.2   -0.1   0.4    0.3    -0.2   -0.1   0.1    -0.1

    W6,10  W6,11  W7,10  W7,11  W8,10  W8,11  W9,10  W9,11  W10,12  W11,12
    0.1    -0.2   -0.4   0.2    0.2    0.2    -0.1   0.3    -0.3    -0.1

    θ10   θ11  θ12
    -0.2  0.2  0.3

Net inputs and outputs, using I_j = sum_i W_i,j O_i + θ_j and O_j = 1 / (1 + e^(-I_j)):

    Unit j | Net input I_j | Output O_j
    10     | 0.1           | 0.52
    11     | 0.2           | 0.55
    12     | 0.089         | 0.48

Error of each unit, using Err_j = O_j (1 - O_j)(T_j - O_j) for the output unit and Err_j = O_j (1 - O_j) sum_k Err_k W_j,k for the hidden units:

    Unit j | Err_j
    10     | 0.0089
    11     | 0.0030
    12     | -0.12

Updated weights and biases:

    W1,10  W1,11  W2,10  W2,11  W3,10  W3,11  W4,10  W4,11  W5,10  W5,11
    0.201  0.198  -0.211 -0.099 0.4    0.308  -0.202 -0.098 0.101  -0.100

    W6,10  W6,11  W7,10  W7,11  W8,10  W8,11  W9,10  W9,11  W10,12  W11,12
    0.092  -0.211 -0.400 0.198  0.201  0.190  -0.110 0.300  -0.304  -0.099

    θ10    θ11    θ12
    -0.287 0.179  0.344

Question 3:
3. a) Suppose the fraction of undergraduate students who smoke is 15% and the fraction of graduate students who smoke is 23%. If one-fifth of the college students are graduate students and the rest are undergraduates, what is the probability that a student who smokes is a graduate student?
b) Given the information in part (a), is a randomly chosen college student more likely to be a graduate or undergraduate student?
c) Suppose 30% of the graduate students live in a dorm but only 10% of the undergraduate students live in a dorm. If a student smokes and lives in the dorm, is he or she more likely to be a graduate or undergraduate student? You can assume independence between students who live in a dorm and those who smoke.

Answer:
a) Define A = {A1, A2}, where A1 denotes undergraduate students and A2 denotes graduate students, and let B denote smoking. From the problem statement:

    P(B|A1) = 15%, P(B|A2) = 23%, P(A1) = 4/5, P(A2) = 1/5

The question asks for P(A2|B). By the law of total probability,

    P(B) = P(B|A1) P(A1) + P(B|A2) P(A2) = 0.15 × 0.8 + 0.23 × 0.2 = 0.166

and by Bayes' theorem,

    P(A2|B) = P(B|A2) P(A2) / P(B) = 0.23 × 0.2 / 0.166 ≈ 0.277

b) From (a), a randomly chosen student who smokes is a graduate student with probability 0.277 and an undergraduate with probability 0.723, so the student is far more likely to be an undergraduate.

c) Let C be the event of living in a dorm. Then P(C|A2) = 30% and P(C|A1) = 10%, so

    P(C) = P(C|A1) P(A1) + P(C|A2) P(A2) = 0.1 × 0.8 + 0.3 × 0.2 = 0.14

Assuming independence between smoking and living in a dorm,

    P(B ∩ C) = P(B) P(C) = 0.166 × 0.14 ≈ 0.023

    P(A2|B ∩ C) = P(B|A2) P(C|A2) P(A2) / P(B ∩ C) = 0.23 × 0.3 × 0.2 / 0.023 = 0.6

and P(A1|B ∩ C) = 1 - 0.6 = 0.4. Therefore such a student is more likely to be a graduate student.

Question 4:
4. Suppose that the data mining task is to cluster the following ten points (with (x, y, z) representing location) into three clusters: A1(4,2,5), A2(10,5,2), A3(5,8,7), B1(1,1,1), B2(2,3,2), B3(3,6,9), C1(11,9,2), C2(1,4,6), C3(9,1,7), C4(5,6,7). The distance function is Euclidean distance. Suppose initially we assign A1, B1, C1 as the center of each cluster, respectively. Use the K-Means algorithm to show only (a) the three cluster centers after the first round of execution and (b) the final three clusters.

Answer:
a) Squared Euclidean distance from each point to each center (squaring is monotone, so it does not change which center is nearest).

Round 1 (centers A1, B1, C1):

    Point | A1(4,2,5) | B1(1,1,1) | C1(11,9,2)
    A2    | 54        | 98        | 17
    A3    | 41        | 101       | 62
    B2    | 14        | 6         | 117
    B3    | 33        | 93        | 122
    C2    | 14        | 34        | 141
    C3    | 30        | 100       | 93
    C4    | 21        | 77        | 70

This yields the three clusters {A1, A3, B3, C2, C3, C4}, {B1, B2} and {C1, A2}, so the three new cluster centers are (4.5, 4.5, 6.83), (1.5, 2, 1.5) and (10.5, 7, 2).

b) Round 2, with the new cluster means (4.5, 4.5, 6.83), (1.5, 2, 1.5), (10.5, 7, 2):

    Point | (4.5,4.5,6.83) | (1.5,2,1.5) | (10.5,7,2)
    A1    | 9.86           | 18.5        | 76.25
    A2    | 53.86          | 81.5        | 4.25
    A3    | 12.53          | 78.5        | 56.25
    B1    | 58.53          | 1.5         | 127.25
    B2    | 31.86          | 1.5         | 88.25
    B3    | 9.19           | 74.5        | 106.25
    C1    | 85.86          | 139.5       | 4.25
    C2    | 13.19          | 24.5        | 115.25
    C3    | 32.53          | 87.5        | 63.25
    C4    | 2.53           | 58.5        | 56.25

The new clusters are again {A1, A3, B3, C2, C3, C4}, {B1, B2}, {C1, A2}. Since they are identical to the clusters after round 1, the assignment no longer changes, and these clusters are the final result.
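The numerical answers in Part I can be re-checked with short Python sketches. First, the entropy arithmetic of Question 1. Since the training table itself is not reproduced above, the per-value class counts passed to split_info below are an assumption, chosen to be consistent with the reported Info_Gender(D) = 0.971:

    from math import log2

    def entropy(counts):
        """Shannon entropy of a class distribution given as raw counts."""
        total = sum(counts)
        return -sum(c / total * log2(c / total) for c in counts if c > 0)

    def split_info(partitions):
        """Info_A(D): weighted entropy of a split; each partition holds
        the (C0, C1) counts for one value of attribute A."""
        total = sum(sum(p) for p in partitions)
        return sum(sum(p) / total * entropy(p) for p in partitions)

    print(entropy([10, 10]))             # Info(D) = 1.0 (10 tuples each of C0, C1)
    # Hypothetical Gender counts M:(6,4), F:(4,6), consistent with 0.971:
    print(split_info([(6, 4), (4, 6)]))  # ~0.971 -> Gain(Gender) = 0.029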
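The single backpropagation iteration of Question 2b can likewise be replayed in code. This is a sketch under stated assumptions: the learning rate is not legible in the answer above, so η = 0.9 is assumed, and the target T = 0 is inferred from the sign of Err12. It reproduces the tabulated net inputs and the hidden-unit outputs, and lets the remaining entries be re-derived:

    import math

    def sigmoid(x):
        return 1 / (1 + math.exp(-x))

    # Encoding of (M, Family, Small): units 1..9 = F, M, Family, Sports,
    # Luxury, Small, Medium, Large, Extra Large.
    x = [0, 1, 1, 0, 0, 1, 0, 0, 0]

    # Initial weights (W_i,10, W_i,11) for input units 1..9, then the
    # hidden-to-output weights W_10,12, W_11,12 and the biases.
    w_hid = [(0.2, 0.2), (-0.2, -0.1), (0.4, 0.3), (-0.2, -0.1), (0.1, -0.1),
             (0.1, -0.2), (-0.4, 0.2), (0.2, 0.2), (-0.1, 0.3)]
    w_out = [-0.3, -0.1]
    theta = {10: -0.2, 11: 0.2, 12: 0.3}
    eta, target = 0.9, 0    # ASSUMPTIONS: rate not stated; T inferred from Err12

    # Forward pass.
    i10 = sum(xi * w[0] for xi, w in zip(x, w_hid)) + theta[10]  # = 0.1
    i11 = sum(xi * w[1] for xi, w in zip(x, w_hid)) + theta[11]  # = 0.2
    o10, o11 = sigmoid(i10), sigmoid(i11)                        # ~0.52, ~0.55
    i12 = w_out[0] * o10 + w_out[1] * o11 + theta[12]            # ~0.089
    o12 = sigmoid(i12)

    # Backward pass: output error, then hidden errors.
    err12 = o12 * (1 - o12) * (target - o12)
    err10 = o10 * (1 - o10) * err12 * w_out[0]
    err11 = o11 * (1 - o11) * err12 * w_out[1]

    # Example updates: W_2,10 <- W_2,10 + eta * Err_10 * O_2, and a bias.
    w2_10 = w_hid[1][0] + eta * err10 * x[1]
    theta12 = theta[12] + eta * err12
    print(i10, i11, i12, err12, err10, err11, w2_10, theta12)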
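The Bayes computations in Question 3 reduce to a few lines. In the sketch below, part (c) compares unnormalized per-class scores (one common reading of the independence assumption, with smoking and dorm residence independent within each group); the shared denominator cancels, and the comparison reproduces the conclusion that a smoking dorm resident is more likely a graduate student:

    p_grad, p_ugrad = 0.2, 0.8
    p_smoke = {"grad": 0.23, "ugrad": 0.15}
    p_dorm = {"grad": 0.30, "ugrad": 0.10}

    # (a) P(grad | smoke) by Bayes' theorem.
    p_b = p_smoke["grad"] * p_grad + p_smoke["ugrad"] * p_ugrad  # = 0.166
    print(p_smoke["grad"] * p_grad / p_b)                        # ~0.277

    # (c) Unnormalized scores; P(smoke, dorm) cancels in the comparison.
    score_grad = p_smoke["grad"] * p_dorm["grad"] * p_grad       # = 0.0138
    score_ugrad = p_smoke["ugrad"] * p_dorm["ugrad"] * p_ugrad   # = 0.0120
    print("graduate" if score_grad > score_ugrad else "undergraduate")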
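Finally, the K-Means rounds of Question 4 can be verified with a short script. A minimal sketch using plain Euclidean distance via math.dist (the tables above list squared distances, which select the same nearest centers):

    from math import dist  # Euclidean distance (Python 3.8+)

    points = {"A1": (4, 2, 5), "A2": (10, 5, 2), "A3": (5, 8, 7),
              "B1": (1, 1, 1), "B2": (2, 3, 2), "B3": (3, 6, 9),
              "C1": (11, 9, 2), "C2": (1, 4, 6), "C3": (9, 1, 7),
              "C4": (5, 6, 7)}
    centers = [points["A1"], points["B1"], points["C1"]]  # initial centers

    def mean(pts):
        return tuple(sum(c) / len(pts) for c in zip(*pts))

    for rnd in range(1, 11):
        # Assignment step: each point joins its nearest center's cluster.
        clusters = [[], [], []]
        for name, p in points.items():
            k = min(range(3), key=lambda i: dist(p, centers[i]))
            clusters[k].append(name)
        # Update step: recompute each center as its cluster mean.
        new_centers = [mean([points[n] for n in c]) for c in clusters]
        print(f"round {rnd}: {clusters} -> {new_centers}")
        if new_centers == centers:  # assignments stable: K-Means converged
            break
        centers = new_centers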
Part II: Lab Question 1

Assume this supermarket would like to promote milk. Use the data in "transactions" as training data to build a decision tree (C5.0 algorithm) model to predict whether the customer would buy milk or not.

1. Build a decision tree using dataset "transactions" that predicts milk as a function of the other fields. Set the "type" of each field to "Flag", set the "direction" of "milk" as "out", set the "type" of COD as "Typeless", select "Expert" and set the "pruning severity" to 65, and set the "minimum records per child branch" to be 95. Hand-in: a figure showing your tree.

2. Use the model (the full tree generated by Clementine in step 1 above) to make a prediction for each of the 20 customers in the "rollout" data to determine whether the customer would buy milk. Hand-in: your prediction for each of the 20 customers.

3. Hand-in: rules for positive (yes) prediction of milk purchase identified from the decision tree (up to the fifth level; the root is considered as level 1). Compare with the rules generated by Apriori in Homework 1, and submit your brief comments on the rules (e.g., pruning effect).

Answer:
1. The generated decision tree (the textual form of the model produced by Clementine):

    juices = 1 [Mode: 1]
        water = 1 [Mode: 1] => 1
        water = 0 [Mode: 0]
            pasta = 1 [Mode: 1] => 1
            pasta = 0 [Mode: 0]
                tomato souce = 1 [Mode: 1] => 1
                tomato souce = 0 [Mode: 0]
                    biscuits = 1 [Mode: 1] => 1
                    biscuits = 0 [Mode: 0] => 0
    juices = 0 [Mode: 0]
        yoghurt = 1 [Mode: 1]
            water = 1 [Mode: 1] => 1
            water = 0 [Mode: 0]
                biscuits = 1 [Mode: 1] => 1
                biscuits = 0 [Mode: 0]
                    brioches = 1 [Mode: 1] => 1
                    brioches = 0 [Mode: 0]
                        beer = 1 [Mode: 1] => 1
                        beer = 0 [Mode: 0] => 0
        yoghurt = 0 [Mode: 0]
            beer = 1 [Mode: 0]
                biscuits = 1 [Mode: 1] => 1
                biscuits = 0 [Mode: 0]
                    rice = 1 [Mode: 1] => 1
                    rice = 0 [Mode: 0]
                        coffee = 1 [Mode: 1]
                            water = 1 [Mode: 1] => 1
                            water = 0 [Mode: 0]
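For step 2, the model is scored against the 20 rollout customers inside Clementine; as a cross-check, the legible branches of the tree above can be re-expressed as a plain Python predicate. This is a sketch, not Clementine output: the field names follow the tree, and any branch cut off in the transcript above (e.g. coffee = 1, water = 0) falls back to the enclosing node's mode, 0:

    def predicts_milk(basket: dict) -> int:
        """Apply the legible branches of the C5.0 tree; `basket` maps
        field names to 0/1 flags. Truncated branches fall back to 0."""
        g = lambda f: basket.get(f, 0)
        if g("juices"):
            if g("water") or g("pasta") or g("tomato souce") or g("biscuits"):
                return 1
            return 0
        if g("yoghurt"):
            if g("water") or g("biscuits") or g("brioches") or g("beer"):
                return 1
            return 0
        if g("beer"):
            if g("biscuits") or g("rice") or (g("coffee") and g("water")):
                return 1
        return 0

    # Example: juices with biscuits but no water/pasta/tomato souce -> yes.
    print(predicts_milk({"juices": 1, "biscuits": 1}))  # -> 1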