Introduction to data mining-lecture notes chap06_b

整理文档很辛苦,赏杯茶钱您下走!

免费阅读已结束,点击下载阅读编辑剩下 ...

阅读已结束,您可以下载文档离线阅读编辑

资源描述

DataMiningAssociationAnalysis:BasicConceptsandAlgorithmsLectureNotesforChapter6IntroductiontoDataMiningbyTan,Steinbach,Kumar©Tan,Steinbach,KumarIntroductiontoDataMining4/18/20041©Tan,Steinbach,KumarIntroductiontoDataMining4/18/20042AssociationRuleMiningzGivenasetoftransactions,findrulesthatwillpredicttheoccurrenceofanitembasedontheoccurrencesofotheritemsinthetransactionMarket-BaskettransactionsTIDItems1Bread,Milk2Bread,Diaper,Beer,Eggs3Milk,Diaper,Beer,Coke4Bread,Milk,Diaper,Beer5Bread,Milk,Diaper,CokeExampleofAssociationRules{Diaper}→{Beer},{Milk,Bread}→{Eggs,Coke},{Beer,Bread}→{Milk},Implicationmeansco-occurrence,notcausality!©Tan,Steinbach,KumarIntroductiontoDataMining4/18/20043Definition:FrequentItemsetzItemset–Acollectionofoneormoreitems‹Example:{Milk,Bread,Diaper}–k-itemset‹AnitemsetthatcontainskitemszSupportcount(σ)–Frequencyofoccurrenceofanitemset–E.g.σ({Milk,Bread,Diaper})=2zSupport–Fractionoftransactionsthatcontainanitemset–E.g.s({Milk,Bread,Diaper})=2/5zFrequentItemset–AnitemsetwhosesupportisgreaterthanorequaltoaminsupthresholdTIDItems1Bread,Milk2Bread,Diaper,Beer,Eggs3Milk,Diaper,Beer,Coke4Bread,Milk,Diaper,Beer5Bread,Milk,Diaper,Coke©Tan,Steinbach,KumarIntroductiontoDataMining4/18/20044Definition:AssociationRuleExample:Beer}Diaper,Milk{⇒4.052|T|)BeerDiaper,,Milk(===σs67.032)Diaper,Milk()BeerDiaper,Milk,(===σσczAssociationRule–AnimplicationexpressionoftheformX→Y,whereXandYareitemsets–Example:{Milk,Diaper}→{Beer}zRuleEvaluationMetrics–Support(s)‹FractionoftransactionsthatcontainbothXandY–Confidence(c)‹MeasureshowoftenitemsinYappearintransactionsthatcontainXTIDItems1Bread,Milk2Bread,Diaper,Beer,Eggs3Milk,Diaper,Beer,Coke4Bread,Milk,Diaper,Beer5Bread,Milk,Diaper,Coke©Tan,Steinbach,KumarIntroductiontoDataMining4/18/20045AssociationRuleMiningTaskzGivenasetoftransactionsT,thegoalofassociationruleminingistofindallruleshaving–support≥minsupthreshold–confidence≥minconfthresholdzBrute-forceapproach:–Listallpossibleassociationrules–Computethesupportandconfidenceforeachrule–Prunerulesthatfailtheminsupandminconfthresholds⇒Computationallyprohibitive!©Tan,Steinbach,KumarIntroductiontoDataMining4/18/20046MiningAssociationRulesExampleofRules:{Milk,Diaper}→{Beer}(s=0.4,c=0.67){Milk,Beer}→{Diaper}(s=0.4,c=1.0){Diaper,Beer}→{Milk}(s=0.4,c=0.67){Beer}→{Milk,Diaper}(s=0.4,c=0.67){Diaper}→{Milk,Beer}(s=0.4,c=0.5){Milk}→{Diaper,Beer}(s=0.4,c=0.5)TIDItems1Bread,Milk2Bread,Diaper,Beer,Eggs3Milk,Diaper,Beer,Coke4Bread,Milk,Diaper,Beer5Bread,Milk,Diaper,CokeObservations:•Alltheaboverulesarebinarypartitionsofthesameitemset:{Milk,Diaper,Beer}•Rulesoriginatingfromthesameitemsethaveidenticalsupportbutcanhavedifferentconfidence•Thus,wemaydecouplethesupportandconfidencerequirements©Tan,Steinbach,KumarIntroductiontoDataMining4/18/20047MiningAssociationRuleszTwo-stepapproach:1.FrequentItemsetGeneration–Generateallitemsetswhosesupport≥minsup2.RuleGeneration–Generatehighconfidencerulesfromeachfrequentitemset,whereeachruleisabinarypartitioningofafrequentitemsetzFrequentitemsetgenerationisstillcomputationallyexpensive©Tan,Steinbach,KumarIntroductiontoDataMining4/18/20048FrequentItemsetGenerationnullABACADAEBCBDBECDCEDEABCDEABCABDABEACDACEADEBCDBCEBDECDEABCDABCEABDEACDEBCDEABCDEGivenditems,thereare2dpossiblecandidateitemsets©Tan,Steinbach,KumarIntroductiontoDataMining4/18/20049FrequentItemsetGenerationzBrute-forceapproach:–Eachitemsetinthelatticeisacandidatefrequentitemset–Countthesupportofeachcandidatebyscanningthedatabase–Matcheachtransactionagainsteverycandidate–Complexity~O(NMw)=ExpensivesinceM=2d!!!TIDItems1Bread,Milk2Bread,Diaper,Beer,Eggs3Milk,Diaper,Beer,Coke4Bread,Milk,Diaper,Beer5Bread,Milk,Diaper,CokeTransactions©Tan,Steinbach,KumarIntroductiontoDataMining4/18/200410ComputationalComplexityzGivenduniqueitems:–Totalnumberofitemsets=2d–Totalnumberofpossibleassociationrules:1231111+−=⎥⎦⎤⎢⎣⎡⎟⎠⎞⎜⎝⎛−×⎟⎠⎞⎜⎝⎛=+−=−=∑∑dddkkdjjkdkdRIfd=6,R=602rules©Tan,Steinbach,KumarIntroductiontoDataMining4/18/200411FrequentItemsetGenerationStrategieszReducethenumberofcandidates(M)–Completesearch:M=2d–UsepruningtechniquestoreduceMzReducethenumberoftransactions(N)–ReducesizeofNasthesizeofitemsetincreases–UsedbyDHPandvertical-basedminingalgorithmszReducethenumberofcomparisons(NM)–Useefficientdatastructurestostorethecandidatesortransactions–Noneedtomatcheverycandidateagainsteverytransaction©Tan,Steinbach,KumarIntroductiontoDataMining4/18/200412ReducingNumberofCandidateszAprioriprinciple:–Ifanitemsetisfrequent,thenallofitssubsetsmustalsobefrequentzAprioriprincipleholdsduetothefollowingpropertyofthesupportmeasure:–Supportofanitemsetneverexceedsthesupportofitssubsets–Thisisknownastheanti-monotonepropertyofsupport)()()(:,YsXsYXYX≥⇒⊆∀©Tan,Steinbach,KumarIntroductiontoDataMining4/18/200413FoundtobeInfrequentIllustratingAprioriPrinciplePrunedsupersets©Tan,Steinbach,KumarIntroductiontoDataMining4/18/200414IllustratingAprioriPrincipleItemCountBread4Coke2Milk4Beer3Diaper4Eggs1ItemsetCount{Bread,Milk}3{Bread,Beer}2{Bread,Diaper}3{Milk,Beer}2{Milk,Diaper}3{Beer,Dia

1 / 82
下载文档,编辑使用

©2015-2020 m.777doc.com 三七文档.

备案号:鲁ICP备2024069028号-1 客服联系 QQ:2149211541

×
保存成功