数据挖掘导论Pang-ningTan,MichaelStieinbach,andVipinKumar著PearsonEducationLTD.范明等译人民邮电出版社第6章关联分析:基本概念和算法6.1问题定义2019年10月20日星期日数据挖掘导论4AssociationRuleMiningGivenasetoftransactions,findrulesthatwillpredicttheoccurrenceofanitemsetbasedontheoccurrencesofotheritemsinthetransactionMarket-BaskettransactionsTIDItems1Bread,Milk2Bread,Diaper,Beer,Eggs3Milk,Diaper,Beer,Coke4Bread,Milk,Diaper,Beer5Bread,Milk,Diaper,CokeExampleofAssociationRules{Diaper}{Beer},{Milk,Bread}{Eggs,Coke},{Beer,Bread}{Milk},Implicationmeansco-occurrence,notcausality!2019年10月20日星期日数据挖掘导论5Definition:FrequentItemsetItemsetAcollectionofoneormoreitemsExample:{Milk,Bread,Diaper}k-itemsetAnitemsetthatcontainskitemsSupportcount()FrequencyofoccurrenceofanitemsetE.g.({Milk,Bread,Diaper})=2SupportFractionoftransactionsthatcontainanitemsetE.g.s({Milk,Bread,Diaper})=2/5FrequentItemsetAnitemsetwhosesupportisgreaterthanorequaltoaminsupthresholdTIDItems1Bread,Milk2Bread,Diaper,Beer,Eggs3Milk,Diaper,Beer,Coke4Bread,Milk,Diaper,Beer5Bread,Milk,Diaper,Coke2019年10月20日星期日数据挖掘导论6Definition:AssociationRuleAssociationRuleAnimplicationexpressionoftheformXY,whereXandYareitemsetsExample:{Milk,Diaper}{Beer}RuleEvaluationMetricsSupport(s)FractionoftransactionsthatcontainbothXandYConfidence(c)MeasureshowoftenitemsinYappearintransactionsthatcontainXTIDItems1Bread,Milk2Bread,Diaper,Beer,Eggs3Milk,Diaper,Beer,Coke4Bread,Milk,Diaper,Beer5Bread,Milk,Diaper,CokeExample:Beer}Diaper,Milk{4.052|T|)BeerDiaper,,Milk(s67.032)Diaper,Milk()BeerDiaper,Milk,(c2019年10月20日星期日数据挖掘导论7AssociationRuleMiningTaskGivenasetoftransactionsT,thegoalofassociationruleminingistofindallruleshavingsupport≥minsupthresholdconfidence≥minconfthresholdBrute-forceapproach:ListallpossibleassociationrulesComputethesupportandconfidenceforeachrulePrunerulesthatfailtheminsupandminconfthresholdsComputationallyprohibitive2019年10月20日星期日数据挖掘导论8MiningAssociationRulesObservations:Alltheaboverulesarebinarypartitionsofthesameitemset:{Milk,Diaper,Beer}RulesoriginatingfromthesameitemsethaveidenticalsupportbutcanhavedifferentconfidenceThus,wemaydecouplethesupportandconfidencerequirementsTIDItems1Bread,Milk2Bread,Diaper,Beer,Eggs3Milk,Diaper,Beer,Coke4Bread,Milk,Diaper,Beer5Bread,Milk,Diaper,CokeExampleofRules:{Milk,Diaper}{Beer}(s=0.4,c=0.67){Milk,Beer}{Diaper}(s=0.4,c=1.0){Diaper,Beer}{Milk}(s=0.4,c=0.67){Beer}{Milk,Diaper}(s=0.4,c=0.67){Diaper}{Milk,Beer}(s=0.4,c=0.5){Milk}{Diaper,Beer}(s=0.4,c=0.5)2019年10月20日星期日数据挖掘导论9MiningAssociationRulesTwo-stepapproach:1.FrequentItemsetGeneration–Generateallitemsetswhosesupportminsup2.RuleGeneration–Generatehighconfidencerulesfromeachfrequentitemset,whereeachruleisabinarypartitioningofafrequentitemsetFrequentitemsetgenerationisstillcomputationallyexpensive6.2频繁项集的产生2019年10月20日星期日数据挖掘导论11FrequentItemsetGenerationnullABACADAEBCBDBECDCEDEABCDEABCABDABEACDACEADEBCDBCEBDECDEABCDABCEABDEACDEBCDEABCDEGivenditems,thereare2dpossiblecandidateitemsets2019年10月20日星期日数据挖掘导论12FrequentItemsetGenerationBrute-forceapproach:EachitemsetinthelatticeisacandidatefrequentitemsetCountthesupportofeachcandidatebyscanningthedatabaseMatcheachtransactionagainsteverycandidateComplexity~O(NMw)=ExpensivesinceM=2d!!!TIDItems1Bread,Milk2Bread,Diaper,Beer,Eggs3Milk,Diaper,Beer,Coke4Bread,Milk,Diaper,Beer5Bread,Milk,Diaper,CokeNTransactionsListofCandidatesMw2019年10月20日星期日数据挖掘导论13FrequentItemsetGenerationStrategiesReducethenumberofcandidates(M)Completesearch:M=2dUsepruningtechniquestoreduceMReducethenumberofcomparisons(NM)UseefficientdatastructurestostorethecandidatesortransactionsNoneedtomatcheverycandidateagainsteverytransaction2019年10月20日星期日数据挖掘导论14ReducingNumberofCandidatesAprioriprinciple:Ifanitemsetisfrequent,thenallofitssubsetsmustalsobefrequentIfasubsetofanitemsetisinfrequent,thentheitemsetmustbeinfrequentAprioriprincipleholdsduetothefollowingpropertyofthesupportmeasure:SupportofanitemsetneverexceedsthesupportofitssubsetsThisisknownastheanti-monotonepropertyofsupport)()()(:,YsXsYXYX2019年10月20日星期日数据挖掘导论15IllustratingAprioriPrinciplenullABACADAEBCBDBECDCEDEABCDEABCABDABEACDACEADEBCDBCEBDECDEABCDABCEABDEACDEBCDEABCDEPrunedsupersetsFoundtobeInfrequent2019年10月20日星期日数据挖掘导论16IllustratingAprioriPrincipleItemCountBread4Coke2Milk4Beer3Diaper4Eggs1ItemsetCount{Bread,Milk}3{Bread,Beer}2{Bread,Diaper}3{Milk,Beer}2{Milk,Diaper}3{Beer,Diaper}3Items(1-itemsets)Pairs(2-itemsets)Triplets(3-itemsets)MinimumSupport=3Ifeverysubsetisconsidered,Withsupport-basedpruning,6+6+1=1341362616CCCItemsetCount{Bread,Milk,Diaper}3(NoneedtogeneratecandidatesinvolvingCokeorEggs)TIDItems1Bread,Milk2Bread,Diaper,Beer,Eggs3Milk,Diaper,Beer,Coke4Bread,Milk,Diaper,Beer5Bread,Milk,Diaper,Coke2019年10月20日星期日数据挖掘导论17AprioriAlgorithmMethod:(Algorithm6.1)Letk=1Generatefrequentitemsetsoflength1RepeatuntilnonewfrequentitemsetsareidentifiedGenerateCk+1:Generatelength(k+1)candidateitemsetsfromlengthkfrequentitemsets,Prune:prunecandidateitemsetscontainingsubsetsoflengthkthatareinfreque