2020年2月10日星期一DataMining:ConceptsandTechniques1Chapter5:MiningFrequentPatterns,AssociationandCorrelationsBasicconceptsandaroadmapEfficientandscalablefrequentitemsetminingmethodsMiningvariouskindsofassociationrulesFromassociationminingtocorrelationanalysisConstraint-basedassociationminingSummary2020年2月10日星期一DataMining:ConceptsandTechniques2WhatIsFrequentPatternAnalysis?Frequentpattern:apattern(asetofitems,subsequences,substructures,etc.)thatoccursfrequentlyinadatasetFirstproposedbyAgrawal,Imielinski,andSwami[AIS93]inthecontextoffrequentitemsetsandassociationruleminingMotivation:FindinginherentregularitiesindataWhatproductswereoftenpurchasedtogether?—Beeranddiapers?!WhatarethesubsequentpurchasesafterbuyingaPC?WhatkindsofDNAaresensitivetothisnewdrug?Canweautomaticallyclassifywebdocuments?ApplicationsBasketdataanalysis,cross-marketing,catalogdesign,salecampaignanalysis,Weblog(clickstream)analysis,andDNAsequenceanalysis.2020年2月10日星期一DataMining:ConceptsandTechniques3WhyIsFreq.PatternMiningImportant?DisclosesanintrinsicandimportantpropertyofdatasetsFormsthefoundationformanyessentialdataminingtasksAssociation,correlation,andcausalityanalysisSequential,structural(e.g.,sub-graph)patternsPatternanalysisinspatiotemporal,multimedia,time-series,andstreamdataClassification:associativeclassificationClusteranalysis:frequentpattern-basedclusteringDatawarehousing:icebergcubeandcube-gradientSemanticdatacompression:fasciclesBroadapplications2020年2月10日星期一DataMining:ConceptsandTechniques4BasicConcepts:FrequentPatternsandAssociationRulesItemsetX={x1,…,xk}FindalltherulesXYwithminimumsupportandconfidencesupport,s,probabilitythatatransactioncontainsXYconfidence,c,conditionalprobabilitythatatransactionhavingXalsocontainsYLetsupmin=50%,confmin=50%Freq.Pat.:{A:3,B:3,D:4,E:3,AD:3}Associationrules:AD(60%,100%)DA(60%,75%)CustomerbuysdiaperCustomerbuysbothCustomerbuysbeerTransaction-idItemsbought10A,B,D20A,C,D30A,D,E40B,E,F50B,C,D,E,F2020年2月10日星期一DataMining:ConceptsandTechniques5ClosedPatternsandMax-PatternsAlongpatterncontainsacombinatorialnumberofsub-patterns,e.g.,{a1,…,a100}contains(1001)+(1002)+…+(110000)=2100–1=1.27*1030sub-patterns!Solution:Mineclosedpatternsandmax-patternsinsteadAnitemsetXisclosedifXisfrequentandthereexistsnosuper-patternYכX,withthesamesupportasX(proposedbyPasquier,etal.@ICDT’99)AnitemsetXisamax-patternifXisfrequentandthereexistsnofrequentsuper-patternYכX(proposedbyBayardo@SIGMOD’98)Closedpatternisalosslesscompressionoffreq.patternsReducingthe#ofpatternsandrules2020年2月10日星期一DataMining:ConceptsandTechniques6ClosedPatternsandMax-PatternsExercise.DB={a1,…,a100,a1,…,a50}Min_sup=1.Whatisthesetofcloseditemset?a1,…,a100:1a1,…,a50:2Whatisthesetofmax-pattern?a1,…,a100:1Whatisthesetofallpatterns?!!2020年2月10日星期一DataMining:ConceptsandTechniques7Chapter5:MiningFrequentPatterns,AssociationandCorrelationsBasicconceptsandaroadmapEfficientandscalablefrequentitemsetminingmethodsMiningvariouskindsofassociationrulesFromassociationminingtocorrelationanalysisConstraint-basedassociationminingSummary2020年2月10日星期一DataMining:ConceptsandTechniques8ScalableMethodsforMiningFrequentPatternsThedownwardclosurepropertyoffrequentpatternsAnysubsetofafrequentitemsetmustbefrequentIf{beer,diaper,nuts}isfrequent,sois{beer,diaper}i.e.,everytransactionhaving{beer,diaper,nuts}alsocontains{beer,diaper}Scalableminingmethods:ThreemajorapproachesApriori(Agrawal&Srikant@VLDB’94)Freq.patterngrowth(FPgrowth—Han,Pei&Yin@SIGMOD’00)Verticaldataformatapproach(Charm—Zaki&Hsiao@SDM’02)2020年2月10日星期一DataMining:ConceptsandTechniques9Apriori:ACandidateGeneration-and-TestApproachAprioripruningprinciple:Ifthereisanyitemsetwhichisinfrequent,itssupersetshouldnotbegenerated/tested!(Agrawal&Srikant@VLDB’94,Mannila,etal.@KDD’94)Method:Initially,scanDBoncetogetfrequent1-itemsetGeneratelength(k+1)candidateitemsetsfromlengthkfrequentitemsetsTestthecandidatesagainstDBTerminatewhennofrequentorcandidatesetcanbegenerated2020年2月10日星期一DataMining:ConceptsandTechniques10TheAprioriAlgorithm—AnExampleDatabaseTDB1stscanC1L1L2C2C22ndscanC3L33rdscanTidItems10A,C,D20B,C,E30A,B,C,E40B,EItemsetsup{A}2{B}3{C}3{D}1{E}3Itemsetsup{A}2{B}3{C}3{E}3Itemset{A,B}{A,C}{A,E}{B,C}{B,E}{C,E}Itemsetsup{A,B}1{A,C}2{A,E}1{B,C}2{B,E}3{C,E}2Itemsetsup{A,C}2{B,C}2{B,E}3{C,E}2Itemset{B,C,E}Itemsetsup{B,C,E}2Supmin=22020年2月10日星期一DataMining:ConceptsandTechniques11TheAprioriAlgorithmPseudo-code:Ck:CandidateitemsetofsizekLk:frequentitemsetofsizekL1={frequentitems};for(k=1;Lk!=;k++)dobeginCk+1=candidatesgeneratedfromLk;foreachtransactiontindatabasedoincrementthecountofallcandidatesinCk+1thatarecontainedintLk+1=candidatesinCk+1withmin_supportendreturnkLk;2020年2月10日星期一DataMining:ConceptsandTechniques12ImportantDetailsofAprioriHowtogeneratecandidates?Step1:self-joiningLkStep2:pruningHowtocountsupportsofcandidates?ExampleofCandidate-generationL3={abc,abd,acd,ace,bcd}Self-joining:L3*L3abcdfromabcandabdacdefromacdandacePruning:acdeisremovedb