序列模式序列模式简介GSP算法PrefixSpan算法本讲内容2序列模式简介3序列模式简介4序列模式简介项目集(Itemset):是各种项目组成的集合序列(Sequence):是不同项目集(ItemSet)的有序排列,序列s可以表示为s=s1s2…sl,sj(1=j=l)为项目集(Itemset),也称为序列s的元素序列的长度:一个序列所包含的项目集(ItemSet)的个数。5序列模式简介设=a1a2…an,=b1b2…bm,如果存在整数1=j1j2…jn=m,使得a1bj1,a2bj2,…,anbjn,则称序列为序列的子序列,又称序列包含序列,记为例如,序列(3)(45)(8)是序列(7)(38)(9)(456)(8)的子序列。但,(3)(5)不是(35)的子序列,反之亦然。给定一个序列集合,如果序列s不包含于任何一个其它的序列中,则称s是最大的(MaximalSequence)。6序列模式简介Customer-sequenceAllthetransactionsofacustomer,orderedbyincreasingtransaction-time,correspondstoasequence.序列在序列数据库S中的支持数为序列数据库S中包含序列的序列个数,记为Support()序列模式挖掘:找出所有最大的频繁子序列。7itemset的支持度:itemset作为一个整体,在整个数据库中出现的比例,比如,单次购物时同时购买该itemset中所有项目的顾客占全部顾客的百分比itemset的支持度等于1-sequence的支持度litemset(Largeitemset):AitemsetwithminimumsupportCandidatesequenceLargesequence序列模式简介ExampleQ.Howtofindthesequentialpatterns?ExampleItemItemsetTransaction以Customer_Id及TransactionTime排序Example(cont.)Sequence(30)(90)issupportedbycustomer1and430(4070)issupportedbycustomer2and4Withminimumsupportof2customers:Thelargeitemset(litemset):(30),(40),(70),(4070),(90)3-Sequence11GSP算法GSP算法FivephasesSortphaseLargeitemsetphaseTransformationphaseSequencephaseMaximalphaseAprioriSortthedatabasewithcustomer-idasthemajorkeyandtransaction-timeastheminorkeySortphaseFindthelargeitemset.ItemsetsmappingLitemsetphaseReason:可把litemset当作一个整体对待。比较两个litemsets是否相等用的时间是恒定的,在判断某个序列是否包含于另一个序列时可以省时间。TransformationphaseDeletingnon-largeitemsetsMappinglargeitemsetstointegersSequencephaseUsethesetoflitemsetstofindthelargesequence(frequentsequence).MaximalphaseFindthemaximumsequencesamongthesetoflargesequences.Insomealgorithms,thisphaseiscombinedwiththesequencephase.MaximalphaseAlgorithm:Sthesetofalllargesequencesnthelengthofthelongestsequencefor(k=n;k1;k--)doforeachk-sequenceskdoDeletefromSallsubsequencesofskAprioriAll算法ThebasicmethodtominesequentialpatternsBasedontheApriorialgorithm.Countallthelargesequences,includingnon-maximalsequences.UseApriori-generatefunctiontogeneratecandidatesequence.AprioriCandidateGenerationgeneratecandidatesforpassusingonlythelargesequencesfoundinthepreviouspassandthenmakesapassoverthedatatofindtheirsupport.Algorithm:Lk-1thesetofalllarge(k-1)-sequencesCkthesetofcandidatek-sequencesAprioriCandidateGenerationJoinPhase:insertintoCkselectp.litemset1,p.litemset2,…,p.litemsetk-1,q.litemsetk-1fromLk-1p,Lk-1qwherep.litemset1=q.litemset1,…,p.litemsetk-2=q.litemsetk-2;PrungPhase:forallsequencescCkdoforall(k-1)-subsequencessofcdoif(sLk-1)thendeletecfromCk;Example关联规则与序列模式生成候选集时的区别join序列模式需要关注顺序例如,有如下三条序列100(10)(20)200(20)(10)200(20)(10)假设minsupp=1,(10)(20)都是litemset,分别map为1,2则生成C2时需要同时包含12和21可以发现,序列(10)(20)和(20)(10)都是所需要的序列模式。如果join时不考虑顺序的话,将丢掉(20)(10)。AprioriAll(cont.)L1={large1-sequences};//Resultoflitemsetphasefor(k=2;Lk-1≠Φ;k++)dobeginCk=NewcandidategeneratefromLk-1foreachcustomer-sequencecinthedatabasedoIncrementthecountofallcandidatesinCkthatarecontainedincLk=CandidatesinCkwithminimumsupport.EndAnswer=MaximalSequencesinLk;24Example关联规则与序列模式对itmeset计数时的区别如果相同的项目集合在同一个序列里出现多次的话,只记一次例如,有如下三条序列100(10)(20)(10)200(20)200(20)(30)假设minsuff=2,则(20)是litemset,而(10)只能算出现了一次,因而不是litemset.Example:(CustomerSequences)AprioriCandidateGeneration{15}{2}{3}{4}{1}{3}{4}{35}{1}{2}{3}{4}{1}{3}{5}{4}{5}nextstep:findthelarge1-sequencesWithminimumsetto25%nextstep:findthelarge2-sequencesSequenceSupport12345{15}{2}{3}{4}{1}{3}{4}{35}{1}{2}{3}{4}{1}{3}{5}{4}{5}ExampleLarge1-Sequence4244427Example12211331144115512332244225523443355345541221133114411551233224422552344335534554L1joinL2SequenceSupport1422344454prungSequenceSupport122134143153232242343352452{15}{2}{3}{4}{1}{3}{4}{35}{1}{2}{3}{4}{1}{3}{5}{4}{5}28123132124142125152134143135153145154234243345354123132124142125152134143135153145154234243345354122134143153232242343352452L2joinL3prung{15}{2}{3}{4}{1}{3}{4}{35}{1}{2}{3}{4}{1}{3}{5}{4}{5}12321242134313522342291234124313451354123412321242134313522342L3joinL4prung{15}{2}{3}{4}{1}{3}{4}{35}{1}{2}{3}{4}{1}{3}{5}{4}{5}12342SequenceSupport12342ExampleSequenceSupport1422344452SequenceSupport122134143153232242343352452SequenceSupport12321242134313522342FindthemaximallargesequencesjudgmentWastetoomuchtimeincountingnon-maximalsequence,whichisimpossibletobeasequentialpattern.AprioriSomeItgeneratescandidatesforapassusingonlythelargesequencesfoundinthepreviouspass.Dividedinto2phase:forwardvs.backwardAdvantage:Reducecountingtimewastedincountingnon-maximalones.Nextfunction||/||kkCLhitsAprioriSome(cont.)//ForwardPhaseL1={Large1-sequences};//ResultofthelitemsetphaseC1=L1;//sothatwehaveaniceloopconditionLast=1;//welastcountedClastfor(k=2;Ck-1≠φandLlast≠φ;k++)dobeginif