A General Measure of Rule Interestingness

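Szymon Jaroszewicz, Dan A. Simovici
January 15, 2002

Abstract

The paper presents a new general measure of rule interestingness. Many known measures such as $\chi^2$, Gini gain or entropy gain can be obtained from this measure by setting some numerical parameters representing the amount of trust we have in the estimates of certain probabilities from the data. Moreover, we show that there is a continuum of measures having $\chi^2$, Gini gain and entropy gain as boundary cases. Properties and experimental evaluation of the new measure are also presented.

Keywords: interestingness measure, distribution, Cziszar divergence, Kullback-Leibler divergence, rule.

1 Introduction

Determining the interestingness of rules is an important data mining problem. Many data mining algorithms produce enormous amounts of rules, making it impossible for the user to analyze all of them by hand. It is thus essential to establish some measure by which rule interestingness can be expressed numerically and used, for example, to sort the discovered rules. Many such measures have been proposed and used in the literature (see [1] for a survey).

In this paper we concentrate on measures that assess how much knowledge we gain on the joint distribution of a set of attributes $Q$ from knowing the joint distribution of some set of attributes $P$. Examples of such measures are entropy gain, mutual information, Gini gain, and $\chi^2$ [7, 9, 3, 1, 11, 10]. The rules considered here are thus different from association rules studied in data mining, since we consider full joint distributions of both antecedent and consequent, while association rules consider only the probability of all attributes having some specified value. This approach has the advantage of natural applicability to multivalued attributes.

In this paper we demonstrate that all the above mentioned measures are special cases of a more general parametric measure of interestingness, and that by choosing two numerical parameters a continuum of measures can be obtained containing several well-known interestingness measures as special cases.

Next, we give some essential definitions.

Definition 1. A probability distribution is a matrix of the form
\[
\Delta = \begin{pmatrix} x_1 & \cdots & x_m \\ p_1 & \cdots & p_m \end{pmatrix},
\]
where $p_i \ge 0$ for $1 \le i \le m$ and $\sum_{i=1}^{m} p_i = 1$. $\Delta$ is a uniform distribution if $p_1 = \cdots = p_m = \frac{1}{m}$. An $m$-valued uniform distribution will be denoted by $U_m$.

Let $\tau = (T, H, \rho)$ be a database table, where $T$ is the name of the table, $H$ is its heading, and $\rho$ is its content. If $A \in H$ is an attribute of $\tau$, the domain of $A$ in $\tau$ is denoted by $\mathrm{dom}(A)$. The projection of a tuple $t \in \rho$ on a set of attributes $L \subseteq H$ is denoted by $t[L]$. For more on relational notation and terminology see [13].

Definition 2. The distribution of a set of attributes $L = \{A_1, \ldots, A_n\}$ is the matrix
\[
\Delta_{L,\tau} = \begin{pmatrix} \ell_1 & \cdots & \ell_r \\ p_1 & \cdots & p_r \end{pmatrix}, \qquad (1)
\]
where $r = \prod_{j=1}^{n} |\mathrm{dom}(A_j)|$, $\ell_i \in \mathrm{dom}(A_1) \times \cdots \times \mathrm{dom}(A_n)$, and $p_i = \frac{|\{t \in \rho \mid t[L] = \ell_i\}|}{|\rho|}$ for $1 \le i \le r$. The subscript $\tau$ will be omitted when the table is clear from context.

Suppose that the distribution of the attribute set $L$ in the table $\tau = (T, H, \rho)$ is
\[
\Delta_L = \begin{pmatrix} \ell_1 & \cdots & \ell_r \\ p_1 & \cdots & p_r \end{pmatrix}.
\]
The Havrda-Charvat $\alpha$-entropy of the attribute set $L$ (see [6]) is defined as:
\[
H_\alpha(L) = \frac{1}{1-\alpha} \left( \sum_{j=1}^{r} p_j^{\alpha} - 1 \right).
\]
The limit case, when $\alpha$ tends towards 1, yields the Shannon entropy:
\[
H(L) = -\sum_{j=1}^{r} p_j \log p_j.
\]
Another important case is obtained when $\alpha = 2$. In this case, we obtain the Gini index of $L$ (see [1]), given by:
\[
\mathrm{gini}(L) = 1 - \sum_{j=1}^{r} p_j^2.
\]

If $L, K$ are two sets of attributes of a table $\tau$ that have the distributions
\[
\Delta_L = \begin{pmatrix} \ell_1 & \cdots & \ell_m \\ p_1 & \cdots & p_m \end{pmatrix}
\quad\text{and}\quad
\Delta_K = \begin{pmatrix} k_1 & \cdots & k_n \\ q_1 & \cdots & q_n \end{pmatrix},
\]
then the conditional Shannon entropy of $L$ conditioned upon $K$ is given by
\[
H(L \mid K) = -\sum_{i=1}^{m} \sum_{j=1}^{n} p_{ij} \log \frac{p_{ij}}{q_j},
\]
where $p_{ij} = \frac{|\{t \in \rho \mid t[L] = \ell_i \text{ and } t[K] = k_j\}|}{|\rho|}$ for $1 \le i \le m$ and $1 \le j \le n$. Similarly, the Gini conditional index of these distributions is:
\[
\mathrm{gini}(L \mid K) = 1 - \sum_{i=1}^{m} \sum_{j=1}^{n} \frac{p_{ij}^2}{q_j}.
\]

These definitions allow us to introduce the Shannon gain (called entropy gain in the literature [7]) and the Gini gain, defined as:
\[
\mathrm{gain}_{\mathrm{gini}}(L, K) = \mathrm{gini}(L) - \mathrm{gini}(L \mid K),
\]
\[
\mathrm{gain}_{\mathrm{shannon}}(L, K) = H(L) - H(L \mid K) = H(L) + H(K) - H(L \cup K), \qquad (2)
\]
respectively. Notice that the Shannon gain is identical to the mutual information between the attribute sets $L$ and $K$ [7]. For the Gini gain we can write:
\[
\mathrm{gain}_{\mathrm{gini}}(L, K) = \sum_{i=1}^{m} \sum_{j=1}^{n} \frac{p_{ij}^2}{q_j} - \sum_{i=1}^{m} p_i^2. \qquad (3)
\]

The product of the distributions $\Delta_P, \Delta_Q$, where
\[
\Delta_P = \begin{pmatrix} x_1 & \cdots & x_m \\ p_1 & \cdots & p_m \end{pmatrix}
\quad\text{and}\quad
\Delta_Q = \begin{pmatrix} y_1 & \cdots & y_n \\ q_1 & \cdots & q_n \end{pmatrix},
\]
is the distribution
\[
\Delta_P \times \Delta_Q = \begin{pmatrix} (x_1, y_1) & \cdots & (x_m, y_n) \\ p_1 q_1 & \cdots & p_m q_n \end{pmatrix}.
\]
The attribute sets $P, Q$ are independent if $\Delta_{PQ} = \Delta_P \times \Delta_Q$, where $PQ$ is an abbreviation for $P \cup Q$.

Definition 3. A rule is a pair of attribute sets $(P, Q)$. If $P, Q \subseteq H$, where $\tau = (T, H, \rho)$ is a table, then we refer to $(P, Q)$ as a rule of $\tau$. If $(P, Q)$ is a rule, then we refer to $P$ as the antecedent and to $Q$ as the consequent of the rule. A rule $(P, Q)$ will be denoted, following the prevalent convention in the literature, by $P \to Q$.

This broader definition of rules originates in [3], where rules were replaced by dependencies in order to capture statistical dependence in both the presence and absence of items in itemsets. The significance of this dependence was measured by the $\chi^2$ test, and our approach is a further extension of that point of view.

The notion of distribution divergence is central to the rest of the paper.

Definition 4. Let $\mathcal{D}$ be the class of distributions. A distribution divergence is a function $D : \mathcal{D} \times \mathcal{D} \to \mathbb{R}$ such that:

1. $D(\Delta, \Delta') \ge 0$, and $D(\Delta, \Delta') = 0$ if and only if $\Delta = \Delta'$, for every $\Delta, \Delta' \in \mathcal{D}$.
2. When $\Delta'$ is fixed, $D(\Delta, \Delta')$ is a convex function of $\Delta$; in other words, if $\Delta = a_1 \Delta_1 + \cdots + a_k \Delta_k$, where $a_1 + \cdots + a_k = 1$, then
\[
D(\Delta, \Delta') \le \sum_{i=1}^{k} a_i D(\Delta_i, \Delta').
\]

An important class of distribution divergences was obtained by Cziszar in [4] as:
\[
D_\phi(\Delta, \Delta') = \sum_{i=1}^{n} q_i \, \phi\!\left(\frac{p_i}{q_i}\right),
\]
where
\[
\Delta = \begin{pmatrix} k_1 & \cdots & k_n \\ p_1 & \cdots & p_n \end{pmatrix}
\quad\text{and}\quad
\Delta' = \begin{pmatrix} l_1 & \cdots & l_n \\ q_1 & \cdots & q_n \end{pmatrix}
\]
are two distributions and $\phi : \mathbb{R} \to \mathbb{R}$ is a twice differentiable convex function such that $\phi(1) = 0$. We will also make the additional assumption that $0 \cdot \phi\!\left(\frac{0}{0}\right) = 0$ to handle the case when, for some $i$, both $p_i$ and $q_i$ are zero. If for some $i$, $p_i > 0$ and $q_i = 0$, the value of $D_\phi(\Delta, \Delta')$ is undefined. The Cziszar divergence satisfies properties (1) and (2) given above (see [6]).

The following result shows the invariance of Cziszar divergence with respect to distribution product: The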
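
The definitions above (the empirical distribution of an attribute set, the Havrda-Charvat entropy and its Shannon and Gini special cases, the two gains, and the Cziszar divergence) can be illustrated with a minimal Python sketch. It assumes a table stored as a list of dicts; the function names `distribution`, `hc_entropy`, `gini`, `gains`, `csiszar` and `phi_kl` are hypothetical and do not come from the paper. The choice $\phi(x) = x \log x$, which turns the Cziszar divergence into the Kullback-Leibler divergence, is a standard fact that this excerpt does not state explicitly.

```python
import math
from collections import Counter


def distribution(rows, attrs):
    """Empirical distribution of the attribute set `attrs` (Definition 2).
    Only observed value combinations are stored; unobserved ones have probability 0."""
    counts = Counter(tuple(row[a] for a in attrs) for row in rows)
    return {values: c / len(rows) for values, c in counts.items()}


def hc_entropy(probs, alpha):
    """Havrda-Charvat alpha-entropy; alpha == 1 is handled as the Shannon limit."""
    probs = list(probs)
    if alpha == 1:
        return -sum(p * math.log(p) for p in probs if p > 0)
    return (sum(p ** alpha for p in probs) - 1) / (1 - alpha)


def gini(probs):
    """Gini index, i.e. the alpha = 2 case of the entropy above."""
    return 1 - sum(p * p for p in probs)


def gains(rows, L, K):
    """Return (gain_gini, gain_shannon) of attribute list L given attribute list K."""
    p_l = distribution(rows, L)
    p_k = distribution(rows, K)
    p_lk = distribution(rows, L + K)  # keys: L-values followed by K-values
    cond_gini = 1 - sum(p * p / p_k[key[len(L):]] for key, p in p_lk.items())
    cond_shannon = -sum(p * math.log(p / p_k[key[len(L):]]) for key, p in p_lk.items())
    gain_gini = gini(p_l.values()) - cond_gini
    gain_shannon = hc_entropy(p_l.values(), 1) - cond_shannon
    return gain_gini, gain_shannon


def csiszar(p, q, phi):
    """D_phi(p, q) = sum_i q_i * phi(p_i / q_i) for two aligned probability lists."""
    total = 0.0
    for p_i, q_i in zip(p, q):
        if q_i == 0:
            if p_i > 0:
                raise ValueError("divergence undefined: p_i > 0 while q_i = 0")
            continue  # convention 0 * phi(0/0) = 0
        total += q_i * phi(p_i / q_i)
    return total


def phi_kl(x):
    """Assumed generator phi(x) = x log x (Kullback-Leibler), with phi(0) taken as 0."""
    return x * math.log(x) if x > 0 else 0.0


# Toy table with two binary attributes (illustrative data, not from the paper).
rows = [{"A": 0, "B": 0}, {"A": 0, "B": 0}, {"A": 0, "B": 1}, {"A": 1, "B": 1}]
gain_gini, gain_shannon = gains(rows, ["A"], ["B"])

# Divergence of the joint distribution from the product of the marginals.
p_a, p_b = distribution(rows, ["A"]), distribution(rows, ["B"])
p_ab = distribution(rows, ["A", "B"])
pairs = [(a, b) for a in p_a for b in p_b]
joint = [p_ab.get(a + b, 0.0) for a, b in pairs]
product = [p_a[a] * p_b[b] for a, b in pairs]
kl = csiszar(joint, product, phi_kl)
```

On this toy table the Gini gain of A given B is 0.125, and `kl` agrees with `gain_shannon` up to rounding, which matches the remark that the Shannon gain is the mutual information between the two attribute sets. Natural logarithms are used throughout; another base only rescales the Shannon quantities.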
