Distances between data sets based on summary stati

整理文档很辛苦,赏杯茶钱您下走!

免费阅读已结束,点击下载阅读编辑剩下 ...

阅读已结束,您可以下载文档离线阅读编辑

资源描述

JournalofMachineLearningResearch8(2007)131-154Submitted2/05;Revised9/05;Published1/07DistancesbetweenDataSetsBasedonSummaryStatisticsNikolajTattiNTATTI@CC.HUT.FIHIITBasicResearchUnitLaboratoryofComputerandInformationScienceHelsinkiUniversityofTechnology,FinlandEditor:DaleSchuurmansAbstractTheconceptsofsimilarityanddistancearecrucialindatamining.Weconsidertheproblemofdefiningthedistancebetweentwodatasetsbycomparingsummarystatisticscomputedfromthedatasets.Theinitialdefinitionofourdistanceisbasedongeometricalnotionsofcertainsetsofdistributions.Weshowthatthisdistancecanbecomputedincubictimeandthatithasseveralintuitiveproperties.WealsoshowthatthisdistanceistheuniqueMahalanobisdistancesatisfyingcertainassumptions.Wealsodemonstratethatifwearedealingwithbinarydatasets,thenthedistancecanberepresentednaturallybycertainparityfunctions,andthatitcanbeevaluatedinlineartime.Ourempiricaltestswithrealworlddatashowthatthedistanceworkswell.Keywords:dataminingtheory,complexdata,binarydata,itemsets1.IntroductionInthispaperwewillconsiderthefollowingproblem:GiventwodatasetsD1andD2ofdimensionK,defineadistancebetweenD1andD2.Tobemoreprecise,weconsidertheproblemofdefiningthedistancebetweentwomultisetsoftransactions,eachsetsampledfromitsownunknowndistribution.WewilldefineadissimilaritymeasurebetweenD1andD2andwewillrefertothismeasureasCMdistance.Generallyspeaking,thenotionofdissimilaritybetweentwoobjectsisoneofthemostfunda-mentalconceptsindatamining.Ifoneisabletoretrieveadistancematrixfromasetofobjects,thenoneisabletoanalysedatabyusingforexample,clusteringorvisualisationtechniques.Manyrealworlddatacollectionsmaybenaturallydividedintoseveraldatasets.Forexample,ifadatacollectionconsistsofmoviesfromdifferenteras,thenwemaydividethemoviesintosubcollec-tionsbasedontheirreleaseyears.Adistancebetweenthesedata(sub)setswouldprovidemeanstoanalysethemassingleobjects.Suchanapproachmayeasethetaskofunderstandingcomplexdatacollections.LetuscontinuebyconsideringthepropertiestheCMdistanceshouldhave.Firstofall,itshouldbeametric.Themotivationbehindthisrequirementisthatthemetrictheoryisawell-knownareaandmetricshavemanytheoreticalandpracticalvirtues.Secondly,inourscenariothedatasetshavestatisticalnatureandtheCMdistanceshouldtakethisintoaccount.Forexample,considerthatbothdatasetsaregeneratedfromthesamedistribution,thentheCMdistanceshouldgivesmallvaluesandapproach0asthenumberofdatapointsinthedatasetsincreases.ThethirdrequirementisthatweshouldbeabletoevaluatetheCMdistancequickly.Thisrequirementiscrucialsincewemayhavehighdimensionaldatasetswithavastamountofdatapoints.c2007NikolajTatti.TATTITheCMdistancewillbebasedonsummarystatistics,features.Letusgiveasimpleexample:AssumethatwehavedatasetsD1=fA;B;A;AgandD2=fA;B;C;BgandassumethattheonlyfeatureweareinterestedinistheproportionofAinthedatasets.ThenwecansuggestthedistancebetweenD1andD2tobej3=41=4j=1=2.TheCMdistanceisbasedonthisidea;however,thereisasubtledifficulty:Ifwecalculateseveralfeatures,thenshouldwetakeintoaccountthecorrelationofthesefeatures?WewilldoexactlythatindefiningtheCMdistance.Therestofthispaperisorganisedasfollows.InSection2wegivethedefinitionoftheCMdistancebyusingsomegeometricalinterpretations.Wealsostudythepropertiesofthedistanceandprovideanalternativecharacterisation.InSection3westudytheCMdistanceandbinarydatasets.InSection4wediscusshowtheCMdistancecanbeusedwitheventsequencesandinSection5wecommentaboutthefeatureselection.Section6isdevotedforrelatedwork.TheempiricaltestsarerepresentedinSection7andweconcludeourworkwiththediscussioninSection8.2.TheConstrainedMinimumDistanceInthefollowingsubsectionwewilldefineourdistanceusinggeometricalintuitionandshowthatthedistancecanbeevaluatedefficiently.Inthesecondsubsectionwewilldiscussvariouspropertiesofthedistance,andinthelastsubsectionwewillprovideanalternativejustificationtothedistance.Theaimofthisjustificationistoprovidemoretheoreticalevidenceforourdistance.2.1TheDefinitionWebeginbygivingsomebasicdefinitions.ByadatasetDwemeanafinitecollectionofsampleslyinginsomefinitespaceΩ.ThesetΩiscalledsamplespace,andfromnowonwewilldenotethisspacebytheletterΩ.ThenumberofelementsinΩisdenotedbyjΩj.ThenumberofsamplesinthedatasetDisdenotedbyjDj.Aswesaidintheintroduction,ourgoalisnottodefineadistancedirectlyondatasetsbutratherthroughsomestatisticsevaluatedfromthedatasets.Inordertodoso,wedefineafeaturefunctionS:Ω!RNtomapapointinthesamplespacetoarealvector.ThroughoutthissectionSwillindicatesomegivenfeaturefunctionandNwillindicatethedimensionoftherangespaceofS.WewillalsodenotetheithcomponentofSbySi.Notethatifwehaveseveralfeaturefunctions,thenwecanjointhemintoonebigfeaturefunction.Afrequencyθ2RNofStakenwithrespecttoadatasetDistheaverageofvaluesofStakenoverthedataset,thatis,θ=1jDj∑ω2DS(ω).WedenotethisfrequencybyS(D).AlthoughwedonotmakeanyassumptionsconcerningthesizeofΩ,someofourchoicesaremotivatedbythinkingthatjΩjcanbeverylarge—solargethateventhesimplestoperation,say,enumeratingalltheelementsinΩ,isnottractable.Ontheotherhand,weassumethatNissuchthatanalgorithmexecutablein,say,O(N3)timeisfeasible.Ino

1 / 24
下载文档,编辑使用

©2015-2020 m.777doc.com 三七文档.

备案号:鲁ICP备2024069028号-1 客服联系 QQ:2149211541

×
保存成功