Quantifiable Data Mining Using Principal Component

整理文档很辛苦,赏杯茶钱您下走!

免费阅读已结束,点击下载阅读编辑剩下 ...

阅读已结束,您可以下载文档离线阅读编辑

资源描述

QuantiableDataMiningUsingPrincipalComponentAnalysisFlipKorn,AlexandrosLabrinidis,YannisKotidis,ChristosFaloutsos,AlexKaplunovich,DejanPerkovicfflip,labrinid,kotidis,christos,kaplunov,dejanpg@cs.umd.eduDepartmentofComputerScienceUniversityofMarylandatCollegeParkAbstractAssociationRuleMiningalgorithmsoperateonadatamatrix(e.g.,customersproducts)toderiverules[2,22].Weproposeasingle-passalgorithmformininglinearrulesinsuchamatrixbasedonPrincipalComponentAnalysis.PCAdetectscorrelatedcolumnsofthematrix,whichcorrespondto,e.g.,productsthatselltogether.Therstcontributionofthisworkisthatweproposetoquantifythe\goodnessofasetofdis-coveredrules.Wedenethe\guessingerror:theroot-mean-squareerrorofthereconstructedvaluesofthecellsofthegivenmatrix,whenwepretendthattheyareunknown.Thesecondcontributionisanovelmethodtoguessmissing/hiddenvaluesfromthelinearrulesthatourmethodderives.Forexample,ifsomebodybought$10ofmilkand$3ofbread,ourrulescan\guesstheamountspenton,say,butter.Thus,wecanperformavarietyofimportanttaskssuchasforecasting,‘what-if’scenarios,outlierdetection,andvisualization.Moreover,weshowthatwecancomputetheprincipalcomponentswithasinglepassoverthedataset.Experimentsonrealdatasets(e.g.,NBAstatistics)demonstratethattheproposedmethodcon-sistentlyachievesa\guessingerrorofupto5timeslowerthanthestraightforwardcompetitor.1IntroductionDataMininghasrecentlybeenreceivingincreasinginterest[9],ofwhichthequintessentialproblemisassociationrulemining[2].Givenadatamatrix,with,e.g.,customersforrowsandproductsforcolumns,wewanttondrules.Existingalgorithmsndrulesoftheformfbread;milkg)butter,meaningthatcustomerswhobuy\breadand\milkalsotendtobuy\butter.WhatdistinguishesdatabaseworkfromAI/MachineLearningandstatisticsworkisitsemphasisonlargedatasets.TheinitialassociationruleminingpaperbyAgrawaletal.[2],aswellasallthefollow-updatabasework[4],proposedalgorithmstominimizethetimetoextracttheserulesthroughcleverrecord-keepingtoavoidadditionalpassesoverthedataset,throughparallelism,etc.ThisresearchwaspartiallyfundedbytheInstituteforSystemsResearch(ISR),bytheNationalScienceFoundationunderGrantsNo.EEC-94-02384,IRI-9205273andIRI-9625428.1Whatisnovelaboutthepresentworkisthatitattemptstoassesshowgoodthederivedrulesare,anissuethathasnotbeenemphasizedininthedatabaseliterature.Weproposethe\guessingerrorasameasureofthe\goodnessofagivensetofrulesforagivendataset.Theideaistopretendthatacellofthematrixis\hiddenfromus,andtotrytoguessthemissingvalueusingthederivedrules;theroot-mean-squareguessingerror(averagedoverallthecellsofthegivenmatrix)indicateshowgoodasetofrulesis.Thesecondmajorinnovationofthisworkistheuseofprincipalcomponentanalysis(PCA)toderivelinearrules,oftheform\customerstypicallyspend1:2:5dollarsonbread:milk:butter.Linearrulescanbeeasilyusedforextrapolations,andcanhelpdeterminehidden,missingandcorruptedvalues.Weprovidenovelalgorithmsforestimatingmissingvalues,evenifmultiplevaluesaresimultaneouslymissing/hidden.PCA(vialinearrules)cansupportthefollowingapplications,thankstotheirabilitytoreconstructmissing/hiddenvalues:Datacleaning:reconstructinglostdataandrepairingnoisyordamageddata;Forecasting:‘ifacustomerspends$1onbreadand$2.50onham,howmuchwills/hespendonmayonnaise?’;What-ifscenariosanddecisionsupport:‘WeexpectdoubleddemandofCheerios;howmuchmilkshouldwestockupon?’;Outlierdetection:‘Whichcustomersdeviatefromthetypicalsalespattern?’;Visualization:PCA,beingidenticaltotheKarhunen-Loevetransform[8],istheoptimaldimen-sionalityreductionmethod,mappingtherowsofthedatamatrixinto2-or3-dimensionalpoints,thatcanbeplottedtorevealthestructureofthedataset(e.g.,clusters,linearcorrelations,etc.);Thepaperisorganizedasfollows:Section2discussespastwork.Section3presentsanintroductiontoPCA.Section4introducestheproposedmethod.Section5givestheexperimentalresults.Section6providesadiscussion.Section7givessomeconclusionsandpointerstofuturework.2RelatedWorkAgrawaletal.distinguishbetweenthreedataminingproblems:identifyingclassications,ndingsequentialpatterns,anddiscoveringassociationrules[1].Wereviewonlymaterialrelevanttothelattersinceitisthefocusofthispaper.See[7]foranexcellent,recentsurveyofallthreeproblems.Theseminalworkof[2]introducedtheproblemofdiscoveringassociationrulesandpresentedanecientalgorithmforminingthem.Sincethen,newserialalgorithms[4,15,19]andparallelalgorithms[14,3,10]havebeenproposed.Inaddition,generalizedassociationruleshasbeenthesubjectofrecentwork[21,11].ThevastmajorityofassociationrulediscoverytechniquesarebasicallyBoolean,sincetheydiscardthequantitiesoftheitemsboughtandonlypayattentiontowhethersomethingwasboughtornot.AnotableexceptionistheworkofSrikantandAgrawal[22],wheretheyaddresstheproblemofminingquantitativeassociationrules.Theirapproachistopartitioneachquantitativeattributeintoaset2SymbolDenitionNnumberofrecordsMnumberofattributeskcuto(numberofprincipalcomponentsretained)hnumberofholesHsetofcellswhichhaveholesRMS1root-mean-squarederrorovereachholeRMShroot-mean-squarederroroverhholesmatrixmultiplicationXtheNMdatamatrixXcthecenteredver

1 / 21
下载文档,编辑使用

©2015-2020 m.777doc.com 三七文档.

备案号:鲁ICP备2024069028号-1 客服联系 QQ:2149211541

×
保存成功