Parallel Bifold Large-Scale Parallel Pattern Minin

整理文档很辛苦,赏杯茶钱您下走!

免费阅读已结束,点击下载阅读编辑剩下 ...

阅读已结束,您可以下载文档离线阅读编辑

资源描述

ParallelBifold:Large-ScaleParallelPatternMiningwithConstraintsMohammadEl-Hajj,OsmarR.Za¨ıaneDepartmentofComputingScience,UofA,Edmonton,AB,Canada{mohammad,zaiane}@cs.ualberta.caUniversityOfAlbertaAbstract.Whencomputationallyfeasible,mininghugedatabasespro-ducestremendouslylargenumbersoffrequentpatterns.Inmanycases,itisimpracticaltominethosedatasetsduetotheirsheersize;notonlytheextentoftheexistingpatterns,butmainlythemagnitudeofthesearchspace.Manyapproacheshavesuggestedtheuseofconstraintstoapplytothepatternsorsearchingforfrequentpatternsinparallel.Sofar,thoseapproachesarestillnotgenuinelyeffectivetomineextremelylargedatasets.Weproposeamethodthatcombinesbothstrategiesefficiently,i.e.min-inginparallelforthesetofpatternswhilepushingconstraints.Usingthisapproachwecouldminesignificantlylargedatasets;withsizesneverreportedintheliteraturebefore.Weareabletoeffectivelydiscoverfre-quentpatternsinadatabasemadeofbilliontransactionsusinga32processorsclusterinlessthan2hours.1IntroductionFrequentItemsetMining(FIM)isakeycomponentofmanyalgorithmswhichex-tractpatternsfromtransactionaldatabases.Forexample,FIMcanbeleveragedtoproduceassociationrules,clusters,classifiersorcontrastsets.Thiscapabilityprovidesastrategicresourcefordecisionsupport,andismostcommonlyusedformarketbasketanalysis.Onechallengeforfrequentitemsetminingisthepotentiallyhugenumberofextractedpatterns,whichcaneclipsetheoriginaldatabaseinsize.Inadditiontoincreasingthecostofmining,thismakesitmoredifficultforuserstofindthevaluablepatterns.Introducingconstraintstotheminingprocesshelpsmitigatebothissues.Decisionmakerscanrestrictdiscoveredpatternsaccordingtospec-ifiedrules.Byapplyingtheserestrictionsasearlyaspossible,thecostofminingcanbeconstrained.Forexample,usersmaybeinterestedinpurchaseswhosetotalpriceexceeds$100,orwhoseitemscostbetween$50and$100.Whilediscoveringhiddenknowledgeintheavailablerepositoriesofdataisanimportantgoalfordecisionmakers,discoveringthisknowledgeina“reasonable”timeiscapital.Despitetheincreaseindatacollection,therapidityofthepatterndiscoveryremainsvitalandwillalwaysbeessential.Speedinguptheprocessofknowledgediscoveryhasbecomeacriticalproblem,andparallelismisshowntoDistributedandParalleldatabases(Springer)200620:225-2432beapotentialsolutionforsuchascalabilitypredicament.Naturally,paralleliza-tionisnottheonlyandshouldnotbethefirstsolutiontospeedupthedataminingprocess.Indeed,otherapproachesmighthelpinachievingthisgoal,suchassampling,attributeselection,restrictionofsearchspace,andalgorithmorcodeoptimization[15].Someoftheseapproachesmightbeusedinconjunctionwithparallelismtoachievethedesiredspeedup.Alegitimateissueiswhetherparallelismisneededindatamining.Efficiencyiscrucialinknowledgediscoverysystems,andwiththeexplosivegrowthofdatacollection,sequentialdatamin-ingalgorithmshavebecomeanunacceptablesolutiontomostrealsizeproblemsevenaftercleveroptimizations.Toillustratethecomplexityoftheproblemoffrequentitemsetenumerationintoday’srealdata,assumeasmalltokencasewithonly5possibleitems(i.e.astorethatsellsonly5distinctproducts),thelatticethatrepresentsallpossiblecandidatefrequentpatternshas25−1=31itemsets.Applicationsthatgeneratetransactionswithsizesgreaterthan100itemspertransactionarecommon.Inthosecases,tofindafrequentitemsetwithsize100,itwouldtakeasearchspaceof2100−1=1.27∗1030itemsets.Addingthefactthatmostrealtransactionaldatabasesareintheorderofmil-lions,ifnotbillions,oftransactionsandtheproblembecomesintractablewithcurrentsequentialsolutions.Withhundredsofgigabytes,andoftenterabytesandthousandsofdistinctitems,itisunrealisticforoneprocessortominethedatasequentially,especiallywhenmultiplepassesovertheseenormousdatabasesarerequired.Therearedifferentdesignissuesthataffectbuildingparallelfrequentminingalgorithms[35,34].Thesedesignissuesaresignificantlyaffectedbythespecifi-cationoftheproblemthatthesystemistryingtosolve.Constraintbasedminingisanongoingareaofresearchwheretwoimportantcategoriesofconstraintsmonotoneandanti-monotone[20]arestudiedinthiswork.Anti-monotoneconstraintsareconstraintsthatwhenvalidforapattern,theyareconsequentiallyvalidforanysubsetsubsumedbythepattern.Mono-toneconstraintswhenvalidforapatternareinevitablyvalidforanysupersetsubsumingthatpattern.Thestraightforwardwaytodealwithconstraintsistousethemasafilterpost-mining.Howeveritismoreefficienttoconsidertheconstraintsduringtheminingprocess.Thisiswhatisrefereedtoas“pushingtheconstraints”[24].Mostexistingalgorithmsleverage(orpush)oneofthesetypesduringminingandpostponetheothertoapost-processingphase.1.1ProblemStatementTheproblemofminingassociationrulesovermarketbasketanalysiswasintro-ducedin[1,2].Theproblemconsistsoffindingassociationsbetweenitemsoritemsetsintransactionaldata.Thedataistypicallyretailsalesintheformofcustomertransactions,butcanbeanydatathatcanbemodeledintotransac-tions.Forexamplemedicalimageswhereeachimageismodeledbyatransactionofvisualfeaturesfromtheimage[33],ortextdatawhereeachdocumentismod-eledbyatransactionrepresentingabagofwords[14],orwebaccessdatawhereclick-streamvisitationismodeledbysetsoftransactions[1

1 / 21
下载文档,编辑使用

©2015-2020 m.777doc.com 三七文档.

备案号:鲁ICP备2024069028号-1 客服联系 QQ:2149211541

×
保存成功