Data Stream Mining



Email: jiangshengyi@163.com

- Web
- 2006: 33; 2005: 18; 2004: 15; 2003: 15; 2002: 6; 2001: 3; 2000: 1

1
- TriggerMan
- OpenCQ (Georgia Tech)
- Niagara-CQ (Wisconsin-Madison)
- CACQ
- Aurora (MIT/Brown/Brandeis)
- AdaptiveCQ
- TelegraphCQ (U.C. Berkeley)
- STREAM (Stanford)

[Figure: Data Stream Management System (DSMS). A User/Application registers a query with the Stream Query Processor and receives Results; the processor uses Scratch Space (memory and/or disk).]

2
- Sketch
- Exponential histogram (EH)
- Sticky sampling / lossy counting
- Counting Bloom filter
- False positive

3
- K-means: Fayyad et al. / Guha et al.
- Han et al.
- Hoeffding tree, VFDT: Gibbons et al.
- VFDT, CVFDT: Gibbons et al.
- Ensemble: Han et al.

4
- Time series
- Sequential pattern
- ...
  - Muthukrishnan et al.
  - Change: Gehrke et al.
  - Burst: Shasha et al., Kleinberg
  - Shasha et al.
  - ...

(1)
- Stanford STREAM (Stanford stream data manager) project: R. Motwani
- UIUC MAIDS (mining alarming incidents in data streams): C. Aggarwal, J. Han

(2)
- Count Bloom filter [CIKM'2003]
- False positive [VLDB'2004]
- [DASFAA'2003]
- [SDM'2006]
- (burst) [DASFAA'2005]
- [ICDE'2006]
- (toolkit)

(3)
- UIUC MAIDS: ~hanj

Characteristics of Data Streams
- Data streams
  - Data streams: continuous, ordered, changing, fast, huge amount
  - Traditional DBMS: data stored in finite, persistent data sets
- Characteristics
  - Huge volumes of continuous data, possibly infinite
  - Fast changing, requiring fast, real-time response
  - Random access is expensive: single linear scan algorithms (can only have one look)
  - Store only the summary of the data seen thus far
  - Most stream data are at a pretty low level or multi-dimensional in nature, and need multi-level and multi-dimensional processing

Stream Data Applications
- Telecommunication calling records
- Business: credit card transaction flows
- Network monitoring and traffic engineering
- Financial market: stock exchange
- Engineering & industrial processes: power supply & manufacturing
- Sensor, monitoring & surveillance: video streams
- Security monitoring
- Web logs and Web page click streams
- Massive data sets (even saved, but random access is too expensive)
- ...

DBMS versus DSMS

DBMS                                   DSMS
Persistent relations                   Transient streams
One-time queries                       Continuous queries
Random access                          Sequential access
"Unbounded" disk store                 Bounded main memory
Only current state matters             Historical data is important
No real-time services                  Real-time requirements
Relatively low update rate             Possibly multi-GB arrival rate
Data at any granularity                Data at fine granularity
Assume precise data                    Data stale/imprecise
Access plan determined by query        Unpredictable/variable data arrival
processor, physical DB design          and characteristics

Ack.: from Motwani's PODS tutorial slides

Challenges of Stream Data Processing
- Multiple, continuous, rapid, time-varying, ordered streams
- Main memory computations
- Queries are often continuous
  - Evaluated continuously as stream data arrives
  - Answer updated over time
- Queries are often complex
  - Beyond element-at-a-time processing
  - Beyond stream-at-a-time processing
  - Beyond relational queries (scientific, data mining, OLAP)
- Multi-level/multi-dimensional processing and data mining
  - Most stream data are at a pretty low level or multi-dimensional in nature

Processing Stream Queries
- Query types
  - One-time query vs. continuous query (evaluated continuously as the stream continues to arrive)
  - Predefined query vs. ad-hoc query (issued on-line)
- Unbounded memory requirements
  - For real-time response, main memory algorithms should be used
  - Memory requirement is unbounded if one will join future tuples
- Approximate query answering
  - With bounded memory, it is not always possible to produce exact answers
  - High-quality approximate answers are desired
  - Data reduction and synopsis construction methods
    - Sketches, random sampling, histograms, wavelets, etc.

Methods for Approximate Query Answering
- Sliding windows
  - Only over sliding windows of recent stream data
  - An approximation, but often more desirable in applications
- Batched processing, sampling and synopses
  - Batched if update is fast but computing is slow
    - Compute periodically, not very timely
  - Sampling if update is slow but computing is fast
    - Compute using sample data, but not good for joins, etc.
  - Synopsis data structures
    - Maintain a small synopsis or sketch of data
    - Good for querying historical data
- Blocking operators, e.g., sorting, avg, min, etc.
  - Blocking if unable to produce the first output until seeing the entire input

Stream Data Mining vs. Stream Querying
- Stream mining: a more challenging task in many cases
  - It shares most of the difficulties with stream querying
    - But often requires less "precision", e.g., no join, grouping, sorting
  - Patterns are hidden and more general than querying
  - It may require exploratory analysis
    - Not necessarily continuous queries
- Stream data mining tasks
  - Multi-dimensional on-line analysis of streams
  - Mining outliers and unusual patterns in stream data
  - Clustering data streams
  - Classification of stream data

Challenges for Mining Dynamics in Data Streams
- Most stream data are at a pretty low level or multi-dimensional in nature: needs ML/MD processing
- Analysis requirements
  - Multi-dimensional trends and unusual patterns
  - Capturing important changes at multi-dimensions/levels
  - Fast, real-time detection and response
  - Comparing with data cube: similarity and differences
- Stream (data) cube or stream OLAP: is this feasible?
  - Can we implement it efficiently?

Multi-Dimensional Stream Analysis: Examples
- Analysis of Web click streams
  - Raw data at low levels: seconds, web page addresses, user IP addresses, ...
  - Analysts want: changes, trends, unusual patterns, at reasonable levels of detail
  - E.g., average clicking traffic in Nor...
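
Two of the approximate query answering methods named in the slides, sliding windows and random sampling, can be illustrated with a short Python sketch. This is not code from the slides: the names SlidingWindowMean and reservoir_sample are hypothetical, and the sampler uses the classic reservoir technique (Algorithm R), chosen here only as one common way to sample a stream in a single pass with bounded memory.

```python
import random
from collections import deque


class SlidingWindowMean:
    """Mean over the most recent `size` items, using O(size) memory."""

    def __init__(self, size):
        if size <= 0:
            raise ValueError("size must be positive")
        self.size = size
        self.window = deque()
        self.total = 0.0

    def add(self, x):
        """Consume one stream item and return the current window mean."""
        self.window.append(x)
        self.total += x
        if len(self.window) > self.size:
            # Evict the oldest item so only recent data is kept.
            self.total -= self.window.popleft()
        return self.total / len(self.window)


def reservoir_sample(stream, k, rng=None):
    """Uniform sample of up to k items from a stream read once (Algorithm R)."""
    rng = rng or random.Random()
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Keep the new item with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = item
    return sample


if __name__ == "__main__":
    mean = SlidingWindowMean(3)
    print([mean.add(x) for x in [1, 2, 3, 4, 5]])
    print(reservoir_sample(range(1000), 10, random.Random(0)))
```

Over the whole stream, avg is a blocking operator, whereas the windowed mean above emits an answer after every item. The running total is kept in floating point, so a long-running version might periodically recompute it from the window to limit rounding drift.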
