ch04 - Mining Data Streams 1


Mining of Massive Datasets
Jure Leskovec, Anand Rajaraman, Jeff Ullman
Stanford University

In many data mining situations, we do not know the entire data set in advance.
Stream management is important when the input rate is controlled externally:
- Google queries
- Twitter or Facebook status updates
We can think of the data as infinite and non-stationary (the distribution changes over time).

Input elements enter at a rapid rate, at one or more input ports (i.e., streams).
We call elements of the stream tuples.
The system cannot store the entire stream accessibly.
Q: How do you make critical calculations about the stream using a limited amount of (secondary) memory?

Stochastic Gradient Descent (SGD) is an example of a stream algorithm.
In Machine Learning we call this: Online Learning.
- Allows for modeling problems where we have a continuous stream of data
- We want an algorithm to learn from it and slowly adapt to the changes in data
Idea: Do slow updates to the model.
- SGD (SVM, Perceptron) makes small updates
- So: First train the classifier on training data
- Then: For every example from the stream, we slightly update the model (using a small learning rate)

Types of queries one wants to answer on a data stream (we'll do these today):
- Sampling data from a stream: construct a random sample
- Queries over sliding windows: number of items of type x in the last k elements of the stream

Types of queries one wants to answer on a data stream (we'll do these next time):
- Filtering a data stream: select elements with property x from the stream
- Counting distinct elements: number of distinct elements in the last k elements of the stream
- Estimating moments: estimate avg./std. dev. of the last k elements
- Finding frequent elements

Mining query streams
- Google wants to know what queries are more frequent today than yesterday
Mining click streams
- Yahoo wants to know which of its pages are getting an unusual number of hits in the past hour
Mining social network news feeds
- E.g., look for trending topics on Twitter, Facebook

Sensor networks
- Many sensors feeding into a central controller
Telephone call records
- Data feeds into customer bills as well as settlements between telephone companies
IP packets monitored at a switch
- Gather information for optimal routing
- Detect denial-of-service attacks

Since we cannot store the entire stream, one obvious approach is to store a sample.
Two different problems:
- (1) Sample a fixed proportion of elements in the stream (say 1 in 10)
- (2) Maintain a random sample of fixed size over a potentially infinite stream: at any "time" k we would like a random sample of s elements
What is the property of the sample we want to maintain?
For all time steps k, each of the k elements seen so far has equal probability of being sampled.

Problem 1: Sampling a fixed proportion
Scenario: search engine query stream
- Stream of tuples: (user, query, time)
- Answer questions such as: How often did a user run the same query in a single day?
- Have space to store 1/10th of the query stream
Naïve solution:
- Generate a random integer in [0..9] for each query
- Store the query if the integer is 0, otherwise discard

Simple question: What fraction of queries by an average search engine user are duplicates?
Suppose each user issues x queries once and d queries twice (total of x + 2d queries).
- Correct answer: d/(x + d)
Proposed solution: We keep 10% of the queries.
- In expectation, the sample contains x/10 of the singleton queries and 2d/10 occurrences of the duplicate queries
- But only d/100 pairs of duplicates: d/100 = 1/10 ∙ 1/10 ∙ d
- Of the d "duplicates", 18d/100 appear exactly once: 18d/100 = ((1/10 ∙ 9/10) + (9/10 ∙ 1/10)) ∙ d
- So the sample-based answer is

$$\frac{d/100}{x/10 + d/100 + 18d/100} = \frac{d}{10x + 19d}$$

Pick 1/10th of users and take all their searches in the sample.
Use a hash function that hashes the user name or user id uniformly into 10 buc
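
The bias of the naïve per-query sample, and the way per-user sampling removes it, can be checked numerically. The following is a minimal Python sketch and not part of the slides. The helper names (`make_stream`, `sample_by_query`, `user_bucket`, `sample_by_user`, `duplicate_fraction`), the MD5-based hash, the choice to keep bucket 0, and the values x = d = 10 with 2000 users are all assumptions made for illustration. The time field of each tuple is omitted, since it plays no role in the calculation.

```python
import hashlib
import random
from collections import Counter


def make_stream(num_users, x, d):
    """Yield (user, query) tuples: each user issues x queries once and d queries twice."""
    for u in range(num_users):
        user = f"user{u}"
        for i in range(x):
            yield user, f"single{i}"
        for i in range(d):
            yield user, f"double{i}"
            yield user, f"double{i}"


def sample_by_query(stream, rng):
    """Naive scheme: keep each tuple independently with probability 1/10."""
    return [t for t in stream if rng.randrange(10) == 0]


def user_bucket(user, num_buckets=10):
    """Hash a user id to a bucket (MD5 is an arbitrary, assumed choice of hash)."""
    digest = hashlib.md5(user.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets


def sample_by_user(stream):
    """Keep all tuples of users whose hash lands in bucket 0 (about 1/10 of users)."""
    return [t for t in stream if user_bucket(t[0]) == 0]


def duplicate_fraction(sample):
    """Fraction of distinct (user, query) pairs in the sample that occur twice."""
    counts = Counter(sample)
    return sum(1 for c in counts.values() if c == 2) / len(counts)


x, d, num_users = 10, 10, 2000
true_fraction = d / (x + d)                # correct answer from the slides
predicted_naive = d / (10 * x + 19 * d)    # biased answer derived in the slides

naive_estimate = duplicate_fraction(
    sample_by_query(make_stream(num_users, x, d), random.Random(0))
)
user_estimate = duplicate_fraction(sample_by_user(make_stream(num_users, x, d)))

print(f"true fraction:         {true_fraction:.4f}")
print(f"predicted naive value: {predicted_naive:.4f}")
print(f"naive sample estimate: {naive_estimate:.4f}")
print(f"per-user estimate:     {user_estimate:.4f}")
```

Because a sampled user contributes all of their queries, every duplicate pair survives intact, so the per-user estimate matches d/(x + d). The per-query estimate instead lands near d/(10x + 19d).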
