Stanford University - Big Data Mining - Web Mining Overview 2


CS345A Data Mining
Lecture 1: Introduction to Web Mining

What is Web Mining?
  - Discovering useful information from the World-Wide Web and its usage patterns

Web Mining v. Data Mining
  - Structure (or lack of it)
      - Textual information and linkage structure
  - Scale
      - Data generated per day is comparable to largest conventional data warehouses
  - Speed
      - Often need to react to evolving usage patterns in real-time (e.g., merchandising)

Web Mining topics
  - Web graph analysis
  - Power Laws and The Long Tail
  - Structured data extraction
  - Web advertising
  - Systems Issues

Size of the Web
  - Number of pages
      - Technically, infinite
      - Much duplication (30-40%)
      - Best estimate of "unique" static HTML pages comes from search engine claims
          - Until last year, Google claimed 8 billion (?), Yahoo claimed 20 billion
          - Google recently announced that their index contains 1 trillion pages
          - How to explain the discrepancy?

The web as a graph
  - Pages = nodes, hyperlinks = edges
      - Ignore content
      - Directed graph
  - High linkage
      - 10-20 links/page on average
      - Power-law degree distribution

Structure of Web graph
  - Let's take a closer look at structure
      - Broder et al (2000) studied a crawl of 200M pages and other smaller crawls
      - Bow-tie structure
      - Not a "small world"

Bow-tie Structure
  [Figure. Source: Broder et al, 2000]

What can the graph tell us?
  - Distinguish "important" pages from unimportant ones
      - Pagerank
  - Discover communities of related pages
      - Hubs and Authorities
  - Detect web spam
      - Trustrank

Power-law degree distribution
  [Figure. Source: Broder et al, 2000]

Power-laws galore
  - Structure
      - In-degrees
      - Out-degrees
      - Number of pages per site
  - Usage patterns
      - Number of visitors
  - Popularity
      - e.g., products, movies, music

The Long Tail
  [Figure. Source: Chris Anderson (2004)]

The Long Tail
  - Shelf space is a scarce commodity for traditional retailers
      - Also: TV networks, movie theaters, …
  - The web enables near-zero-cost dissemination of information about products
  - More choice necessitates better filters
      - Recommendation engines (e.g., Amazon)
      - How Into Thin Air made Touching the Void a bestseller

Extracting Structured Data

Ads vs. search results

Search advertising is the revenue model
  - Multi-billion-dollar industry
  - Advertisers pay for clicks on their ads
  - Interesting problems
      - What ads to show for a search?
      - If I'm an advertiser, which search terms should I bid on and how much to bid?

Two Approaches to Analyzing Data
  - Machine Learning approach
      - Emphasizes sophisticated algorithms, e.g., Support Vector Machines
      - Data sets tend to be small, fit in memory
  - Data Mining approach
      - Emphasizes big data sets (e.g., in the terabytes)
      - Data cannot even fit on a single disk!
      - Necessarily leads to simpler algorithms

Philosophy
  - In many cases, adding more data leads to better results than improving algorithms
      - Netflix
      - Google search
      - Google ads
  - More on my blog: Datawocky (datawocky.com)

Systems architecture
  [Diagram: single node with CPU, Memory, Disk. Labels: Machine Learning, Statistics; "Classical" Data Mining]

Very Large-Scale Data Mining
  [Diagram: cluster of commodity nodes, each with CPU, Mem, Disk, …]

Systems Issues
  - Web data sets can be very large
      - Tens to hundreds of terabytes
  - Cannot mine on a single server!
      - Need large farms of servers
  - How to organize hardware/software to mine multi-terabyte data sets
      - Without breaking the bank!

Project
  - Lots of interesting project ideas
      - If you can't think of one please come discuss with us
  - Infrastructure
      - Aster Data cluster on Amazon EC2
      - Supports both MapReduce and SQL
  - Data
      - Netflix
      - ShareThis
      - Google
      - WebBase
      - TREC
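
The slides treat the web as a directed graph (pages = nodes, hyperlinks = edges) and name in-degree distributions and Pagerank as things the graph can yield. The following is a minimal sketch of both ideas, not the lecture's own code. The toy graph `links`, the function names, the damping factor 0.85, and the fixed iteration count are all assumptions chosen for illustration. The sketch also assumes every link target appears as a key in the graph.

```python
from collections import Counter

# Hypothetical toy web graph (not data from the lecture):
# each page maps to the list of pages it links to.
links = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}

def in_degrees(graph):
    """Count incoming hyperlinks for every page in the graph."""
    counts = {page: 0 for page in graph}
    for targets in graph.values():
        for target in targets:
            counts[target] += 1
    return counts

def degree_histogram(degrees):
    """Map each degree value to the number of pages that have it.

    On a real crawl, plotting this on log-log axes is one way to
    look for the power-law shape the slides mention.
    """
    return dict(Counter(degrees.values()))

def pagerank(graph, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration (assumed parameters)."""
    pages = list(graph)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives a baseline share from random jumps.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, targets in graph.items():
            if targets:
                # Split this page's rank evenly across its out-links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no out-links spreads its rank over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

degrees = in_degrees(links)
histogram = degree_histogram(degrees)
ranks = pagerank(links)
```

In this toy graph page "c" has the most incoming links and also ends up with the highest rank, while "d", which nothing links to, keeps only the baseline share. At the scale the slides describe (tens to hundreds of terabytes) the same computation would have to be distributed across a cluster rather than held in one dictionary.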
