CS345A Data Mining
Lecture 1: Introduction to Web Mining

What is Web Mining?
  - Discovering useful information from the World-Wide Web and its usage patterns

Web Mining vs. Data Mining
  - Structure (or lack of it)
    - Textual information and linkage structure
  - Scale
    - Data generated per day is comparable to the largest conventional data warehouses
  - Speed
    - Often need to react to evolving usage patterns in real time (e.g., merchandising)

Web Mining topics
  - Web graph analysis
  - Power Laws and The Long Tail
  - Structured data extraction
  - Web advertising
  - Systems Issues

Size of the Web
  - Number of pages
    - Technically, infinite
    - Much duplication (30-40%)
    - Best estimate of "unique" static HTML pages comes from search engine claims
      - Until last year, Google claimed 8 billion (?), Yahoo claimed 20 billion
      - Google recently announced that their index contains 1 trillion pages
      - How to explain the discrepancy?

The web as a graph
  - Pages = nodes, hyperlinks = edges
    - Ignore content
    - Directed graph
  - High linkage
    - 10-20 links/page on average
    - Power-law degree distribution

Structure of the Web graph
  - Let's take a closer look at structure
    - Broder et al. (2000) studied a crawl of 200M pages and other smaller crawls
    - Bow-tie structure
      - Not a "small world"

Bow-tie Structure
  [Figure. Source: Broder et al., 2000]

What can the graph tell us?
  - Distinguish "important" pages from unimportant ones
    - PageRank
  - Discover communities of related pages
    - Hubs and Authorities
  - Detect web spam
    - TrustRank

Power-law degree distribution
  [Figure. Source: Broder et al., 2000]

Power laws galore
  - Structure
    - In-degrees
    - Out-degrees
    - Number of pages per site
  - Usage patterns
    - Number of visitors
  - Popularity
    - e.g., products, movies, music

The Long Tail
  [Figure. Source: Chris Anderson (2004)]

The Long Tail
  - Shelf space is a scarce commodity for traditional retailers
    - Also: TV networks, movie theaters, …
  - The web enables near-zero-cost dissemination of information about products
  - More choice necessitates better filters
    - Recommendation engines (e.g., Amazon)
    - How Into Thin Air made Touching the Void a bestseller

Extracting Structured Data
  [Figure]

Ads vs. search results
  [Figure]

Ads vs. search results
  - Search advertising is the revenue model
    - Multi-billion-dollar industry
    - Advertisers pay for clicks on their ads
  - Interesting problems
    - What ads to show for a search?
    - If I'm an advertiser, which search terms should I bid on, and how much should I bid?

Two Approaches to Analyzing Data
  - Machine Learning approach
    - Emphasizes sophisticated algorithms, e.g., Support Vector Machines
    - Data sets tend to be small, fit in memory
  - Data Mining approach
    - Emphasizes big data sets (e.g., in the terabytes)
    - Data cannot even fit on a single disk!
    - Necessarily leads to simpler algorithms

Philosophy
  - In many cases, adding more data leads to better results than improving algorithms
    - Netflix
    - Google search
    - Google ads
  - More on my blog: Datawocky (datawocky.com)

Systems architecture
  [Diagram. Labels: a single node with CPU, Memory, and Disk; Machine Learning, Statistics; "Classical" Data Mining; Very Large-Scale Data Mining; a cluster of commodity nodes, each with CPU, Mem, and Disk]

Systems Issues
  - Web data sets can be very large
    - Tens to hundreds of terabytes
  - Cannot mine on a single server!
    - Need large farms of servers
  - How to organize hardware/software to mine multi-terabyte data sets
    - Without breaking the bank!

Project
  - Lots of interesting project ideas
    - If you can't think of one, please come discuss with us
  - Infrastructure
    - Aster Data cluster on Amazon EC2
    - Supports both MapReduce and SQL
  - Data
    - Netflix
    - ShareThis
    - Google
    - WebBase
    - TREC
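
Example: the web as a graph

The slides treat the web as a directed graph (pages = nodes, hyperlinks = edges) and name PageRank as one way to distinguish important pages from unimportant ones. The sketch below illustrates both ideas on a tiny graph. The four-page `toy_web`, the damping factor of 0.85, the fixed iteration count, and the rule of spreading a dangling page's rank evenly are all assumptions made for illustration; none of them come from the lecture.

```python
from collections import Counter


def in_degree_histogram(links):
    """Map each in-degree value to the number of pages that have it.

    `links` maps a page to the list of pages it links to.
    """
    in_deg = {page: 0 for page in links}
    for targets in links.values():
        for target in targets:
            in_deg[target] = in_deg.get(target, 0) + 1
    return Counter(in_deg.values())


def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank by power iteration (illustrative sketch only)."""
    pages = sorted(set(links) | {t for ts in links.values() for t in ts})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank equally among its out-links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumption: a page with no out-links spreads its rank evenly.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical toy graph: page -> pages it links to.
toy_web = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}

degree_counts = in_degree_histogram(toy_web)
ranks = pagerank(toy_web)

for page in sorted(ranks, key=ranks.get, reverse=True):
    print(f"{page}: {ranks[page]:.4f}")
```

In this toy graph, page `c` has the most in-links and ends up with the highest rank, while `d`, which nothing links to, keeps only the random-jump share. At web scale the same computation would have to be distributed across a cluster, which is the concern raised under Systems Issues.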