Big Data and Data Mining Training Lecture Notes 1: An Introduction to Machine Learning, Data Mining, and Knowledge Discovery


Machine Learning, Data Mining, and Knowledge Discovery: An Introduction
Gregory Piatetsky-Shapiro, KDnuggets

Slide 2: Course Outline
- Machine Learning: input, representation, decision trees
- Weka: machine learning workbench
- Data Mining: associations, deviation detection, clustering, visualization
- Case Studies: targeted marketing, genomic microarrays
- Data Mining, Privacy and Security
- Final Project: Microarray Data Mining Competition

Slide 3: Lesson Outline
- Introduction: Data Flood
- Data Mining Application Examples
- Data Mining & Knowledge Discovery
- Data Mining Tasks

Slide 4: Trends leading to Data Flood
- More data is generated:
  - Bank, telecom, other business transactions ...
  - Scientific data: astronomy, biology, etc.
  - Web, text, and e-commerce

Slide 5: Big Data Examples
- Europe's Very Long Baseline Interferometry (VLBI) has 16 telescopes, each of which produces 1 Gigabit/second of astronomical data over a 25-day observation session
  - storage and analysis a big problem
- AT&T handles billions of calls per day
  - so much data, it cannot be all stored -- analysis has to be done "on the fly", on streaming data

Slide 6: Largest databases in 2003
- Commercial databases:
  - Winter Corp. 2003 Survey: France Telecom has largest decision-support DB, ~30 TB; AT&T ~26 TB
- Web:
  - Alexa internet archive: 7 years of data, 500 TB
  - Google searches 4+ billion pages, many hundreds of TB
  - IBM WebFountain, 160 TB (2003)
  - Internet Archive, ~300 TB

Slide 7: From terabytes to exabytes to ...
- UC Berkeley 2003 estimate: 5 exabytes (5 million terabytes) of new data was created in 2002.
- US produces ~40% of new stored data worldwide
- 2006 estimate: 161 exabytes (IDC study)
- 2010 projection: 988 exabytes

Slide 8: Largest Databases in 2005
Winter Corp. 2005 Commercial Database Survey:
1. Max Planck Inst. for Meteorology, 222 TB
2. Yahoo, ~100 TB (Largest Data Warehouse)
3. AT&T, ~94 TB

Slide 10: Data Growth Rate
- Twice as much information was created in 2002 as in 1999 (~30% growth rate)
- Other growth rate estimates even higher
- Very little data will ever be looked at by a human
- Knowledge Discovery is NEEDED to make sense and use of data.

Slide 11: Lesson Outline
- Introduction: Data Flood
- Data Mining Application Examples
- Data Mining & Knowledge Discovery
- Data Mining Tasks

Slide 12: Machine Learning / Data Mining Application areas
- Science: astronomy, bioinformatics, drug discovery, ...
- Business: CRM (Customer Relationship Management), fraud detection, e-commerce, manufacturing, sports/entertainment, telecom, targeted marketing, health care, ...
- Web: search engines, advertising, web and text mining, ...
- Government: surveillance (?), crime detection, profiling tax cheaters, ...

Slide 13: Application Areas
- What do you think are some of the most important and widespread business applications of Data Mining?

Slide 14: Data Mining for Customer Modeling
- Customer Tasks:
  - attrition prediction
  - targeted marketing: cross-sell, customer acquisition
  - credit-risk
  - fraud detection
- Industries: banking, telecom, retail sales, ...

Slide 15: Customer Attrition: Case Study
- Situation: Attrition rate for mobile phone customers is around 25-30% a year!
- With this in mind, what is our task?
  - Assume we have customer information for the past N months.

Slide 16: Customer Attrition: Case Study
- Task:
  - Predict who is likely to attrite next month.
  - Estimate customer value and what is the cost-effective offer to be made to this customer.

Slide 17: Customer Attrition Results
- Verizon Wireless built a customer data warehouse
- Identified potential attriters
- Developed multiple, regional models
- Targeted customers with high propensity to accept the offer
- Reduced attrition rate from over 2%/month to under 1.5%/month (huge impact, with 30 M subscribers)
- (Reported in 2003)

Slide 18: Assessing Credit Risk: Case Study
- Situation: Person applies for a loan
- Task: Should a bank approve the loan?
- Note: People who have the best credit don't need the loans, and people with worst credit are not likely to repay. Bank's best customers are in the middle

Slide 19: Credit Risk - Results
- Banks develop credit models using variety of machine learning methods.
- Mortgage and credit card proliferation are the results of being able to successfully predict if a person is likely to default on a loan
- Widely deployed in many countries

Slide 20: e-commerce
- A person buys a book (product) at Amazon.com
- What is the task?

Slide 21: Successful e-commerce - Case Study
- Task: Recommend other books (products) this person is likely to buy
- Amazon does clustering based on books bought:
  - customers who bought "Advances in Knowledge Discovery and Data Mining", also bought "Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations"
- Recommendation program is quite successful

Slide 22: Unsuccessful e-commerce case study (KDD-Cup 2000)
- Data: clickstream and purchase data from Gazelle.com, legwear and legcare e-tailer
- Q: Characterize visitors who spend more than $12 on an average order at the site
- Dataset of 3,465 purchases, 1,831 customers
- Very interesting analysis by Cup participants
  - thousands of hours - $X,000,000 (Millions) of consulting
- Total sales -- $Y,000
- Obituary: Gazelle.com out of business, Aug 2000

Slide 23: Genomic Microarrays - Case Study
Given microarray data for a number of samples (patients), can we
- Accurately diagnose the disease?
- Predict outcome for given treatment?
- Recommend best treatment?

Slide 24: Example: ALL/AML data
- 38 training cases, 34 test, ~7,000 genes
- 2 Classes: Acute Lymphoblastic Leukemia (ALL) vs Acute Myeloid Leukemia (AML)
- Use train data to build diagnostic model
- Results on test data: 33/34 correct, 1 error may be mislabeled

Slide 25: Security and Fraud Detection - Case
