Machine Learning, Data Mining, and Knowledge Discovery: An Introduction
Gregory Piatetsky-Shapiro
KDnuggets

Course Outline
  Machine Learning
    input, representation, decision trees
  Weka
    machine learning workbench
  Data Mining
    associations, deviation detection, clustering, visualization
  Case Studies
    targeted marketing, genomic microarrays
  Data Mining, Privacy and Security
  Final Project: Microarray Data Mining Competition

Lesson Outline
  Introduction: Data Flood
  Data Mining Application Examples
  Data Mining & Knowledge Discovery
  Data Mining Tasks

Trends leading to Data Flood
  More data is generated:
    Bank, telecom, other business transactions ...
    Scientific data: astronomy, biology, etc.
    Web, text, and e-commerce

Big Data Examples
  Europe's Very Long Baseline Interferometry (VLBI) has 16 telescopes, each of which produces 1 Gigabit/second of astronomical data over a 25-day observation session
    storage and analysis are a big problem
  AT&T handles billions of calls per day
    so much data that it cannot all be stored -- analysis has to be done “on the fly”, on streaming data

Largest databases in 2003
  Commercial databases:
    Winter Corp. 2003 Survey: France Telecom has the largest decision-support DB, ~30 TB; AT&T ~26 TB
  Web
    Alexa internet archive: 7 years of data, 500 TB
    Google searches 4+ billion pages, many hundreds of TB
    IBM WebFountain, 160 TB (2003)
    Internet Archive, ~300 TB

From terabytes to exabytes to …
  UC Berkeley 2003 estimate: 5 exabytes (5 million terabytes) of new data was created in 2002.
  US produces ~40% of new stored data worldwide
  2006 estimate: 161 exabytes (IDC study)
  2010 projection: 988 exabytes

Largest Databases in 2005
  Winter Corp. 2005 Commercial Database Survey:
    1. Max Planck Inst. for Meteorology, 222 TB
    2. Yahoo ~100 TB (Largest Data Warehouse)
    3. AT&T ~94 TB

Data Growth Rate
  Twice as much information was created in 2002 as in 1999 (~30% annual growth rate)
  Other growth rate estimates are even higher
  Very little data will ever be looked at by a human
  Knowledge Discovery is NEEDED to make sense and use of data.

Lesson Outline
  Introduction: Data Flood
  Data Mining Application Examples
  Data Mining & Knowledge Discovery
  Data Mining Tasks

Machine Learning / Data Mining Application areas
  Science
    astronomy, bioinformatics, drug discovery, …
  Business
    CRM (Customer Relationship Management), fraud detection, e-commerce, manufacturing, sports/entertainment, telecom, targeted marketing, health care, …
  Web:
    search engines, advertising, web and text mining, …
  Government
    surveillance (?), crime detection, profiling tax cheaters, …

Application Areas
  What do you think are some of the most important and widespread business applications of Data Mining?

Data Mining for Customer Modeling
  Customer Tasks:
    attrition prediction
    targeted marketing: cross-sell, customer acquisition
    credit-risk
    fraud detection
  Industries
    banking, telecom, retail sales, …

Customer Attrition: Case Study
  Situation: Attrition rate for mobile phone customers is around 25-30% a year!
  With this in mind, what is our task?
    Assume we have customer information for the past N months.

Customer Attrition: Case Study
  Task:
    Predict who is likely to attrite next month.
    Estimate customer value and the most cost-effective offer to make to this customer.

Customer Attrition Results
  Verizon Wireless built a customer data warehouse
  Identified potential attriters
  Developed multiple, regional models
  Targeted customers with high propensity to accept the offer
  Reduced attrition rate from over 2%/month to under 1.5%/month (huge impact, with 30 M subscribers)
  (Reported in 2003)

Assessing Credit Risk: Case Study
  Situation: Person applies for a loan
  Task: Should a bank approve the loan?
  Note: People who have the best credit don’t need the loans, and people with the worst credit are not likely to repay. Bank’s best customers are in the middle

Credit Risk - Results
  Banks develop credit models using a variety of machine learning methods.
  Mortgage and credit card proliferation are the results of being able to successfully predict if a person is likely to default on a loan
  Widely deployed in many countries

e-commerce
  A person buys a book (product) at Amazon.com
  What is the task?

Successful e-commerce – Case Study
  Task: Recommend other books (products) this person is likely to buy
  Amazon does clustering based on books bought:
    customers who bought “Advances in Knowledge Discovery and Data Mining” also bought “Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations”
  Recommendation program is quite successful

Unsuccessful e-commerce case study (KDD-Cup 2000)
  Data: clickstream and purchase data from Gazelle.com, legwear and legcare e-tailer
  Q: Characterize visitors who spend more than $12 on an average order at the site
  Dataset of 3,465 purchases, 1,831 customers
  Very interesting analysis by Cup participants
    thousands of hours - $X,000,000 (Millions) of consulting
  Total sales -- $Y,000
  Obituary: Gazelle.com out of business, Aug 2000

Genomic Microarrays – Case Study
  Given microarray data for a number of samples (patients), can we
    Accurately diagnose the disease?
    Predict outcome for given treatment?
    Recommend best treatment?

Example: ALL/AML data
  38 training cases, 34 test, ~7,000 genes
  2 Classes: Acute Lymphoblastic Leukemia (ALL) vs Acute Myeloid Leukemia (AML)
  Use train data to build diagnostic model
  [Figure panels: ALL, AML]
  Results on test data: 33/34 correct, 1 error may be mislabeled

Security and Fraud Detection - Case