Data Mining: Theory & Algorithms

Mining? Warehousing?

Technology Advancement

The World of Data

Data Rich, Information Poor

Learning Resources
  International Conference on Data Mining
  International Conference on Data Engineering
  International Conference on Machine Learning
  International Joint Conference on Artificial Intelligence
  Pacific-Asia Conference on Knowledge Discovery and Data Mining
  ACM SIGKDD Conference on Knowledge Discovery and Data Mining

  Xindong Wu, Zhihua Zhou, Jiawei Han, Jian Pei, Qiang Yang, Chih-Jen Lin, Philip S. Yu, Changshui Zhang

Interdisciplinary
  Data Mining, Machine Learning, Pattern Recognition, Statistics, Artificial Intelligence

Ubiquitous
  Data Mining, Business Intelligence, Data Analytics, Big Data, Decision Support, Customer Relationship Management

Comprehensive Learning
  Class Teaching • Thinking • Discussion
  Reading Materials • Extension • Inspiration
  Practice • Techniques • Applications
  Learning ≠ Listening

Data
  Definition
    “Data are pieces of information that represent the qualitative or quantitative attributes of a variable or set of variables. Data are often viewed as the lowest level of abstraction from which information and knowledge are derived.”
  Data Types
    Continuous, Binary
    Discrete, String
    Symbolic
  Storage
    Physical
    Logical
  Major Issues
    Transformation
    Errors and Corruption

What is Big Data?
  “Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.” (Gartner)
  “Big data refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze.” (McKinsey & Company)

Big Data

Public Security

Health Care Application
  Effectiveness Research
  Personalized Medicine

Location Data: Urban Planning

Location Data: Mobile User

Location Data: Shopper

Retail Data: Targeted Marketing

Retail Data: Sentiment Analysis

Social Networks

Sports

Attractiveness Mining

Open Data
  Technically Open: available in a machine-readable standard format, which means it can be retrieved and meaningfully processed by a computer application.
  Legally Open: explicitly licensed in a way that permits commercial and non-commercial use without restrictions.

Where to find data?

Open Government Data

Data Mining
  People have been analysing and investigating data for centuries.
    Statistics
    Mean, Variance, Correlation, Distribution…
  In modern days, data are often far beyond human comprehension.
    Diversity, Volume, Dimensionality
  Definition
    Data Mining is the process of automatically extracting interesting and useful hidden patterns from usually massive, incomplete and noisy data.
  Not a fully automatic process
    Human interventions are often inevitable.
    Domain Knowledge
    Data Collection and Pre-processing
  Synonym: Knowledge Discovery

Is DM really important?
  “If you are looking for a career where your services will be in high demand, you should find something where you provide a scarce, complementary service to something that is getting ubiquitous and cheap. So what’s getting ubiquitous and cheap? Data. And what is complementary to data? Analysis. So my recommendation is to take lots of courses about how to manipulate and analyze data: databases, machine learning, econometrics, statistics, visualization, and so on.”
  An interview with Google Chief Economist Hal Varian from the New York Times

Business Intelligence

From Data To Intelligence
  [Figure labels: Decision Models, Data Mining, Preprocessing, Database; Decision Support, Knowledge, Information, Data; Data Integration & Analysis]

The Process of Data Mining

DM Techniques - Classification
  “Classification is a procedure in which individual items are placed into groups based on quantitative information on one or more characteristics (referred to as variables) and based on a training set of previously labeled items.”
  Given a training set $\{(x_1, y_1), \ldots, (x_n, y_n)\}$, produce a classifier (function) that maps any unknown object $x_i$ to its class label $y_i$.
  Algorithms
    Decision Trees
    K-Nearest Neighbours
    Neural Networks
    Support Vector Machines
  Applications
    Churn Prediction
    Medical Diagnosis

Classification Boundaries
  [Figure: axes X and Y; axes Income and Savings with Low Risk and High Risk regions; unknown points marked "?"]

Overfitting - Classification

Cross Validation
  [Figure labels: Data, Training Set, Test Set, Evaluation, Generated Models]

Confusion Matrix

                          Actual Value
                          Positive         Negative         Total
  Predicted   Positive    True Positive    False Positive   P'
  Value       Negative    False Negative   True Negative    N'
              Total       P                N

  TPR = TP / (TP + FN)
  TNR = TN / (TN + FP)
  Accuracy = (TP + TN) / (P + N)

Receiver Operating Characteristic
  [Figure annotations: Very small threshold, Very large threshold, Random guess]

Cost Sensitive Learning

Lift Analysis

DM Techniques - Clustering
  “Clustering is the assignment of a set of observations into subsets (called clusters) so that observations in the same cluster are similar in some sense.”
  Distance Metrics
    Euclidean Distance
    Manhattan Distance
    Mahalanobis Distance
  Algorithms
    K-Means
    Sequential Leader
    Affinity Propagation
  Applications
    Market Research
    Image Segmentation
    Social Network Analysis
  What is the difference between classification and clustering?

Hierarchical Clustering

DM Techniques - Association Rule

  Transaction ID   Milk   Bread   Butter   Beer
  1                1      1       0        0
  2                0      1       1        0
  3                0      0       0        1
  4                1      1       1        0
  5                0      1       0        0

  {Milk, Bread} ⇒ {Butter}

DM Techniques - Regression
  $Y = f(X, \beta)$
  $y = \beta_0 + \beta_1 x$
  $y = \beta_0 + \beta_1 x + \beta_2 x^2$
  $y = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k$
  $y = \dfrac{1}{1 + e^{-z}}, \quad z = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k$

Overfitting - Regression
  [Figure: y versus x]

Seeing is Knowing

Performance Dashboard

Data Preprocessing
  Real data are often surprisingly dirty.
  A Major Cha