August 1, 2019
Data Mining: Concepts and Techniques

Data Warehousing and Data Mining

Data Mining: Concepts and Techniques
Chapter 1: Introduction
Jiawei Han and Micheline Kamber
Department of Computer Science
University of Illinois at Urbana-Champaign
www.cs.uiuc.edu/~hanj
©2006 Jiawei Han and Micheline Kamber. All rights reserved.

Chapter 1. Introduction
  Motivation: Why data mining?
  What is data mining?
  Data mining: On what kind of data?
  Data mining functionality
  Are all the “discovered” patterns interesting?
  Classification of data mining systems
  Major issues in data mining
  Overview of the course
  Supplementary lecture slides

Why Data Mining?
  The explosive growth of data: from terabytes to petabytes
    Data collection and data availability
      Automated data collection tools, database systems, the Web, computerized society
    Major sources of abundant data
      Business: Web, e-commerce, transactions, stocks, …
      Science: remote sensing, bioinformatics, scientific simulation, …
      Society and everyone: news, digital cameras, YouTube
  We are drowning in data, but starving for knowledge!
  “Necessity is the mother of invention”: data mining, the automated analysis of massive data sets

Evolution of Sciences
  Before 1600: empirical science
  1600-1950s: theoretical science
    Each discipline has grown a theoretical component. Theoretical models often motivate experiments and generalize our understanding.
  1950s-1990s: computational science
    Over the last 50 years, most disciplines have grown a third, computational branch (e.g. empirical, theoretical, and computational ecology, or physics, or linguistics).
    Computational science traditionally meant simulation. It grew out of our inability to find closed-form solutions for complex mathematical models.
  1990-now: data science
    The flood of data from new scientific instruments and simulations
    The ability to economically store and manage petabytes of data online
    The Internet and computing Grid that make all these archives universally accessible
    Scientific information management, acquisition, organization, query, and visualization tasks scale almost linearly with data volumes. Data mining is a major new challenge!
  Jim Gray and Alex Szalay, The World Wide Telescope: An Archetype for Online Science, Comm. ACM, 45(11):50-54, Nov. 2002

Evolution of Database Technology
  1960s:
    Data collection, database creation, IMS and network DBMS
  1970s:
    Relational data model, relational DBMS implementation
  1980s:
    RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
    Application-oriented DBMS (spatial, scientific, engineering, etc.)
  1990s:
    Data mining, data warehousing, multimedia databases, and Web databases
  2000s:
    Stream data management and mining
    Data mining and its applications
    Web technology (XML, data integration) and global information systems

Why Not Traditional Data Analysis?
  Tremendous amount of data
    Algorithms must be highly scalable to handle data volumes such as terabytes
  High dimensionality of data
    Microarray data may have tens of thousands of dimensions
  High complexity of data
    Data streams and sensor data
    Time-series data, temporal data, sequence data
    Structured data, graphs, social networks and multi-linked data
    Heterogeneous databases and legacy databases
    Spatial, spatiotemporal, multimedia, text and Web data
    Software programs, scientific simulations

Why Data Mining? Potential Applications
  Data analysis and decision support
    Market analysis and management
      Target marketing, customer relationship management (CRM), market basket analysis, cross selling, market segmentation
    Risk analysis and management
      Forecasting, customer retention, improved underwriting, quality control, competitive analysis
    Fraud detection and detection of unusual patterns (outliers)
  Other applications
    Text mining (newsgroups, email, documents) and Web mining
    Stream data mining
    Bioinformatics and bio-data analysis

What Is Data Mining?
  Data mining (knowledge discovery from data)
    Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data
    The non-trivial process of identifying valid, novel, potentially useful and ultimately understandable patterns in data
  Alternative names
    Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, etc.

Knowledge Discovery (KDD) Process
  Data mining: the core of the knowledge discovery process
  [Figure: KDD pipeline. Databases -> Data Cleaning and Data Integration -> Data Warehouse -> Selection -> Task-relevant Data -> Data Mining -> Pattern Evaluation]

KDD Process: Several Key Steps
  Learning the application domain
    Relevant prior knowledge and goals of the application
  Creating a target data set: data selection
  Data cleaning and preprocessing (may take 60% of the effort!)
  Data reduction and transformation
    Find useful features, dimensionality/variable reduction, invariant representation
  Choosing the functions of data mining
    Summarization, classification, regression, association, clustering
  Choosing the mining algorithm(s)
  Data mining: search for