Chapter 12 Data Mining Technology and its Application

12.1 Data Mining Technology
12.2 Basic Research
12.3 Application Research

12.1 Data Mining Technology
12.1.1 Evolution of Data Mining
12.1.2 What is data mining?
12.1.3 Data mining functions

12.1.1 Evolution of Data Mining
• 1960s: Data collection, database creation, IMS and network DBMS
• 1970s: Relational data model, relational DBMS implementation
• 1980s: RDBMS, advanced data models (extended-relational, OO, deductive, etc.); application-oriented DBMS (spatial, scientific, engineering, etc.)
• 1990s:
– Data mining, data warehousing, multimedia databases, and Web databases
• 2000s:
– Stream data management and mining
– Data mining and its applications
– Web technology (XML, data integration) and global information systems

12.1.2 What Is Data Mining? (1) Definition
• Data mining is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns from huge volumes of data.

12.1.2 What Is Data Mining? (2) Knowledge Discovery (KDD) Process
– Data mining: core of the knowledge discovery process
[Figure: KDD process. Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]

12.1.2 What Is Data Mining? (3) Data Mining and Business Intelligence
[Figure: pyramid with increasing potential to support business decisions toward the top. Layers, top to bottom:]
• Decision Making
• Data Presentation: Visualization Techniques
• Data Mining: Information Discovery
• Data Exploration: Statistical Summary, Querying, and Reporting
• Data Preprocessing/Integration, Data Warehouses
• Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems
[Roles shown beside the pyramid, top to bottom: End User, Business Analyst, Data Analyst, DBA]

12.1.2 What Is Data Mining? (4) Data Mining: Confluence of Multiple Disciplines
[Figure: Data Mining at the center, surrounded by Database Technology, Statistics, Machine Learning, Pattern Recognition, Algorithm, Visualization, and Other Disciplines]

12.1.3 Data Mining Functions
• Clustering, outlier, association, and correlation analysis
• Classification, prediction
• Trend, evolution analysis
• Characterization, discrimination
• Etc.

12.1.3 Data Mining Functions (1) Cluster
• Cluster analysis
– Unsupervised learning (i.e., the class label is unknown)
– Group data to form new categories (i.e., clusters), e.g., cluster houses to find distribution patterns
– Principle: maximizing intra-class similarity and minimizing inter-class similarity
– Many methods and applications
• Goal: find groups of objects such that the objects in a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups
[Figure: inter-cluster distances are maximized; intra-cluster distances are minimized]

12.1.3 Data Mining Functions (2) Outlier Analysis
• Outlier analysis
– Outlier: a data object that does not comply with the general behavior of the data
– Noise or exception? One person's garbage could be another person's treasure
– Methods: by-product of clustering or regression analysis, …
– Useful in fraud detection, rare events analysis
• Outliers are data objects whose characteristics differ considerably from those of most other data objects in the data set

12.1.3 Data Mining Functions (3) Association and Correlation Analysis
• Frequent patterns (or frequent itemsets)
– What items are frequently purchased together in your Walmart?
• Association, correlation vs. causality
– A typical association rule
– Are strongly associated items also strongly correlated?
• How to mine such patterns and rules efficiently in large datasets?
• How to use such patterns for classification, clustering, and other applications?

Diaper → Beer [0.5%, 75%] (support, confidence)

TID   Items
1     Bread, Coke, Milk
2     Beer, Bread
3     Beer, Coke, Diaper, Milk
4     Beer, Bread, Diaper, Milk
5     Coke, Diaper, Milk

Rules Discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}

In this table, {Milk} --> {Coke} has support 3/5 = 60% and confidence 3/4 = 75%; {Diaper, Milk} --> {Beer} has support 2/5 = 40% and confidence 2/3 ≈ 67%.

12.1.3 Data Mining Functions (4) Classification and Prediction
• Classification and prediction
– Construct models (functions) based on some training examples
– Describe and distinguish classes or concepts for future prediction
  • E.g., classify countries based on (climate), or classify cars based on (gas mileage)
– Predict some unknown or missing numerical values
• Typical methods
– Decision trees, support vector machines, neural networks, rule-based classification, pattern-based classification, logistic regression, …
• Typical applications:
– Credit card fraud detection, direct marketing, classifying stars, diseases, web-pages, …

12.1.3 Data Mining Functions (5) Trend and Evolution Analysis
• Sequence, trend and evolution analysis
– Trend and deviation analysis: e.g., regression
– Sequential pattern mining
  • e.g., first buy a digital camera, then large SD memory cards
– Periodicity analysis
– Motifs, time-series, and biological sequence analysis
  • Approximate and consecutive motifs
– Similarity-based analysis
• Mining data streams
– Ordered, time-varying, potentially infinite data streams

12.2 Basic Research
12.2.1 Data Preprocessing
12.2.2 Developing Algorithms
12.2.3 Limitations of Clustering Algorithms

12.2.1 Data Preprocessing
• Data cleaning
• Data integration and transformation
• Data reduction
• Attributes Weighting
• …

12.2.1 Data Preprocessing (paper 2-1) Attributes Weighting
• Data pre-processing is important for successful data mining because it makes the data more amenable to the data mining process.
• Attributes weighting (feature weighting) is one data pre-processing method; it is an alternative to keeping or eliminating features when applying data mining techniques.

12.2.1 Data Preprocessing (1) Attributes Weighting
cluster without using weighted attributes
clus