Data Mining with R: learning by case studies

Luis Torgo
LIACC-FEP, University of Porto
R. Campo Alegre, 823 - 4150 Porto, Portugal
email: ltorgo@liacc.up.pt
~ltorgo

May 22, 2003

Preface

The main goal of this book is to introduce the reader to the use of R as a tool for performing data mining. R is a freely downloadable¹ language and environment for statistical computing and graphics. Its capabilities and the large set of available packages make this tool an excellent alternative to the existing (and expensive!) data mining tools.

One of the key issues in data mining is size. A typical data mining problem involves a large database from which one seeks to extract useful knowledge. In this book we will use MySQL as the core database management system. MySQL is also freely available² for several computer platforms. This means that you will be able to perform "serious" data mining without having to pay any money at all. Moreover, we hope to show you that this comes with no compromise in the quality of the obtained solutions. Expensive tools do not necessarily mean better tools! R together with MySQL form a pair that is very hard to beat, as long as you are willing to spend some time learning how to use them. We think that it is worthwhile, and we hope that you will be convinced as well by the end of this book.

The goal of this book is not to describe all facets of data mining processes. Many books exist that cover this area. Instead we propose to introduce the reader to the power of R and data mining by means of several case studies. Obviously, these case studies do not represent all possible data mining problems that one can face in the real world. Moreover, the solutions we describe cannot be taken as complete solutions. Our goal is rather to introduce the reader to the world of data mining using R through practical examples. As such, our analysis of the case studies aims to show examples of knowledge extraction using R, instead of presenting complete reports of data mining case studies. They should be taken as examples of possible paths in any data mining project and can be used as the basis for developing solutions for the reader's own data mining projects. Still, we have tried to cover a diverse set of problems posing different challenges in terms of size, type of data, goals of analysis and the tools that are necessary to carry out this analysis.

We do not assume any prior knowledge about R. Readers who are new to R and data mining should be able to follow the case studies. We have tried to make the different case studies self-contained in such a way that the reader can start anywhere in the document. Still, some basic R functionalities are introduced in the first, simpler, case studies, and are not repeated, which means that if you are new to R, then you should at least start with the first case studies. The first chapter provides a very short introduction to R basics, which may facilitate the understanding of the following chapters.

We also do not assume any familiarity with data mining or statistical techniques. Brief introductions to different modeling approaches are provided as they become necessary in the case studies. It is not an objective of this book to provide the reader with full information on the technical and theoretical details of these techniques. Our descriptions of these models are given to provide a basic understanding of their merits, drawbacks and analysis objectives. Other existing books should be considered if further theoretical insights are required. At the end of some sections we provide "Further readings" pointers for the readers interested in knowing more on the topics. In summary, our target readers are more users of data analysis tools than researchers or developers. Still, we hope the latter also find reading this book useful as a way of entering the "world" of R and data mining.

The book is accompanied by a set of freely available R source files that can be obtained at the book Web site³. These files include all the code used in the case studies. They facilitate the "do it yourself" philosophy followed in this document. We strongly recommend that readers install R and try the code as they read the book. All data used in the case studies is available at the book Web site as well.

¹ Download it from
³ ~ltorgo/DataMiningWithR/.

(DRAFT - May 22, 2003)

Contents

Preface
1 Introduction
  1.1 How to read this book?
  1.2 A short introduction to R
    1.2.1 Starting with R
    1.2.2 R objects
    1.2.3 Vectors
    1.2.4 Vectorization
    1.2.5 Factors
    1.2.6 Generating sequences
    1.2.7 Indexing
    1.2.8 Matrices and arrays
    1.2.9 Lists
    1.2.10 Data frames
    1.2.11 Some useful functions
    1.2.12 Creating new functions
    1.2.13 Managing your sessions
  1.3 A short introduction to MySQL
2 Predicting Algae Blooms
  2.1 Problem description and objectives
  2.2 Data Description
  2.3 Loading the data into R
  2.4 Data Visualization and Summarization
  2.5 Unknown values
    2.5.1 Removing the observations with unknown values
    2.5.2 Filling in the unknowns with the most frequent values
    2.5.3 Filling in the unknown values by exploring correlations
    2.5.4 Filling in the unknown values by exploring similarities between cases