REVIEWS

Machine learning applications in genetics and genomics

Maxwell W. Libbrecht¹ and William Stafford Noble¹,²

Abstract | The field of machine learning, which aims to develop computer algorithms that improve with experience, holds promise to enable computers to assist humans in the analysis of large, complex data sets. Here, we provide an overview of machine learning applications for the analysis of genome sequencing data sets, including the annotation of sequence elements and epigenetic, proteomic or metabolomic data. We present considerations and recurrent challenges in the application of supervised, semi-supervised and unsupervised machine learning methods, as well as of generative and discriminative modelling approaches. We provide general guidelines to assist in the selection of these machine learning methods and their practical application for the analysis of genetic and genomic data sets.

¹Department of Computer Science and Engineering, University of Washington, 185 Stevens Way, Seattle, Washington 98195–2350, USA.
²Department of Genome Sciences, University of Washington, 3720 15th Ave NE Seattle, Washington 98195–5065, USA.
Correspondence to W.S.N.
e-mail: william-noble@uw.edu
doi:10.1038/nrg3920
Published online 7 May 2015

NATURE REVIEWS | GENETICS  VOLUME 16 | JUNE 2015 | 321
© 2015 Macmillan Publishers Limited. All rights reserved

Machine learning
A field concerned with the development and application of computer algorithms that improve with experience.

Artificial intelligence
A field concerned with the development of computer algorithms that replicate human skills, including learning, visual perception and natural language understanding.

Heterogeneous data sets
A collection of data sets from multiple sources or experimental methodologies. Artefactual differences between data sets can confound analysis.

Likelihood
The probability of a data set given a particular model.

Label
The target of a prediction task. In classification, the label is discrete (for example, ‘expressed’ or ‘not expressed’); in regression, the label is of real value (for example, a gene expression value).

Examples
Data instances used in a machine learning task.

Supervised learning
Machine learning based on an algorithm that is trained on labelled examples and used to predict the label of unlabelled examples.

The field of machine learning is concerned with the development and application of computer algorithms that improve with experience¹. Machine learning methods have been applied to a broad range of areas within genetics and genomics. Machine learning is perhaps most useful for the interpretation of large genomic data sets and has been used to annotate a wide variety of genomic sequence elements. For example, machine learning methods can be used to ‘learn’ how to recognize the locations of transcription start sites (TSSs) in a genome sequence². Algorithms can similarly be trained to identify splice sites³, promoters⁴, enhancers⁵ or positioned nucleosomes⁶. In general, if one can compile a list of sequence elements of a given type, then a machine learning method can probably be trained to recognize those elements. Furthermore, models that each recognize an individual type of genomic element can be combined, along with (learned) logic about their relative locations, to build machine learning systems that are capable of annotating genes, including their untranslated regions (UTRs), introns and exons, along entire eukaryotic chromosomes⁷.

As well as learning to recognize patterns in DNA sequences, machine learning algorithms can use input data generated by other genomic assays: for example, microarray or RNA sequencing (RNA-seq) expression data; data from chromatin accessibility assays such as DNase I hypersensitive site sequencing (DNase-seq), micrococcal nuclease digestion followed by sequencing (MNase–seq) and formaldehyde-assisted isolation of regulatory elements followed by sequencing (FAIRE–seq); or chromatin immunoprecipitation followed by sequencing (ChIP–seq) data of histone modification or transcription factor binding. Gene expression data can be used to learn to distinguish between different disease phenotypes and, in the process, to identify potentially valuable disease biomarkers. Chromatin data can be used, for example, to annotate the genome in an unsupervised manner, thereby potentially enabling the identification of new classes of functional elements.

Machine learning applications have also been extensively used to assign functional annotations to genes. Such annotations most frequently take the form of Gene Ontology term assignments⁸. Input of predictive algorithms can be any one or more of a wide variety of data types, including the genomic sequence; gene expression profiles across various experimental conditions or phenotypes; protein–protein interaction data; synthetic lethality data; open chromatin data; and ChIP–seq data of histone modification or transcription factor binding. As an alternative to Gene Ontology term prediction, some predictors instead identify co-functional relationships, in which the machine learning method outputs a network in which genes are represented as nodes and an edge between two genes indicates that they have a common function⁹.

Finally, a wide variety of machine learning methods have been developed to help to understand the mechanisms underlying gene expression. Some techniques aim to predict the expression of a gene on the basis of the DNA sequence alone¹⁰, whereas others take into account ChIP–seq profiles of histone modification¹¹ or transcription factor binding¹² at the gene promoter region. More sophisticated methods attempt to jointly model the expression of all of the genes in a cell by training a network model¹³. Like a co-functional network, each node in a gene expressi