arXiv:q-bio/0405004v1 [q-bio.BM] 5 May 2004

Information Theory in Molecular Biology

Christoph Adami 1,2

1 Keck Graduate Institute of Applied Life Sciences, 535 Watson Drive, Claremont, CA 91711
2 Digital Life Laboratory 139-74, California Institute of Technology, Pasadena, CA 91125

Abstract

This article introduces the physics of information in the context of molecular biology and genomics. Entropy and information, the two central concepts of Shannon's theory of information and communication, are often confused with each other but play transparent roles when applied to statistical ensembles (i.e., identically prepared sets) of symbolic sequences. Such an approach can distinguish between entropy and information in genes, predict the secondary structure of ribozymes, and detect the covariation between residues in folded proteins. We also review applications to molecular sequence and structure analysis, and introduce new tools in the characterization of resistance mutations, and in drug design.

In a curious twist of history, the dawn of the age of genomics has both seen the rise of the science of bioinformatics as a tool to cope with the enormous amounts of data being generated daily, and the decline of the theory of information as applied to molecular biology. Hailed as a harbinger of a "new movement" (Quastler 1953) along with Cybernetics, the principles of information theory were thought to be applicable to the higher functions of living organisms, and able to analyze such functions as metabolism, growth, and differentiation (Quastler 1953). Today, the metaphors and the jargon of information theory are still widely used (Maynard Smith 1999a, 1999b), as opposed to the mathematical formalism, which is too often considered to be inapplicable to biological information. Clearly, looking back it appears that too much hope was laid upon this theory's relevance for biology. However, there was well-founded optimism that information theory ought to be able to address the complex issues associated with the storage of information in the genetic code, only to be repeatedly questioned and rebuked (see, e.g., Vincent 1994, Sarkar 1996).

In this article, I outline the concepts of entropy and information (as defined by Shannon) in the context of molecular biology. We shall see that not only are these terms well-defined and useful, they also coincide precisely with what we intuitively mean when we speak about information stored in genes, for example. I then present examples of applications of the theory to measuring the information content of biomolecules, the identification of polymorphisms, RNA and protein secondary structure prediction, the prediction and analysis of molecular interactions, and drug design.

1 Entropy and Information

Entropy and information are often used in conflicting manners in the literature. A precise understanding, both mathematical and intuitive, of the notion of information (and its relationship to entropy) is crucial for applications in molecular biology. Therefore, let us begin by outlining Shannon's original entropy concept (Shannon, 1948).

1.1 Shannon's Uncertainty Measure

Entropy in Shannon's theory (defined mathematically below) is a measure of uncertainty about the identity of objects in an ensemble. Thus, while "entropy" and "uncertainty" can be used interchangeably, they can never mean information. There is a simple relationship between the entropy concept in information theory and the Boltzmann-Gibbs entropy concept in thermodynamics, briefly pointed out below.

Shannon entropy or uncertainty is usually defined with respect to a particular observer. More precisely, the entropy of a system represents the amount of uncertainty one particular observer has about the state of this system. The simplest example of a system is a random variable, a mathematical object that can be thought of as an N-sided die that is uneven, i.e., the probability of it landing in any of its N states is not equal for all N states. For our purposes, we can conveniently think of a polymer of fixed length (fixed number of monomers), which can take on any one of N possible states, where each possible sequence corresponds to one possible state. Thus, for a sequence made of L monomers taken from an alphabet of size D, we would have N = D^L. The uncertainty we calculate below then describes the observer's uncertainty about the true identity of the molecule (among a very large number of identically prepared molecules: an ensemble), given that he only has a certain amount of probabilistic knowledge, as explained below.

This hypothetical molecule plays the role of a random variable if we are given its probability distribution: the set of probabilities
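As a minimal numerical illustration of the state count N = D^L (the function name and example values are ours, not the paper's): for DNA, the alphabet size is D = 4, so even a short sequence of L = 10 nucleotides already has over a million possible states.

```python
def num_states(alphabet_size: int, length: int) -> int:
    """Number of distinct sequences N = D**L for a polymer of
    `length` monomers drawn from an alphabet of `alphabet_size` letters."""
    return alphabet_size ** length

# DNA alphabet (D = 4), sequence of L = 10 nucleotides:
print(num_states(4, 10))  # 1048576
```

The exponential growth of N with L is what makes the ensemble view necessary: one cannot enumerate states, only reason about their probability distribution.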
p_1, ..., p_N to find it in its N possible states. Let us thus call our random variable (random molecule) "X", and give the names x_1, ..., x_N to its N states. If X will be found in state x_i with probability p_i, then the entropy H of X is given by Shannon's formula

    H(X) = -\sum_{i=1}^{N} p_i \log p_i .    (1)

I have not here specified the basis of the log to be taken in the above formula. Specifying it assigns units to the uncertainty. It is sometimes convenient to use the number of possible states of X as the base of the logarithm (in which case the entropy is between zero and one), in other cases base 2 is convenient (leading to an entropy in units "bits"). For biomolecular sequences, a convenient unit obtains by taking logarithms to the basis of the alphabet size, leading to an entropy whose units we shall call "mers". Then, the maximal entropy equals the length of the sequence in mers.

Let us examine Eq. (1) more closely. If measured in bits, a standard interpretation of H(X) as an uncertainty function connects it to the smallest number of "yes-no" questions necessary, on average, to identify the state of random variable X. Because this series of yes/no questions can be thought of as a description of the random variable, t
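Eq. (1) and the role of the logarithm's base can be sketched in a few lines of code (a hypothetical helper, not from the paper; the convention that terms with p_i = 0 contribute nothing follows from the limit p log p -> 0):

```python
import math

def shannon_entropy(probs, base=2.0):
    """H(X) = -sum_i p_i * log(p_i), Eq. (1); `base` fixes the units
    (base 2 -> bits; base D, the alphabet size -> "mers")."""
    return -sum(p * math.log(p, base) for p in probs if p > 0)

# A single nucleotide drawn uniformly from the DNA alphabet {A, C, G, T}:
uniform = [0.25, 0.25, 0.25, 0.25]
print(shannon_entropy(uniform, base=2))  # maximal entropy: 2 bits
print(shannon_entropy(uniform, base=4))  # the same uncertainty: 1 "mer"

# A fully determined state carries no uncertainty:
print(shannon_entropy([1.0]))  # 0
```

The second call shows the "mer" convention from the text: with the log taken to the base of the alphabet size D = 4, the maximal entropy of a single monomer is exactly 1, so the maximal entropy of a sequence equals its length in mers.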