Chapter 6: Gene Prediction and Gene Structure Analysis (I)
Bioinformatics

Genome sequencing strategies

Genome sequencing: QUICKER, SMALLER, CHEAPER
Nature Biotechnology 26, 1135-1145 (2008)
13 years, $3 billion
1 day, $1000 (2008)

Applications of sequencing
• identifying new genes
• looking at chromosome organization and structure
• finding gene regulatory sequences
• comparative genomics

Where are the Genes in the Genome?
GAGAAAATCAATTGGTTTAGAAGGTTTGGACTCACTTGACAGGTTCAGTTGGAGACGATCATAGGTGGCTGCTGTGACAAAGGGAAATTGTGCTTTTCCAGCATGCTTACTGACCCTGATTTACCTCAGGAGTTTGAAAGGATGTCTTCCAAGCGACCAGCCTCTCCGTATGGGGAAGCAGATGGAGAGGTAGCCATGGTGACAAGCAGACAGAAAGTGGAAGAAGAGGAGAGTGACGGGCTCCCAGCCTTTCACCTTCCCTTGCATGTGAGTTTTCCCAACAAGCCTCACTCTGAGGAATTTCAGCCAGTTTCTCTGCTGACGCAAGAGACTTGTGGCCATAGGACTCCCACTTCTCAGCACAATACAATGGAAGTTGATGGCAATAAAGTTATGTCTTCATTTGCCCCACACAACTCATCTACCTCACCTCAGAAGGCAGAAGAAGGTGGGCGACAGAGTGGCGAGTCCTTGTCTAGTACAGCCCTGGGAACTCCTGAACGGCGCAAGGGCAGTTTAGCTGATGTTGTTGACACCTTGAAGCAGAGGAAAATGGAAGAGCTCATCAAAAACGAGCCGGAAGAAACCCCCAGTATTGAAAAACTACTCTCAAAGGACTGGAAAGACAAGCTTCTTGCAATGGGATCGGGGAACTTTGGCGAAATAAAAGGGACTCCCGAGAGCTTAGCTGAGAAAGAAAGGCAACTCATGGGTATGATCAACCAGCTGACCAGCCTCCGAGAGCAGCTGTTGGCTGCCCACGATGAGCAGAAGAAACTAGCTGCCTCTCAGATTGAGAAACAGCGTCAGCAAATGGAGCTGGCCAAGCAGCAACAAGAACAAATTGCAAGACAGCAGCAGCAGCTTCTACAGCAACAACACAAAATCAATTTGCTCCAGCAACAGATCCAGGTTCAAGGTCAGCTGCCGCCATTAATGATTCCCGTATTCCCTCCTGATCAACGGACACTGGCTGCAGCTGCCCAGCAAGGATTCCTCCTCCCTCCAGGCTTCAGCTATAAGGCTGGATGTAGTGACCCTTACCCTGTTCAGCTGATCCCAACTACCATGGCAGCTGCTGCCGCAGCAACACCAGGCTTAGGCCCACTCCAACTGCAGCAGTTATATGCTGCCCAGCTAGCTGCAATGCAGGTATCTCCAGGAGGGAAGCTGCCAGGCATACCCCAAGGCAACCTTGGTGCTGCTGTATCTCCTACCAGCATTCACACAGACAAGAGCACAAACAGCCCACCACCCAAAAGCAAGGATGAAGTGGCACAGCCACTGAACCTATCAGCTAAACCCAAGACCTCTGATGGCAAATCACCCACATCACCCACCTCTCCCCATATGCCAGCTCTGAGAATAAACAGTGGGGCAGGCCCCCTCAAAGCCTCTGTCCCAGCAGCGTTAGCTAGTCCTTCAGCCAGAGTTAGCACAATAGGTTACTTAAATGACCATGATGCTGTCACCAAGGCAATCCAAGAAGCTCGGCAAATGAAGGAGCAACTCCGACGGGAACAACAGGTGCTTGATGGGAAGGTGGCTGTTGTGAATAGTCTGGGTCTCAATAACTGCCGAACAGAAAAGGAAAAAACAACACTGGAGAGTCTGACTCAGCAACTGGCAGTTAAACAGAATGAAGAAGGAAAATTTAGCCATGCAATGATGGATTTCAATCTGAGTGGAGATTCTGATGG
AAGTGCTGGAGTCTCAGAGTCAAGAATTTATAGGGAATCCCGAGGGCGTGGTAGCAATGAACCCCACATAAAGCGTCCAATGAATGCCTTCATGGTGTGGGCTAAAGATGAACGGAGAAAGATCCTTCAAGCCTTTCCTGACATGCACAACTCCAACATCAGCAAGATATTGGGATCTCGCTGGAAAGCTATGACAAACCTAGAGAAACAGCCATATTATGAGGAGCAAGCCCGTCTCAGCAAGCAGCACCTGGAGAAGTACCCTGACTATAAGTACAAGCCCAGGCCAAAGCGCACCTGCCTGGTGGATGGCAAAAAGCTGCGCATTGGTGAATACAAGGCAATCATGCGCAACAGGCGGCAGGAAATGCGGCAGTACTTCAATGTTGGGCAACAAGCACAGATCCCCATTGCCACTGCTGGTGTTGTGTACCCTGGAGCCATCGCCATGGCTGGGATGCCCTCCCCTCACCTGCCCTCGGAGCACTCAAGCGTGTCTAGCAGCCCAGAGCCTGGGATGCCTGTTATCCAGAGCACTTACGGTGTGAAAGGAGAGGAGCCACATATCAAAGAAGAGATACAGGCCGAGGACATCAATGGAGAAATTTATGATGAGTACGACGAGGAAGAGGATGATCCAGATGTAGATTATGGGAGTGACAGTGAAAACCATATTGCAG
Gene

Genes (i.e., protein coding)
But... only 2% of the human genome encodes proteins
Other than protein coding genes, what is there?
• genes for noncoding RNAs (rRNA, tRNA, miRNAs, etc.)
• structural sequences (scaffold attachment regions)
• regulatory sequences
• non-functional “junk”?
It’s still uncertain/controversial how much of the genome is composed of any of these classes
The answers will come from experimentation and bioinformatics.

Complexity of genome
Published by AAAS
Science 306, 636-640 (2004)

The ENCODE Project: ENCyclopedia Of DNA Elements
– Protein coding genes.
  • In long open reading frames
  • ORFs interrupted by introns in eukaryotes
  • Take up most of the genome in prokaryotes, but only a small portion of the eukaryotic genome
– RNA-only genes
  • Transfer RNA, ribosomal RNA, snoRNAs (guide ribosomal and transfer RNA maturation), intron splicing, guiding mRNAs to the membrane for translation, gene regulation (this is a growing list)
– Gene control sequences
  • Promoters
  • Regulatory elements
– Transposable elements, both active and defective
  • DNA transposons and retrotransposons
  • Many types and sizes
– Repeated sequences.
  • Centromeres and telomeres
  • Many with unknown (or no) function
– Unique sequences that have no obvious function
• As a general rule, each part of a genomic sequence has only one function: protein-coding gene, RNA gene, control signal, transposable element, repeat sequence, or maybe no function at all. Most sequence elements overlap only slightly, if at all.

What’s in a genome?
protein-coding genes, non–protein-coding genes
• easier to find than other functional elements
• why?
• genes are transcribed, which means that we can identify them by looking at RNA
• traditionally this has been done by cDNA or EST sequencing, more recently by microarray, SAGE, MPSS, etc.

protein-coding genes have recognizable features
1. open reading frames (ORFs)
2. codon bias
3. known transcription and translational start and stop motifs (promoters, 3’ poly-A sites)
4. splice consensus sequences at intron-exon boundaries

Finding protein-coding genes
[Figure: gene structure model over the sequence alphabet A, T, G, C. Labels: begin gene region, start translation, donor splice site, acceptor splice site, stop translation, end gene region; single exon, initial exon, exon, final exon; 5’ UTR, 3’ UTR, intron]

Finding non–protein-coding genes
• e.g., tRNA, rRNA, snoRNA, miRNA, various other ncRNAs
• Harder to find than protein-coding genes
• Why?
• often not poly-A tailed, so they don’t end up in cDNA libraries
• no ORF
• constraint on sequence divergence at nucleotid