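Gene Prediction
方林 (Fang Lin)
2006-03-01

Approaches to gene prediction
● Ab initio prediction
● Homology-based alignment
● Methods that combine ab initio prediction with homology-based alignment

Foundations of gene prediction
● Hidden Markov models of coding regions
● Start codons
● Stop codons
● Splice sites (acceptor, branch point, donor)
● Transcription start sites
● PolyA

Evaluating gene prediction software
● Current gene prediction accuracy is about 80% at the base level, about 45% at the exon level, and about 20% at the gene level
● False negatives (FN) and false positives (FP)
● Sensitivity (Sn) = TP/(TP+FN), where TP is the number of true positives
● Specificity (Sp) = TP/(TP+FP)

Types of gene prediction
● Prokaryotic gene prediction
● Eukaryotic gene prediction
● tRNA gene prediction
● microRNA prediction

Prokaryotic gene prediction
● Glimmer
● GeneMark
● FgeneSH
● GRAIL
● GeneFinder
● GetORF

GLIMMER
● Prokaryotic gene prediction software developed by TIGR
● Prediction accuracy is 97~98%, with FN around 1%
● Uses an interpolated Markov model
● Home page:

Using GLIMMER
● Training
    Extract the longest ORFs: long-orfs
    Extract the ORF sequences: extract
    Build the training parameters: build-icm
● Predict genes with the trained parameters: glimmer2

Format of the prediction results

    1    310    158  [-2 L=153 r=-1.296]  [ShadowedBy #23]
    2   1141    494  [-2 L=648 r=-1.345]  [ShadowedBy #3]
    3    431   2152  [+2 L=1722 r=-1.360] [Contains #2] [OlapWith #8 L=121 S=6] [DelayedBy #23 L=306]
    4   2152   2292  [+1 L=141 r=-1.291]  [ShadowedBy #8]

NCBI ORF Finder

ORF Finder results

Eukaryotic gene prediction
● Genscan
● BGF
● FgeneSH
● GeneMark
● GeneID
● GRAIL

GenScan
● Developed by C. Burge
● 27-state model
● Parameter sets are currently available for human, maize, and Arabidopsis
● Home page:
● Gives its best results on human genes

Using GenScan
● genscan paramfile seqfile [options]
● Options:
    -v: verbose output with extra explanatory text
    -cds: print the CDS sequences of the predicted genes
    -subopt n: also show exons whose score exceeds the cutoff n (minimum 0.01)
    -ps f s: write the results in PostScript format to file f with scale s

GenScan text results

    Gn.Ex Type S .Begin ...End .Len Fr Ph I/Ac Do/T CodRg P.... Tscr..
    ----- ---- - ------ ------ ---- -- -- ---- ---- ----- ----- ------
     1.01 Init +   1664   1774  111  1  0   94   83   212 0.997  21.33
     1.02 Intr +   2042   2220  179  1  2  104   66   408 0.997  40.12
     1.03 Intr +   2374   2533  160  1  1   89   94   302 0.999  32.08
     1.04 Term +   3231   3350  120  2  0  115   48   202 0.961  18.31

GenScan graphical results

BGF
● Developed in-house
● Parameter sets are currently available for rice, Drosophila, and silkworm
● Home page:
● Predicts rice genes better than other prediction software
● The new version shows a clear performance improvement

Using BGF
● ./bgf [options] paramfile seqfile1 seqfile2 ...
● Useful options:
    -e: show all exons

BGF home page

BGF results

    Gene# S Exon# Type  Start - End      ORF_S - ORF_E    Score  Len
    ==============================================================
    1     + 1     Term     53 - 2155        53 - 2155      8.96  2103
    1     +       PolA   6076--0.27
    2     +       Prom   6210--3.25
    2     + 1     Sngl   7290 - 9260      7290 - 9260      8.77  1971
    2     +       PolA  14191-0.48
    3     +       Prom  15397--3.85
    3     + 1     Init  15874 - 15984    15874 - 15984     5.51   111
    3     + 2     Intr  16252 - 16430    16252 - 16428     7.53   179
    3     + 3     Intr  16584 - 16743    16585 - 16743     7.77   160
    3     + 4     Intr  18207 - 18296    18207 - 18296     0.40    90

Using FgeneSH

FgeneSH prediction results (text)

FgeneSH prediction results (graphical)

GeneMark

GeneMark results (text)

GeneMark results (graphical)

TwinScan
● A C++ implementation of GenScan
● Gives equal weight to homology and ab initio prediction
● Home page:
● Parameters are currently available for Arabidopsis, human, nematode, and Cryptococcus
● Markedly better at the exon and gene levels

Using TwinScan
● Run the alignments with BLAST
● Extract the conserved-region information with the bundled conseq.pl
    conseq.pl [options] seqfile blastfile1 blastfile2 ...
● Run the gene prediction
    iscan [options] hmmfile seqfile [-c=conseqfile | -a=alignfile] [-e=estfile]

TwinScan prediction results

    # ../bin/iscan
    # Date: Fri Feb 24 15:45:10 2006
    # Twinscan version 3.0 build 20051110RB
    # Genome Parameters: ../parameters/human_iscan-9993-genes-09-13-2004.zhmm
    # Target Sequence: 21 dna:chromosome chromosome:NCBI35:21:44344133:44444133:1
    # Target Sequence Read... 100001 bp C+G = 55.7424%
    # This is the 1-th best path.
    # Score: 3972.07
    chr.fa  iscan  start_codon  1740  1742  .    +  0  gene_id chr.fa.001; transcript_id chr.fa.001.1;
    chr.fa  iscan  CDS          1740  2120  113  +  0  gene_id chr.fa.001; transcript_id chr.fa.001.1;
    chr.fa  iscan  CDS          2695  2866  117  +  0  gene_id chr.fa.001; transcript_id chr.fa.001.1;
    chr.fa  iscan  CDS          2955  3149  297  +  2  gene_id chr.fa.001; transcript_id chr.fa.001.1;
    chr.fa  iscan  CDS          3470  3693  380  +  2  gene_id chr.fa.001; transcript_id chr.fa.001.1;
    chr.fa  iscan  CDS          8371  8632  110  +  0  gene_id chr.fa.001; transcript_id chr.fa.001.1;

tRNAscan-SE
● Predicts tRNA genes in both eukaryotes and prokaryotes
● Prediction accuracy is about 99%
● Search speed is about 30 kb/s
● Home page:
● Similar programs include pol3scan and FAStRNA

Using tRNAscan-SE
● tRNAscan-SE [options] seqfile1 seqfile2 ...
● Other important options:
    -B or -P: predict bacterial tRNA genes
    -A: predict archaeal tRNA genes
    -O: predict mitochondrial or chloroplast tRNA genes
    -G: predict tRNA genes with the general model

tRNAscan-SE results

    Sequence    tRNA  Bounds             tRNA  Anti   Intron Bounds  Cove
    Name        #     Begin     End      Type  Codon  Begin  End     Score
    ----------------------------------------------------------------------
    NT_107187    1      82245    82317   Arg   ACG    0      0       74.06
    NT_107187    2     692014   692087   Arg   ACG    0      0       61.16
    NT_107187    3     694990   695062   Met   CAT    0      0       55.64
    NT_107187    4     919043   919114   Gln   CTG    0      0       66.11
    NT_107187    5    1664875  1664946   Cys   GCA    0      0       79.11
    NT_107187    6    1844681  1844751   Gly   GCC    0      0       71.56
    NT_107187    7    1315900  1315827   Val   AAC    0      0       76.95
    NT_107187    8    1314965  1314892   Val   AAC    0      0       64.75
    NT_107187    9    1314001  1313928   Val   AAC    0      0       75.06
    NT_107187   10    1285971  1285899   Ala   AGC    0      0       68.12
    NT_107187   11     967579   967506   Thr   AGT    0      0       76.42
    NT_107187   12     726035   725962   Glu   TTC    0      0       65.35

Predicting microRNA binding sites
● Predicts the potential microRNA binding sites on a given mRNA
● Already applied successfully to human, Drosophila, and zebrafish
● The program first performs a local alignment and then computes the thermodynamic stability of the binding
● The most mature software is miRanda. Home page:

Using miRanda and its results
● miranda queryseq targetseq [options]

    Forward: Score: 141.000000  Q:2 to 22  R:41379 to 41403  Align Len (24) (75.00%) (75.00%)

    Query: 3' gTCGAAAG---TT-TTACTAGAGTG 5'
               |||||||   || || || |||||
    Ref:   5' gAGCTTTCGGGAACAAGGAACTCAC 3'

    Energy: -21.010000 kCal/Mol

    Scores for this hit:
    embl|AJ550546|DME550546  NT_107187  141.00  -21.01  0.00  2  22  41379  41403  24  75.00%  75.00%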
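
The 75.00% reported for the miRanda hit above can be checked by hand: count the columns of the Query/Ref alignment that form a Watson-Crick pair and divide by the alignment length. The following is a minimal Python sketch of that count, not miRanda's own scoring code. The function name `count_pairs` is hypothetical, and two assumptions are made: lowercase columns (the leading `g`) lie outside the aligned region, and only A:T and G:C columns count as paired (this hit contains no wobble columns).

```python
# Watson-Crick partners for the DNA-style letters printed in the hit.
PAIRS = {("A", "T"), ("T", "A"), ("G", "C"), ("C", "G")}


def count_pairs(query_3to5, ref_5to3):
    """Return (paired columns, aligned columns) for two alignment rows.

    Assumption: columns containing a lowercase letter are outside the
    aligned region and are skipped. Gap columns ('-') count toward the
    alignment length but never toward the paired total.
    """
    if len(query_3to5) != len(ref_5to3):
        raise ValueError("alignment rows must have equal length")
    paired = 0
    aligned = 0
    for q, r in zip(query_3to5, ref_5to3):
        if q.islower() or r.islower():
            continue
        aligned += 1
        if (q, r) in PAIRS:
            paired += 1
    return paired, aligned


# The two rows exactly as printed in the hit (query 3'->5', reference 5'->3').
query = "gTCGAAAG---TT-TTACTAGAGTG"
ref = "gAGCTTTCGGGAACAAGGAACTCAC"

paired, aligned = count_pairs(query, ref)
print(f"{paired} of {aligned} columns paired = {100 * paired / aligned:.2f}%")
```

Under these assumptions the count gives 18 paired columns out of 24, which agrees with the Align Len (24) and 75.00% fields in the hit.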