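Next-Generation Sequencing (2012.9.20)

Resources
• Data sources:
• Data download methods: Aspera, FTP
• Data analysis discussion sites

Summary of processing tools
• (2010, a comprehensive summary)
• Genome assembly: SOAPdenovo, Velvet, CLC
• Aligning reads to a genome: BWA, bowtie, bowtie2, MUMmer, SOAP, MAQ, BioScope
• SNP analysis: samtools, GATK, SOAPsnp
• De novo transcriptome assembly: Trinity, Velvet/Oasis
• Data quality analysis: FASTX-Toolkit
• Visualization of NGS data: IGV, Savant, samtools, GBrowse

Output formats for NGS results
• SAM, BAM
• For example, a line in a SAM file produced by BWA (fields are tab-separated):
    HWUSI-EAS172:628C8:4:1:1138:2718  16  chr20  46264803  37  76M  *  0  0
    ACCCAAGTAAAGTAAGCAATCAGGATTCCAAGAGTCCTCTGGGCGTTTATTGCGACCAAAATCCAGTGGGGAGTTC
    ###?::??@=:??ABC?5:8(6:*0:42C-=??C:D:D:(:)1==@=:DDDBD=:?DD-DDD;@B;;6@@6=
    XT:A:U  NM:i:3  X0:i:1  X1:i:0  XM:i:3  XO:i:0  XG:i:0  MD:Z:1A42T24A6

SAM mandatory fields
• 1. QNAME: name of the read
• 2. FLAG: bitwise flag describing how the read mapped to the chromosome
• 3. RNAME: name of the chromosome (reference sequence)
• 4. START (POS in the SAM specification): first position of the alignment on the chromosome
• 5. MAPPING QUALITY (MAPQ): quality of the mapping
• 6. CIGAR: description of the alignment (operations such as H, S, M)
• 7. MRNM: name of the reference sequence the mate read mapped to
• 8. MPOS: start position of the mate read
• 9. ISIZE: distance between the outermost bases of the two reads (insert size)
• 10. SEQ: the query (read) sequence, written on the same strand as the reference genome
• 11. QUAL: base qualities of the read (ASCII value minus 33)

SAM FLAG
• 0x0001 = 1: the read is paired in sequencing
• 0x0002 = 2: the read is mapped in a proper pair
• 0x0004 = 4: the query sequence itself is unmapped
• 0x0008 = 8: the mate is unmapped
• 0x0010 = 16: strand of the query (set when the read maps to the reverse strand)
• 0x0020 = 32: strand of the mate
• 0x0040 = 64: the read is the first read in a pair
• 0x0080 = 128: the read is the second read in a pair
• 0x0100 = 256: the alignment is not primary
• 0x0200 = 512: QC failure
• 0x0400 = 1024: optical or PCR duplicate

SAM optional fields
• NM: edit distance, the number of bases that differ from the reference genome
• MD: mismatching positions/bases
• X0: number of best hits
• X1: number of suboptimal hits
• XN: number of ambiguous bases (N) in the reference
• XM: number of mismatches
• XO: number of gap opens
• XG: number of gap extensions
• XT: type (Unique / Repeat / N / Mate-sw)
• XA: alternative mapping positions

NGS data analysis workflow
• RNA-seq raw data (FASTQ format) → quality filtering (quality-filtering software) → filtered data
• RNA-seq with a reference genome: mapping (aligners: BWA, bowtie)
• RNA-seq without a reference genome: de novo assembly (assemblers: KisSplice, Velvet/Oasis, Trinity)
• Analysis of the mapping or assembly results: small RNA, SNP, INDEL, differential gene expression, gene boundary determination, and so on

Unpacking SRA-format data
• fastq-dump [option] input.sra
• -A/--accession: give the extracted file a new name
• --split-3: split paired-end sequencing data
• Order 1) fastq-dump --split-3 SRR427121.lite.sra

Read filter
• FASTX-Toolkit
• 1) $ fastx_quality_stats [-h] [-i INFILE] [-o OUTFILE]
• 2) $ fastq_quality_boxplot_graph.sh [-i INPUT.TXT] [-t TITLE] [-p] [-o OUTPUT]
• 3) $ fastx_trimmer [-h] [-f N] [-l N] [-z] [-v] [-i INFILE] [-o OUTFILE]
• 4) $ fastx_nucleotide_distribution_graph.sh [-p] [-i INPUT.TXT] [-o OUTPUT] [-t TITLE]
• 5) $ fastx_trimmer [-h] [-f N] [-l N] [-t N] [-m MINLEN] [-z] [-v] [-i INFILE] [-o OUTFILE]
• 6) $ fastq_quality_trimmer [-h] [-v] [-t N] [-l N] [-z] [-i INFILE] [-o OUTFILE]
• Typical use:
    fastx_quality_stats -i in.fastq -o out.stat
    fastq_quality_boxplot_graph.sh -i out.stat -o output -t title
[Figure: shotgun reads trim, with the trimmed site marked]

FASTX-Toolkit in practice
• Order 2) nohup fastx_quality_stats -i SRR427121_1.fastq -o SRR_1.stat -Q33 &
• Order 3.1) fastq_quality_boxplot_graph.sh -i SRR_1.stat -o SRR_1.png
[Figure: SRR_1.png]
• Order 3.2) fastx_nucleotide_distribution_graph.sh -i *stat -o SRR_1_nucleotide_distribution
[Figure: SRR_1_nucleotide_distribution.png]

FASTX-Toolkit result analysis
[Figure: output of fastq_quality_boxplot_graph.sh, with the trimmed site marked]

Trim fastq
• fastq_quality_trimmer -t 20 -l 15 -i SRR427121_1.fastq -o ecoli_1.fq -Q33
• fastq_quality_trimmer -t 20 -l 15 -i SRR427121_2.fastq -o ecoli_2.fq -Q33

Reference genome mapping: BWA
• 1) Build the index
    bwa index [-p prefix] [-a algoType] [-c] in.db.fasta
    -p: name of the index to build
    -a: algorithm used to build the index; "is" suits genomes smaller than 2 GB, "bwtsw" suits genomes larger than 10 MB
    -c: build a color-space index, for aligning SOLiD data
    bwa index -p Ecoli -a is NC_000913.fna
• 2) aln
    bwa aln [-n maxDiff] [-o maxGapO] [-e maxGapE] [-d nDelTail] [-i nIndelEnd] [-k maxSeedDiff] [-l seedLen] [-t nThrds] [-cRN] [-M misMsc] [-O gapOsc] [-E gapEsc] [-q trimQual] in.db.fasta in.query.fq > out.sai
    perl get_consesus_read.pl ecoli_1.fq ecoli_2.fq ecoli_1_trim.fq ecoli_2_trim.fq ecoli_trim.fq &
    bwa aln -t 10 ../../../chromosome/Ecoli ../ecoli_1_trim.fq -f ecoli_1_trim.sai &
    bwa aln -t 10 ../../../chromosome/Ecoli ../ecoli_2_trim.fq -f ecoli_2_trim.sai &
    bwa aln -t 10 ../../../chromosome/Ecoli ../ecoli_trim.fq -f ecoli_trim.sai
• 3) samse (single-end reads)
    bwa samse [-n maxOcc] in.db.fasta in.sai in.fq > out.sam
    bwa samse ../../../chromosome/Ecoli ecoli_trim.sai ../ecoli_trim.fq -f ecoli_trim.sam
• 4) sampe (paired-end reads)
    bwa sampe [-a maxInsSize] [-o maxOcc] [-n maxHitPaired] [-N maxHitDis] [-P] in.db.fasta in1.sai in2.sai in1.fq in2.fq > out.sam
    bwa sampe ../../../chromosome/Ecoli ecoli_1_trim.sai ecoli_2_trim.sai ../ecoli_1_trim.fq ../ecoli_2_trim.fq -f ecoli_trim_paired.sam

Reference genome mapping: bowtie
• Build the index
    bowtie-build [options]* reference_in ebwt_base
    -f: reference input files are FASTA
    -c: reference sequences are given on the command line
    -C/--color: build a color-space index (for SOLiD)
    bowtie-build -f NC_000913.fna ecoli
• Alignment
    bowtie [options]* ebwt {-1 m1 -2 m2 | --12 r | s} [hit]

Bowtie required file arguments
• bowtie [options]* ebwt {-1 m1 -2 m2 | --12 r | s} [hit]
    -1: comma-separated list of files with the first reads of each pair (seq1)
    -2: comma-separated list of files with the second reads of each pair (seq2)
    --12: tab-delimited file containing both reads of each pair
    s: files with single-end reads
    -q: input files are FASTQ
    -f: input files are FASTA
    -r: input files contain sequences only, one per line
    -C: color-space alignment
    -Q: quality-value file(s), used with -f and -C
    --Q1, --Q2: used in combination with -f, -1 and -C
    Quality encodings: --integer-quals, --solexa1.3-quals, --solexa-quals, --phred33-quals, --phred64-quals

Alignment options
• -v: maximum number of mismatches allowed
• -l: seed length; this affects speed, and the larger l is, the faster the alignment runs
• -n: number of mismatches allowed in the seed
• -I: minimum insert size allowed for paired reads
• -X: maximum insert size allowed for paired reads
• --fr: the default orientation, first read 5'→3' (forward) and second read 3'→5' (reverse)

Output options
• -k: maximum number of mapping positions reported per read
• -a: report all mapping positions
• -m: do not report reads that have more than m mapping positions
• -M: for reads that have more than M mapping positions, report one of them at random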
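
Whichever aligner is used, the result is a SAM file, so it helps to see how the mandatory fields, the FLAG bits, and the optional tags listed earlier fit together on one record. The following is a minimal Python sketch, not part of the original slides. The helper names (`decode_flag`, `parse_sam_line`) are hypothetical, and the field boundaries in `EXAMPLE` are an assumed reading of the BWA example line from these notes; the quality string is copied as it appears here and is not validated.

```python
# Bit meanings taken from the SAM FLAG table in these notes.
SAM_FLAGS = {
    0x0001: "read is paired in sequencing",
    0x0002: "read is mapped in a proper pair",
    0x0004: "query sequence itself is unmapped",
    0x0008: "mate is unmapped",
    0x0010: "query is on the reverse strand",
    0x0020: "mate is on the reverse strand",
    0x0040: "first read in a pair",
    0x0080: "second read in a pair",
    0x0100: "alignment is not primary",
    0x0200: "QC failure",
    0x0400: "optical or PCR duplicate",
}

# Column names as used in these notes for the 11 mandatory fields.
FIELDS = ["QNAME", "FLAG", "RNAME", "START", "MAPQ", "CIGAR",
          "MRNM", "MPOS", "ISIZE", "SEQ", "QUAL"]
INT_FIELDS = ("FLAG", "START", "MAPQ", "MPOS", "ISIZE")


def decode_flag(flag):
    """Return the description of every bit that is set in a SAM FLAG."""
    return [text for bit, text in SAM_FLAGS.items() if flag & bit]


def parse_sam_line(line):
    """Split one tab-separated SAM alignment line into a dictionary."""
    cols = line.rstrip("\n").split("\t")
    record = dict(zip(FIELDS, cols[:11]))
    for key in INT_FIELDS:
        record[key] = int(record[key])
    # Optional fields have the form TAG:TYPE:VALUE, e.g. NM:i:3.
    tags = {}
    for item in cols[11:]:
        tag, value_type, value = item.split(":", 2)
        tags[tag] = int(value) if value_type == "i" else value
    record["TAGS"] = tags
    return record


# Assumed field split of the BWA example line from these notes.
EXAMPLE = "\t".join([
    "HWUSI-EAS172:628C8:4:1:1138:2718", "16", "chr20", "46264803", "37",
    "76M", "*", "0", "0",
    "ACCCAAGTAAAGTAAGCAATCAGGATTCCAAGAGTCCTCTGGGCGTTTATTGCGACCAAAATCCAGTGGGGAGTTC",
    "###?::??@=:??ABC?5:8(6:*0:42C-=??C:D:D:(:)1==@=:DDDBD=:?DD-DDD;@B;;6@@6=",
    "XT:A:U", "NM:i:3", "X0:i:1", "X1:i:0", "XM:i:3", "XO:i:0", "XG:i:0",
    "MD:Z:1A42T24A6",
])

if __name__ == "__main__":
    rec = parse_sam_line(EXAMPLE)
    print(rec["QNAME"], rec["RNAME"], rec["START"], rec["CIGAR"])
    print("FLAG", rec["FLAG"], "->", decode_flag(rec["FLAG"]))
    print("tags:", rec["TAGS"])
```

Under that assumed split, the FLAG value 16 has only the 0x0010 bit set, meaning a single-end read aligned to the reverse strand, and the tags XT:A:U and X0:i:1 mark it as a unique hit. For real files a dedicated tool such as samtools is the safer choice; this sketch only illustrates the record layout.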