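Section 6: NGS Data Formats and Processing Workflow

Basic chemistry of Illumina sequencing
1. Library preparation
2. Attachment
3. Bridge formation
4. Double-strand synthesis
5. Cluster generation
6. Add fluorescently labeled base 1, identify base 1
7. Add fluorescently labeled base 2, identify base 2
8. Repeat in the same way for the following bases
9. Infer the sequence (base calling)
10. Map to the genome

Paired-end library preparation for DNA samples
[Figure: paired-end reads mapped to a reference genome. Labels: 400-bp fragment with a 100-bp plus-strand read and a 100-bp minus-strand read; 5' read and 3' read; concordant pair; discordant pairs (10 Kb apart, or on Chr1 and Chr2); single-end lib, paired-end lib, 2-Kb long paired-end lib.]

Paired-end library preparation for mRNA samples
mRNA is reverse-transcribed into cDNA.

Multiplexing and Barcode (Illumina)
A method for sequencing dozens of samples in a single lane.
• A short, sample-specific sequence is ligated to the end of every DNA fragment of a sample and serves as that sample's identifier.
• All samples are then pooled and sequenced together.
• After sequencing, the short sequence tag on each read is used to determine which sample the read came from.
• 96 samples/lane maximum.

What to consider when planning an NGS experiment
1. Platform. Common questions: Illumina HiSeq 2000, HiSeq 2500, or HiSeq X Ten? How many reads? Library preparation cost? Cost per Gb? Bioinformatics analysis cost?
2. Coverage: how many reads per DNA or RNA sample?
3. Barcoding: how many barcoded samples?

How to calculate the sequencing coverage suitable for your DNA/RNA sample
Example: the genome size is 200 Mb and the expected coverage is 20-fold (20x).
• One HiSeq 2500 lane can produce 200 million 100-nt paired-end reads, so the total number of bases per lane is: 200 million x 100 nt x 2 ends = 40 Gb.
• Bases needed for one sample: 200 Mb x 20 = 4000 Mb (4 Gb).
• 40 Gb / 4 Gb = 10 barcoded samples in one lane.
• Note, however, that samples with different barcodes do not yield equal numbers of reads.

Illumina paired-end and single-end sequencing
• Paired-end sequencing is very important in genome re-sequencing, de novo genome or de novo transcriptome sequencing, and epigenetic sequencing (ChIP-Seq, DNA methylation, TF binding). Sequencing companies now use paired-end sequencing almost exclusively.
• The distance between a pair of forward and reverse reads is 200 to 500 bp (the size of the cDNA or DNA fragments in the library).
• De novo is Latin for "from the beginning". For a species that has no genome sequence and has never been studied, we speak of de novo sequencing and de novo assembly. De novo assembly is computationally very expensive and also less accurate.
• Resequencing, in contrast to de novo, means that a reference sequence for the species has already been completed, for example maize B73. When other lines, subspecies, or mutants are sequenced, they can be assembled against the reference sequence (reference-guided assembly). The main benefits are fewer assembly errors and a lower computational cost.

Illumina NGS data file types
The final output of Illumina sequencing is a file with the suffix fastq or fq.
1. A fastq file contains both the read sequences (fasta) and the sequencing quality of each base (q).
2. Single-end sequencing produces one fastq file.
3. Paired-end sequencing produces a pair of fastq files, labeled R1 and R2.
4. One fastq file (or one pair) generally corresponds to one sequenced sample.

A script to decompress and combine files into one
R1: fastq for one end
R2: fastq for the other end
One original fastq file (10 Gb) is split and compressed into small files (400 Mb). Thus, you need to decompress and merge them before using them.

    nedit unzip_combine.sh

    #!/bin/bash
    # Append each decompressed R1 chunk to one combined file
    ls *R1*.fastq.gz | while read R1; do
        gunzip -c "$R1" >> combined.R1.fastq
    done
    # Do the same for the R2 chunks
    ls *R2*.fastq.gz | while read R2; do
        gunzip -c "$R2" >> combined.R2.fastq
    done

    chmod +x unzip_combine.sh
    ./unzip_combine.sh &
    ls -l *.fastq

Because the script appends to the combined files, you must delete the combined fastq files before rerunning unzip_combine.sh on the same dataset!

What does a fastq file look like?
[Figure: directory listings of fastq files. Annotations: reads longer than 18 nt are merged from split files; barcode primer GCCAAT has not been removed; R1 and R2 indicate paired-end sequencing; 001~005 indicate one complete large file stored as split volumes; barcode has been removed (trimmed); 150 Gb from 2 lanes; 8.5 Gb.]

    cat R1_001.fq ... R1_005.fq > R1.fq
    cat R2_001.fq ... R2_005.fq > R2.fq

• The R1 and R2 files must be merged separately.
• Single (unpaired) reads may be problematic and are generally not used.

Compression and decompression programs

tar
Decompress a compressed .tar.gz file:

    tar -xzf WLH5-D172_lane1.fq.tar.gz

If the extracted file name looks like this:

    080702_I361_FC307AWAAXX_L1_ARAdxwAEMDWA.fq

the original name is long and meaningless, so use the mv command to change it to a recognizable, meaningful name:

    mv 080702_I361_FC307AWAAXX_L1_ARAdxwAEMDWA.fq sample_lane_date.fastq

tar options:
    -f  use an archive file
    -z  invoke gzip
    -c  create an archive
    -x  extract an archive

Compress and decompress files with gzip and gunzip
Compress:

    gzip -9 A_19.fq &    (8.5 Gb)

• -9 gives the best compression (be patient, about 1 hour per file).
• The original .fq file will be removed.
• To check whether zipping is done, run "ps" to check the process.
• After mapping is done, compress the *.fastq files. This produces .gz files that can be used for backup.

Decompress:

    gunzip A_19.fq.gz &    (3 Gb)

Downloading public NGS data from the SRA (short read archive) database
• *.tar, *.tar.gz, or *.gz files are usually processed files, such as gene expression levels. Decompress them first:

    tar xvzf file.tar.gz

• Example record: SRX016120, GSM489073, Dna_Nipp_9311, DNAmethyl_McrBCSeq.
• If you download raw data from GEO's short read archive (SRA), you need sraToFastq_folder_converter.pl to convert the format.

Raw data file formats of 454, Illumina and SOLiD
The formats can be recognized by the file suffix (extension).
• Illumina: *.fastq or *.fq (one file per lane or barcoded sample)
    ZmB73_6DAP_RNA.fastq (10 Gb)
• SOLiD: a pair of *.csfasta and *.qual (per lane or per sample)
    solid309_20100923_FRAG_BC_yadegari_F3_6DAP.csfasta (6.2 Gb)
    solid309_20100923_FRAG_BC_yadegari_F3_6DAP.qual (14 Gb)
    solid309_20100923_FRAG_BC_yadegari_F3_6DAP.stats (78 kb)
• 454: a pair of *.fasta and *.qual (per sample)
    CFGU.fasta (200 Mb)
    CFGU.qual
Because SOLiD and 454 have both been withdrawn from the market, these two file types are not covered here.

The content of an Illumina fastq file
Each read is described by four lines:
1. Read ID
2. Read sequence (76 nt in this example)
3. A separator line beginning with "+"
4. Quality for each nucleotide
An error probability of P = 0.001 corresponds to Q = 30, and P = 0.0001 corresponds to Q = 40. The higher the Q, the higher the accuracy.

    more combined.R1.fastq     to view the content of the file
    wc -l combined.R1.fastq    to count the lines; the line count divided by 4 is the number of reads in your sample

NGS data quality check
Being able to read the quality report is very important. It helps you understand where problems arose during library preparation and sequencing, and helps you discuss with the sequencing company whether re-sequencing is needed.

FastQC
• A quality control application for high-throughput sequence data.
• Developed in Java; runs on Windows, Linux, and Mac.
• Supported formats: fastq, sam, bam.
• Produces a report in web page format: fastqc_report.html.

Per-base quality
• A boxplot shows the overall quality of the bases at each position.
• Sequencing quality generally declines from the 5' end to the 3' end.
• Q = -10 x log10(P): P = 0.001 gives Q = 30; P = 0.01 gives Q = 20.
[Figure: distribution of quality relative to Q30, good example.]
[Figure: distribution of quality relative to Q30, bad example.]

Per-read quality distribution
[Figure: good and bad examples; the bad example contains problematic low-quality reads.]

GC content distribution
The smoother the GC content distribution, the better the sequencing quality.
[Figure: bad and good examples.]

Uncalled bases
The fewer the uncalled bases (marked N), the higher the quality.
[Figure: bad and good examples.]

Duplicated/overrepresented reads
• There should not be too many duplicated/overrepresented reads.
• For example, some highly expressed chloroplast genes can account for 50% of the reads. In maize endosperm, 40% of reads come from the zein gene family.
• BLAST duplicated reads against NCBI to rule out contamination.
[Figure: too many duplicated reads, bad and good examples.]

K-mer frequency
The frequency at which sequences of K bases are repeated. If some k-mers occur at a very high frequency, there may be a problem! First check whether the barcode sequence has been removed.
That pattern ATGCCGTCT you are seeing is t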