Library Preparation and Sequencing Principles of Next-Generation Sequencing

He Youyu (何有裕)
yyhe@sibs.ac.cn
yyhe@biosino.com.cn
Shanghai Center for Bioinformation Technology (上海生物信息技术研究中心)
Shanghai Zhongxin Biotechnology Co., Ltd. (上海众信生物技术有限公司)
Suzhou Zhongxin Biotechnology Co., Ltd. (苏州众信生物技术有限公司)

CONTENTS
• Introduction to sample processing and sequencing principles
  • Roche 454
  • Illumina Solexa
• Raw data
• Quality control

TRUSEQ RNA AND DNA SAMPLE PREPARATION

CLUSTER GENERATION OVERVIEW
• ~1000-6000 molecules per cluster

CLUSTER GENERATION, TEMPLATE HYBRIDIZATION
Figure labels: flowcell; P5; P7; diol; OH
• Template hybridization
• Initial extension
• Denaturation

CLUSTER GENERATION, BRIDGE PCR
• 1st cycle: denaturation, annealing, extension
• 2nd cycle: denaturation, annealing, extension
Figure label: n = 25

TEMPLATE PREPARATION - BRIDGE PCR
• Adaptor ligation
• Surface attachment
• Bridge amplification
• Denaturation
Source: Trends in Genet 24:133 (2008)

SEQUENCING BY SYNTHESIS OVERVIEW
Figure labels: CAGTCATCACCTAGCGTA 5’; GTCAGTCAGTCAGT 3’; 5’; First base incorporated
• Cycle 1: add sequencing reagents, detect signal, cleave terminator and dye
• Cycle 2-n: add sequencing reagents and repeat

CYCLIC REVERSIBLE TERMINATION
• All four labeled reversible terminators are added per cycle
• Remove unincorporated bases and detect signal
• Remove the terminating group and the fluorescent dye
Source: Trends in Genet 24:133 (2008)
Figure labels: Terminating group; Fluorophore cleavage
Source: Nat Rev Genet 11:31 (2010)

BASE CALLING

FLOWCELL LAYOUT ON GAII
• A flowcell contains 8 lanes (Lane 1, Lane 2, ..., Lane 8)
• Each lane contains 2 columns (Column 1, Column 2)
• Each column contains 60 tiles
• Each tile is imaged 4 times per cycle

PRIMARY DATA ANALYSIS BY FIRECREST AND BUSTARD IN RTA/OLB
• tiff image file → Firecrest → intensity file → Bustard → sequence file
Figure labels: intensity file: position (X, Y) and A, C, G, T values for Cycle 1, Cycle 2, ...; sequence file: X, Y, sequence

CLUSTER GENERATION, SEQUENCING PRIMER HYBRIDIZATION (processing steps for single-read sequencing)
• Linearization
• Blocking with ddNTP
• Denature and hybridization
Figure labels: OH; diol; SBS3

SEQUENCE MULTIPLE SAMPLES IN THE SAME LANES
Multiplexing: multiple samples in the same lanes
Figure labels: DNA insert; Index; Read 1; Index Read; Read 2; Rd1 SP; Index SP; Rd2 SP

ADVANTAGES OF PAIRED-END SEQUENCING
• Read 1 and Read 2 are separated by a known distance
• In repetitive DNA, a single read maps to multiple positions
• A paired read maps uniquely

MATE-PAIR LIBRARY PREPARATION AND SEQUENCING
Figure labels: Read 1; Read 2; Known Distance
Source: Molecular Ecology Resources (2011)

TEMPLATE PREPARATION - EMULSION PCR
• Fragmentation
• Ligation
• Water-in-oil emulsion
• Micro-reactor
• emPCR
• PicoTiterPlate loading
Source: Trends in Genet 24:133 (2008)

PYROSEQUENCING
• A single dNTP type flows per cycle
• Released inorganic pyrophosphate (PPi) is converted to visible light through a series of reactions
• Remove unincorporated nucleotide
Source: Trends in Genet 24:133 (2008)

BASE CALLING
• Homopolymer error
Figure label: GV633020

FLEXIBLE MULTI-SAMPLE TAGGING (MID)
MID   Sequence
A     ACGAGTGCGT
B     ACGCTCGACA
C     AGACGCACTC
D     AGCACTGTAG
E     ATCAGACACG
F     CGTGTCTCTA
G     CTCGCGTGTC
H     TAGTATCAGC
I     TCTCTATGCG
J     TGATACGTCT
K     TACTGAGCTA
L     ATATCGCGAG
Figure labels: Primer A; MID; Key; Library fragment; Primer B; Sequencing primer

454 AND SOLEXA SEQUENCING MODES
454        Solexa
Single     Single (or nothing stated)
Pair end   Pair end
           Mate pair

SEMICONDUCTOR SEQUENCING
• Detects H+ released as a voltage change: fast
• Common microchip design standards: low-cost manufacturing
• Sequencing volume is increasing

FASTA SEQUENCE FORMAT

A FASTQ file records each sequence in 4 lines:
• Line 1 starts with the @ character, followed by the sequence identifier and description
• Line 2 is the sequence itself
• Line 3 starts with the + character; the rest may be empty or repeat line 1
• Line 4 encodes the quality of the sequence in line 2 and must be the same length as line 2
Example:
@HWI-ST507:211:C18E6ACXX:2:1101:1688:1992 1:N:0:GAGTGG
CGACAATTTTTTTTGATATTAATAAAGATAGAACTTTCTTCCTATGAGTTTTCTCTC
+
CCCFFDFFHHHHGJJGHIIJGIIJJJJIIJJHJJJJJIJJIIIGIIIJGGIHJDIJIGAHEHFFGHGHE

ILLUMINA SEQUENCE IDENTIFIERS (before Casava 1.8)
@HWI-EAS364_0004:4:1:995:9044#0/1
HWI-EAS364_0004   Unique instrument name
4                 Flowcell lane
1                 Tile number within the flowcell lane
995               x-coordinate of the cluster within the tile
9044              y-coordinate of the cluster within the tile
#0                Index number for a multiplexed sample (0 means no index)
/1                Member of a pair

ILLUMINA SEQUENCE IDENTIFIERS (Casava 1.8)
@HWI-ST507:211:C18E6ACXX:2:1101:1688:1992 1:N:0:GAGTGG
HWI-ST507   Unique instrument name
211         Run ID
C18E6ACXX   Flowcell ID
2           Flowcell lane
1101        Tile number within the flowcell lane
1688        x-coordinate of the cluster within the tile
1992        y-coordinate of the cluster within the tile
1           Member of a pair (1 or 2)
N           Whether the read failed the filter (Y: the read is bad, N: the read is good)
0           Control bits; 0 means no control bits are set
GAGTGG      Index sequence

SEQUENCE QUALITY
Quality calculation: Q is computed as a Phred quality score,
$Q = -10 \log_{10} p$
where p is the probability that the corresponding base call is wrong.
The resulting Q is an integer; add 33 or 64 to it, then convert the sum to an ASCII character.
Note: before Solexa 1.3, quality was computed as
$Q_{\text{solexa}} = -10 \log_{10} \frac{p}{1-p}$

Q VALUES AND THEIR ASCII CODES
  SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS.........................................
  ..........................XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX..........
  ...............................IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII..........
  .................................JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ..........
  LLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLL........................................
  !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqr
  |                         |    |        |                              |
 33                        59   64       73                            104
  0........................26...31.......40
                           -5....0........9.............................40
                                 0........9.............................40
                                    3.....9.............................40
  0........................26...31........41

S - Sanger          Phred+33, raw reads typically (0, 40)
X - Solexa          Solexa+64, raw reads typically (-5, 40)
I - Illumina 1.3+   Phred+64, raw reads typically (0, 40)
J - Illumina 1.5+   Phred+64, raw reads typically (3, 40), with 0 = unused, 1 = unused, 2 = Read Segment Quality Control Indicator
L - Illumina 1.8+   Phred+33, raw reads typically (0, 41)

454 RAW DATA
• Images, SFF format, FASTA format (QUAL)
>HSAPGDX01D1KDA length=181 xy=1540_3788 region=1 run=R_2012_08_01_00_39_39
ACGTGTTCTGAGCCATATTGCGGTACTGGAAGGTGCGCCTGCACTGTCTGAGCACTGGTCACTGCTCGATACCAATGAAGCCTTATTTGATGAGGCGCGCACCACGCAGGCGGCGACTATTATCTTCTCGTTTGATCCAGAATAACCAAATCGAAAACGCTGGCAAGGCACACAGGGGATA
>HSAPGDX01D1KDA
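The Phred quality calculation and the +33 / +64 ASCII offsets described in the sequence-quality slides can be illustrated with a short Python sketch. This is only a minimal illustration, not part of the original slides: the function names (`phred_from_error_prob`, `solexa_from_error_prob`, `encode_quality`, `decode_quality`) are hypothetical, and the choice of offset is assumed to be known in advance from the pipeline version that produced the file.

```python
import math


def phred_from_error_prob(p: float) -> int:
    """Phred score Q = -10 * log10(p), rounded to the nearest integer."""
    return round(-10 * math.log10(p))


def error_prob_from_phred(q: float) -> float:
    """Invert the Phred formula: p = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def solexa_from_error_prob(p: float) -> float:
    """Older Solexa score: Q = -10 * log10(p / (1 - p)), not rounded here."""
    return -10 * math.log10(p / (1 - p))


def encode_quality(scores, offset: int = 33) -> str:
    """Turn integer Q values into a FASTQ quality string (offset 33 or 64)."""
    return "".join(chr(q + offset) for q in scores)


def decode_quality(qual: str, offset: int = 33) -> list:
    """Turn a FASTQ quality string back into integer Q values."""
    return [ord(ch) - offset for ch in qual]


if __name__ == "__main__":
    # First 8 quality characters of the Casava 1.8 example record (Phred+33).
    scores = decode_quality("CCCFFDFF", offset=33)
    print(scores)  # [34, 34, 34, 37, 37, 35, 37, 37]
    # Q = 34 corresponds to an error probability of roughly 4 in 10,000.
    print(error_prob_from_phred(scores[0]))
    # The same Q values written with the older Phred+64 offset.
    print(encode_quality(scores, offset=64))
```

Decoding with the wrong offset shifts every score by 31, so a Phred+64 file read as Phred+33 appears to contain implausibly high qualities; that shift is a common way to notice a mismatched encoding.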
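The two Illumina sequence identifier layouts described in the identifier slides (before Casava 1.8 and Casava 1.8) can be split into their fields with a few lines of Python. The sketch below is an assumption-laden illustration rather than an official parser: the names `IlluminaReadId`, `parse_casava18_id`, and `parse_pre_casava18_id` are hypothetical, and it assumes headers follow exactly the field layout shown in the slides.

```python
import re
from typing import NamedTuple


class IlluminaReadId(NamedTuple):
    """Fields of a Casava 1.8 identifier, in the order listed on the slide."""
    instrument: str
    run_id: int
    flowcell_id: str
    lane: int
    tile: int
    x: int
    y: int
    pair_member: int
    is_filtered: bool   # True when the flag is "Y" (read failed the filter)
    control_bits: int
    index_sequence: str


def parse_casava18_id(header: str) -> IlluminaReadId:
    """Parse '@instrument:run:flowcell:lane:tile:x:y member:filter:control:index'."""
    header = header.strip()
    if not header.startswith("@"):
        raise ValueError("identifier line must start with '@'")
    location, sep, description = header[1:].partition(" ")
    loc = location.split(":")
    desc = description.split(":")
    if not sep or len(loc) != 7 or len(desc) != 4:
        raise ValueError("not a Casava 1.8 style identifier: " + header)
    instrument, run_id, flowcell_id, lane, tile, x, y = loc
    pair_member, filtered, control_bits, index_sequence = desc
    if filtered not in ("Y", "N"):
        raise ValueError("filter flag must be Y or N")
    return IlluminaReadId(instrument, int(run_id), flowcell_id, int(lane),
                          int(tile), int(x), int(y), int(pair_member),
                          filtered == "Y", int(control_bits), index_sequence)


# Older layout: @instrument:lane:tile:x:y#index/member
_PRE_CASAVA18 = re.compile(
    r"^@(?P<instrument>[^:]+):(?P<lane>\d+):(?P<tile>\d+):"
    r"(?P<x>\d+):(?P<y>\d+)#(?P<index>[^/]+)/(?P<pair_member>[12])$"
)


def parse_pre_casava18_id(header: str) -> dict:
    """Parse an identifier written before Casava 1.8 into a dict of fields."""
    match = _PRE_CASAVA18.match(header.strip())
    if match is None:
        raise ValueError("not a pre-Casava 1.8 style identifier: " + header)
    fields = match.groupdict()
    for key in ("lane", "tile", "x", "y", "pair_member"):
        fields[key] = int(fields[key])
    return fields


if __name__ == "__main__":
    print(parse_casava18_id("@HWI-ST507:211:C18E6ACXX:2:1101:1688:1992 1:N:0:GAGTGG"))
    print(parse_pre_casava18_id("@HWI-EAS364_0004:4:1:995:9044#0/1"))
```

Both functions raise `ValueError` instead of guessing when a header does not match, since identifier layouts differ between pipeline versions and a silent mis-parse would mislabel lanes or pair members.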