2013 - BMMB 597D: Analyzing Next Generation Sequencing Data
Week 2, Lecture 4
István Albert, Bioinformatics Consulting Center, Penn State

BioStar: question of the day

Sequencing Technology Reviews

One slide biology summary needed for this lecture
• Cells contain genetic material → DNA → a large molecule (a nucleic acid) built from only four bases: Adenine, Thymine, Guanine, Cytosine
• Stored in long segments → chromosomes → sum of chromosomes → genome
• Parts of the genome get transcribed, then translated into proteins → those are made of amino acids (20 standard ones): Alanine, Threonine, Glycine, Cysteine, …
• Sequencing technologies can be used to identify nucleic acids (DNA)
• Important: we cannot directly sequence the DNA of a cell → there is always a laboratory protocol of substantial complexity → library preparation
• Sequencing instruments sequence libraries.

Sequencing Technologies - perspective
1st generation: Frederick Sanger develops DNA sequencing technology. Latest versions: 3 million bases/day, 1500bp long reads
2nd generation: (next-gen) sequencing started in 2005 with the release of the 454 sequencing platform. 600 billion bases/week, 150bp long reads
3rd generation: single molecule (no DNA amplification required); these are not replacing but augmenting 2nd generation systems; longer reads, shorter turnarounds

Sequencing Instruments
The wide range of characteristics among available platforms provides opportunities both to conduct groundbreaking studies and to waste money on scales that were previously infeasible.

One of the most elusive formats: FASTA
• Seemingly trivial, but it is also "under-specified"
• Many "custom" extensions
• Tools make assumptions
• Surprising number of problems

FASTA format
identifier
sequence (in a certain alphabet)
The alphabet is similar to an ontology: what are the valid characters to describe the sequence?

Alphabets
• International Union of Pure and Applied Chemistry (IUPAC) codes
• Nucleic acid sequences
• Peptide sequences → polypeptides could be proteins

A multi record FASTA
It is not clear whether the example sequence contains nucleic acids or amino acids (it feels like nucleic acids because it has so many A, C, T and G characters, but those are also valid amino acid codes).
Labels on the example: identifier, extra info

Pitfalls
• The line length of sequences was not regulated (a gigantic oversight!): if FASTA files were fixed at, say, 80 characters per line, we could easily index and then randomly access any interval inside them!
• Strange things will happen if one were to flatten (linearize) a gigantic sequence (a human chromosome, on the order of 300 million bases): tools may break in spectacular ways

More considerations
• Many tools will embed extra information into either the identifier or the "free zone", the description section
• See the FASTA format wiki page

First step of any sequence processing: understand your FASTA file
• What is what in this file
1. How many sequences do we have?
2. Are sequences all on a single line or over multiple lines?
3. What is the identifier, and what is embedded in the description?

Understand your file
What is the identifier and descriptor?
Are sequences wrapped?
Count the number of ">" versus the number of lines

A handy command line calculator
Written in 1975 → arbitrary precision calculator

A short demo on numerical precision
Vary the numbers in cell B1: 1, 1000, 1E6, 1E20, 1E32

Cut out the beginning of each read
Produce the list of subsequences with counts

Homework 4
The facility reports that each sequence in the lec4.fa file contains a 10 base long barcode followed by a 10 base primer sequence.
There should be 4 barcodes and 1 primer across all sequences. The primer, though, may have a mismatch.
Verify this statement and report your findings.
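
The checks from the "Understand your file" slide (how many records, wrapped or not, identifier versus description) and the "cut out the beginning of each read" step can be sketched in Python. This is only a minimal illustration under stated assumptions, not the method used in the lecture: the function names (read_fasta, split_header, is_wrapped, prefix_counts) are hypothetical, the identifier is assumed to be the first whitespace-delimited word of the header, and the toy records are invented rather than taken from lec4.fa.

```python
from collections import Counter


def split_header(header):
    """Assumption: the identifier is the first word, the rest is free-form description."""
    parts = header.split(None, 1)
    identifier = parts[0] if parts else ""
    description = parts[1] if len(parts) > 1 else ""
    return identifier, description


def read_fasta(text):
    """Yield (identifier, description, sequence) for each record in FASTA text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield split_header(header) + ("".join(chunks),)
            header, chunks = line[1:], []
        elif header is not None:
            # Wrapped sequence lines are joined back into one string.
            chunks.append(line)
    if header is not None:
        yield split_header(header) + ("".join(chunks),)


def is_wrapped(text):
    """Compare the count of '>' lines to the total: more than one
    sequence line per header means at least one record is wrapped."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    headers = sum(1 for ln in lines if ln.startswith(">"))
    return len(lines) > 2 * headers


def prefix_counts(records, start=0, length=10):
    """Count the subsequence seq[start:start+length] across all records."""
    return Counter(seq[start:start + length] for _, _, seq in records)


# Toy data, invented for illustration only.
toy = """>read1 sample=A
AAAAACCCCCGGGGGTTTTTACGT
>read2 sample=B
AAAAACCCCCGGGGGTTTTAACGT
>read3
TTTTTGGGGGGGGGGTTTTTAC
GT
"""

records = list(read_fasta(toy))
print(len(records), "records; wrapped:", is_wrapped(toy))
print("first 10 bases:", prefix_counts(records, 0, 10).most_common())
print("next 10 bases:", prefix_counts(records, 10, 10).most_common())
```

Counting fixed-position slices this way makes an unexpected variant stand out as a low-count entry next to the dominant one, which is the kind of evidence the "list of subsequences with counts" is meant to produce.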