Sequencing the SARS Virus
The authors would like to thank Marco Marra, Steven Jones, Caroline Astell, Rob Holt, Angela Brooks-Wilson, Jas Khattra, Jennifer Asano, Sarah Barber, Susanna Chan, Allison Cloutier, Sean Coughlin, Doug Freeman, Noreen Girn, Obi Griffith, Steve Leach, Mike Mayo, Helen McDonald, Steven Montgomery, Pawan Pandoh, Anca Petrescu, Gord Robertson, Jacquie Schein, Asim Siddiqui, Duane Smailus, Jeff Stott and George Yang for scientific expertise, lab and bioinformatics efforts. We also would like to thank Kirk Schoeffel, Mark Mayo and Bernard Li for their system administration advice.
Let's do some bioinformatics using bash and a few binaries out of /bin and /usr/bin. We will calculate the GC ratio of the Tor2/SARS genome—the fraction of base pairs that are either a G or a C. Let's avoid using awk to make things interesting. First, download the sequence with wget, using -q to silence its verbose output:
> wget -q http://mkweb.bcgsc.ca/sars/AY274119.fa > head AY274119.fa gi|30248028|gb|AY274119.3| SARS coronavirus TOR2 ATATTAGGTTTTTACCTACCCAGGA...
The sequence file is in FASTA format consisting of a header line and the sequence, split into fixed-width lines. The following counts the number of Gs and Cs in the sequence and presents the total as a fraction of the total number of bases:
> grep -v "^>" AY274119.fa | fold -w 1 | tr "ATGC" "..xx" | sort | uniq -c | sed 's/[^0-9]//g' | t -s "\012" " " | sed 's/\([0-9]*\) \([0-9]*\)/scale = 3; ↪\2 \/ (\1+\2)/' | bc -i scale = 3; 12127 / (17624+12127) .407
Out of the 29,751 bases in our sequence, 12,127 are either G or C, giving a GC content of 41%.
GSC MySQL LIMS
We collected 3,250 sequencing reads containing 2.1 million quality base pairs contributing toward the initial draft assembly. This represented roughly 70X redundant coverage of the genome. WGS is usually done to no more than 10X, but for us, time was of the essence, and we wanted to avoid delays associated with finishing regions that were not fully covered by the first round of sequencing.
SELECT SUM(Sequence_Length) AS bp_tot, AVG(Quality_Length) AS bpq_avg, SUM(Quality_Length) AS bp_qual_tot, COUNT(Well) AS reads, Sequence_DateTime AS date, Equipment_Name AS equip FROM Equipment, Clone_Sequence, Sequence_Batch, Sequence, Plate, Library, Project WHERE FK_Sequence_Batch__ID=Sequence_Batch_ID AND FK_Plate__ID=Plate_ID AND FK_Library__Name=Library_Name AND FK_Equipment__ID=Equipment_ID AND FK_Project__ID=Project_ID AND FK_Sequence__ID=Sequence_ID AND Sequence_Subdirectory like "SARS2%" AND Quality_Length > 100 AND Sequence_DateTime < "20030413" GROUP BY Sequence_ID ORDER BY Sequence_DateTime; bp_tot bpq_avg bp_tot reads date equip 437256 612.6399 205847 336 2003-04-11 21:07:06 SARS212.B21 D3730-3 412366 752.1074 245187 326 2003-04-11 22:15:34 SARS213.B21 D3730-1 269456 639.1926 225635 353 2003-04-11 22:22:34 SARS215.B21 D3700-6 130525 715.5060 118774 166 2003-04-11 22:25:44 SARS216.B21 D3700-5 282490 682.6311 249843 366 2003-04-11 22:27:14 SARS215.BR D3700-4 310119 612.7601 212015 346 2003-04-11 22:31:56 SARS213.BR D3700-1 182573 681.4975 136981 201 2003-04-11 22:36:40 SARS216.BR D3700-3 301471 642.2273 226064 352 2003-04-12 01:58:16 SARS212.BR D3700-2 401595 690.5204 220276 319 2003-04-12 05:13:26 SARS211.BR D3730-3 460100 642.0468 219580 342 2003-04-12 06:20:52 SARS214.BR D3730-2 182360 471.7832 67465 143 2003-04-12 07:14:44 SARS214.B21 D3730-1
Growth of Genbank: www.ncbi.nlm.nih.gov/Genbank/genbankstats.html
How Perl Saved the Human Genome Project: bioperl.org/GetStarted/tpj_ls_bio.html
Image of a Coronavirus: www3.btwebworld.com/vdg/gallery/Coronavirus.jpg
SARS Issue of Science: www.sciencemag.org/feature/data/sars
Timeline of SARS History: www.worldhistory.com/sars.htm
UCSC Assembly of Human Genome: www.cse.ucsc.edu/~learithe/browser/goldenPath/algo.html
Martin Krzywinski (firstname.lastname@example.org) is a bioinformatics research scientist at Canada's Michael Smith Genome Sciences Centre. He spends his time applying Perl to problems in physical mapping and data-processing automation. In his spare time he can be found encouraging his cat to stick to her diet.
Yaron Butterfield (email@example.com) leads the sequencing bioinformatics team at Canada's Michael Smith Genome Sciences Centre. He and his group develop DNA sequence analysis and visualization software and pipelines for various genome and cancer-based research projects.
Linux Journal Annual Archive
Editorial Advisory Panel
Thank you to our 2014 Editorial Advisors!
- Jeff Parent
- Brad Baillio
- Nick Baronian
- Steve Case
- Chadalavada Kalyana
- Caleb Cullen
- Keir Davis
- Michael Eager
- Nick Faltys
- Dennis Frey
- Philip Jacob
- Jay Kruizenga
- Steve Marquez
- Dave McAllister
- Craig Oda
- Mike Roberts
- Chris Stark
- Patrick Swartz
- David Lynch
- Alicia Gibb
- Thomas Quinlan
- Carson McDonald
- Kristen Shoemaker
- Charnell Luchich
- James Walker
- Victor Gregorio
- Hari Boukis
- Brian Conner
- David Lane