Sequencing the SARS Virus
The authors would like to thank Marco Marra, Steven Jones, Caroline Astell, Rob Holt, Angela Brooks-Wilson, Jas Khattra, Jennifer Asano, Sarah Barber, Susanna Chan, Allison Cloutier, Sean Coughlin, Doug Freeman, Noreen Girn, Obi Griffith, Steve Leach, Mike Mayo, Helen McDonald, Steven Montgomery, Pawan Pandoh, Anca Petrescu, Gord Robertson, Jacquie Schein, Asim Siddiqui, Duane Smailus, Jeff Stott and George Yang for their scientific expertise and their lab and bioinformatics efforts. We would also like to thank Kirk Schoeffel, Mark Mayo and Bernard Li for their system administration advice.
Command-Line Bioinformatics
Let's do some bioinformatics using bash and a few binaries out of /bin and /usr/bin. We will calculate the GC ratio of the Tor2/SARS genome—the fraction of base pairs that are either a G or a C. Let's avoid using awk to make things interesting. First, download the sequence with wget, using -q to silence its verbose output:
> wget -q http://mkweb.bcgsc.ca/sars/AY274119.fa
> head AY274119.fa
>gi|30248028|gb|AY274119.3| SARS coronavirus TOR2
ATATTAGGTTTTTACCTACCCAGGA...
The sequence file is in FASTA format: a header line beginning with ">", followed by the sequence split into fixed-width lines. The following pipeline counts the number of Gs and Cs in the sequence and presents the count as a fraction of the total number of bases:
> grep -v "^>" AY274119.fa | fold -w 1 | tr "ATGC" "..xx" | sort | uniq -c | sed 's/[^0-9]//g' | tr -s "\012" " " | sed 's/\([0-9]*\) \([0-9]*\)/scale = 3; \2 \/ (\1+\2)/' | bc -i
scale = 3; 12127 / (17624+12127)
.407
Out of the 29,751 bases in our sequence, 12,127 are either G or C, giving a GC content of 41%.
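As a cross-check on the pipeline above, here is a minimal sketch of the same calculation using only grep, tr and shell parameter expansion. The FASTA record is a made-up toy sequence (an assumption for illustration, not the Tor2 genome), and all variable names are hypothetical. Integer arithmetic reports the result as a whole-number percentage rather than the three-decimal fraction that bc produces.

```shell
#!/bin/sh
# Toy FASTA record: a made-up sequence for illustration, NOT the Tor2 genome.
fasta=$(printf '>toy record\nATGCGC\nATTA\n')

# Drop the header line, then join the fixed-width sequence lines into one string.
bases=$(printf '%s\n' "$fasta" | grep -v '^>' | tr -d '\n')

# Delete every character that is not G or C; what remains is the GC bases.
gc_only=$(printf '%s' "$bases" | tr -cd 'GC')

# String lengths give the two counts directly.
total=${#bases}
gc=${#gc_only}

# Integer percentage (truncated), so no bc or awk is needed.
percent=$((100 * gc / total))

echo "$gc of $total bases are G or C (${percent}%)"
```

To try it on the real sequence, the toy record could be replaced with `fasta=$(cat AY274119.fa)`. Like the original pipeline, this sketch assumes the sequence uses uppercase letters only.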
GSC MySQL LIMS
We collected 3,250 sequencing reads containing 2.1 million quality base pairs contributing toward the initial draft assembly. This represented roughly 70X redundant coverage of the genome. Whole-genome shotgun (WGS) sequencing is usually done to no more than 10X coverage, but for us, time was of the essence, and we wanted to avoid delays associated with finishing regions that were not fully covered by the first round of sequencing.
SELECT
  SUM(Sequence_Length) AS bp_tot,
  AVG(Quality_Length) AS bpq_avg,
  SUM(Quality_Length) AS bp_qual_tot,
  COUNT(Well) AS reads,
  Sequence_DateTime AS date,
  Equipment_Name AS equip
FROM
  Equipment, Clone_Sequence, Sequence_Batch, Sequence, Plate, Library, Project
WHERE
  FK_Sequence_Batch__ID=Sequence_Batch_ID AND
  FK_Plate__ID=Plate_ID AND
  FK_Library__Name=Library_Name AND
  FK_Equipment__ID=Equipment_ID AND
  FK_Project__ID=Project_ID AND
  FK_Sequence__ID=Sequence_ID AND
  Sequence_Subdirectory like "SARS2%" AND
  Quality_Length > 100 AND
  Sequence_DateTime < "20030413"
GROUP BY Sequence_ID
ORDER BY Sequence_DateTime;

bp_tot  bpq_avg   bp_qual_tot  reads  date                              equip
437256  612.6399  205847       336    2003-04-11 21:07:06  SARS212.B21  D3730-3
412366  752.1074  245187       326    2003-04-11 22:15:34  SARS213.B21  D3730-1
269456  639.1926  225635       353    2003-04-11 22:22:34  SARS215.B21  D3700-6
130525  715.5060  118774       166    2003-04-11 22:25:44  SARS216.B21  D3700-5
282490  682.6311  249843       366    2003-04-11 22:27:14  SARS215.BR   D3700-4
310119  612.7601  212015       346    2003-04-11 22:31:56  SARS213.BR   D3700-1
182573  681.4975  136981       201    2003-04-11 22:36:40  SARS216.BR   D3700-3
301471  642.2273  226064       352    2003-04-12 01:58:16  SARS212.BR   D3700-2
401595  690.5204  220276       319    2003-04-12 05:13:26  SARS211.BR   D3730-3
460100  642.0468  219580       342    2003-04-12 06:20:52  SARS214.BR   D3730-2
182360  471.7832  67465        143    2003-04-12 07:14:44  SARS214.B21  D3730-1
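
The read count and coverage figures quoted in this sidebar can be checked against the query output with a few lines of shell. This is only a sketch: the two lists are copied from the SUM(Quality_Length) and COUNT(Well) columns of the output above, the genome length is the 29,751 bases counted in the GC calculation, and the helper name add_up is hypothetical.

```shell
#!/bin/sh
# Values copied from the query output: SUM(Quality_Length) and COUNT(Well) per run.
quality_bases="205847 245187 225635 118774 249843 212015 136981 226064 220276 219580 67465"
reads="336 326 353 166 366 346 201 352 319 342 143"

# Genome length in bases, taken from the GC calculation on AY274119.fa.
genome_length=29751

# Hypothetical helper: print the sum of its integer arguments.
add_up() {
    total=0
    for n in "$@"; do
        total=$((total + n))
    done
    echo "$total"
}

# The lists are deliberately unquoted so that each number becomes one argument.
total_quality=$(add_up $quality_bases)
total_reads=$(add_up $reads)

# Redundant coverage = quality bases sequenced / genome length (integer division).
coverage=$((total_quality / genome_length))

echo "$total_reads reads, $total_quality quality bases, about ${coverage}X coverage"
```

The totals come to 3,250 reads and 2,127,667 quality bases. Integer division gives 71X, consistent with the "roughly 70X" obtained from the rounded figure of 2.1 million bases.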
Resources
Growth of Genbank: www.ncbi.nlm.nih.gov/Genbank/genbankstats.html
How Perl Saved the Human Genome Project: bioperl.org/GetStarted/tpj_ls_bio.html
Image of a Coronavirus: www3.btwebworld.com/vdg/gallery/Coronavirus.jpg
SARS Issue of Science: www.sciencemag.org/feature/data/sars
SARS Statistics and Information: www.cdc.gov/ncidod/sars and lassesen.com/sars
Timeline of SARS History: www.worldhistory.com/sars.htm
UCSC Assembly of Human Genome: www.cse.ucsc.edu/~learithe/browser/goldenPath/algo.html
Martin Krzywinski (martink@bcgsc.ca) is a bioinformatics research scientist at Canada's Michael Smith Genome Sciences Centre. He spends his time applying Perl to problems in physical mapping and data-processing automation. In his spare time he can be found encouraging his cat to stick to her diet.
Yaron Butterfield (ybutterf@bcgsc.ca) leads the sequencing bioinformatics team at Canada's Michael Smith Genome Sciences Centre. He and his group develop DNA sequence analysis and visualization software and pipelines for various genome and cancer-based research projects.