Sequencing the SARS Virus
The authors would like to thank Marco Marra, Steven Jones, Caroline Astell, Rob Holt, Angela Brooks-Wilson, Jas Khattra, Jennifer Asano, Sarah Barber, Susanna Chan, Allison Cloutier, Sean Coughlin, Doug Freeman, Noreen Girn, Obi Griffith, Steve Leach, Mike Mayo, Helen McDonald, Steven Montgomery, Pawan Pandoh, Anca Petrescu, Gord Robertson, Jacquie Schein, Asim Siddiqui, Duane Smailus, Jeff Stott and George Yang for scientific expertise, lab and bioinformatics efforts. We also would like to thank Kirk Schoeffel, Mark Mayo and Bernard Li for their system administration advice.
Let's do some bioinformatics using bash and a few binaries out of /bin and /usr/bin. We will calculate the GC ratio of the Tor2/SARS genome—the fraction of base pairs that are either a G or a C. Let's avoid using awk to make things interesting. First, download the sequence with wget, using -q to silence its verbose output:
> wget -q http://mkweb.bcgsc.ca/sars/AY274119.fa
> head AY274119.fa
>gi|30248028|gb|AY274119.3| SARS coronavirus TOR2
ATATTAGGTTTTTACCTACCCAGGA...
The sequence file is in FASTA format consisting of a header line and the sequence, split into fixed-width lines. The following counts the number of Gs and Cs in the sequence and presents the total as a fraction of the total number of bases:
> grep -v "^>" AY274119.fa | fold -w 1 |
    tr "ATGC" "..xx" | sort | uniq -c |
    sed 's/[^0-9]//g' | tr -s "\012" " " |
    sed 's/\([0-9]*\) \([0-9]*\)/scale = 3; \2 \/ (\1+\2)/' |
    bc -i
scale = 3; 12127 / (17624+12127)
.407
Out of the 29,751 bases in our sequence, 12,127 are either G or C, giving a GC content of 41%.
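The same count can be cross-checked with a shorter pipeline. What follows is only a sketch, not the method used above: it still avoids awk, but it lets tr delete every character that is not a base of interest and wc count what remains. The function name gc_fraction is hypothetical, and the sketch assumes the file contains only the letters A, C, G and T outside its header line.

```shell
# gc_fraction FILE
# Print the GC fraction of a FASTA file, truncated to three decimals.
# The function name is hypothetical and used for illustration only.
gc_fraction() {
    local fasta=$1 gc total
    # Skip header lines, delete everything except G and C, count the rest.
    gc=$(grep -v '^>' "$fasta" | tr -cd 'GCgc' | wc -c)
    # Same again for all four bases; newlines are deleted, so not counted.
    total=$(grep -v '^>' "$fasta" | tr -cd 'ACGTacgt' | wc -c)
    # bc truncates (it does not round) to the requested scale.
    echo "scale = 3; $gc / $total" | bc
}

# Usage, assuming the file downloaded earlier is in the current directory:
# gc_fraction AY274119.fa
```

With the counts reported above (12,127 G or C out of 29,751 bases), this should print the same .407.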
GSC MySQL LIMS
We collected 3,250 sequencing reads containing 2.1 million quality base pairs contributing toward the initial draft assembly. This represented roughly 70X redundant coverage of the genome. Whole-genome shotgun (WGS) sequencing is usually done to no more than 10X, but for us, time was of the essence, and we wanted to avoid delays associated with finishing regions that were not fully covered by the first round of sequencing.
SELECT
  SUM(Sequence_Length) AS bp_tot,
  AVG(Quality_Length) AS bpq_avg,
  SUM(Quality_Length) AS bp_qual_tot,
  COUNT(Well) AS reads,
  Sequence_DateTime AS date,
  Equipment_Name AS equip
FROM
  Equipment, Clone_Sequence, Sequence_Batch, Sequence,
  Plate, Library, Project
WHERE
  FK_Sequence_Batch__ID=Sequence_Batch_ID AND
  FK_Plate__ID=Plate_ID AND
  FK_Library__Name=Library_Name AND
  FK_Equipment__ID=Equipment_ID AND
  FK_Project__ID=Project_ID AND
  FK_Sequence__ID=Sequence_ID AND
  Sequence_Subdirectory like "SARS2%" AND
  Quality_Length > 100 AND
  Sequence_DateTime < "20030413"
GROUP BY Sequence_ID
ORDER BY Sequence_DateTime;

bp_tot  bpq_avg   bp_qual_tot  reads  date                              equip
437256  612.6399  205847       336    2003-04-11 21:07:06  SARS212.B21  D3730-3
412366  752.1074  245187       326    2003-04-11 22:15:34  SARS213.B21  D3730-1
269456  639.1926  225635       353    2003-04-11 22:22:34  SARS215.B21  D3700-6
130525  715.5060  118774       166    2003-04-11 22:25:44  SARS216.B21  D3700-5
282490  682.6311  249843       366    2003-04-11 22:27:14  SARS215.BR   D3700-4
310119  612.7601  212015       346    2003-04-11 22:31:56  SARS213.BR   D3700-1
182573  681.4975  136981       201    2003-04-11 22:36:40  SARS216.BR   D3700-3
301471  642.2273  226064       352    2003-04-12 01:58:16  SARS212.BR   D3700-2
401595  690.5204  220276       319    2003-04-12 05:13:26  SARS211.BR   D3730-3
460100  642.0468  219580       342    2003-04-12 06:20:52  SARS214.BR   D3730-2
182360  471.7832  67465        143    2003-04-12 07:14:44  SARS214.B21  D3730-1
Martin Krzywinski (email@example.com) is a bioinformatics research scientist at Canada's Michael Smith Genome Sciences Centre. He spends his time applying Perl to problems in physical mapping and data-processing automation. In his spare time he can be found encouraging his cat to stick to her diet.
Yaron Butterfield (firstname.lastname@example.org) leads the sequencing bioinformatics team at Canada's Michael Smith Genome Sciences Centre. He and his group develop DNA sequence analysis and visualization software and pipelines for various genome and cancer-based research projects.