Sequencing the SARS Virus
The authors would like to thank Marco Marra, Steven Jones, Caroline Astell, Rob Holt, Angela Brooks-Wilson, Jas Khattra, Jennifer Asano, Sarah Barber, Susanna Chan, Allison Cloutier, Sean Coughlin, Doug Freeman, Noreen Girn, Obi Griffith, Steve Leach, Mike Mayo, Helen McDonald, Steven Montgomery, Pawan Pandoh, Anca Petrescu, Gord Robertson, Jacquie Schein, Asim Siddiqui, Duane Smailus, Jeff Stott and George Yang for scientific expertise, lab and bioinformatics efforts. We also would like to thank Kirk Schoeffel, Mark Mayo and Bernard Li for their system administration advice.
Let's do some bioinformatics using bash and a few binaries out of /bin and /usr/bin. We will calculate the GC ratio of the Tor2/SARS genome—the fraction of base pairs that are either a G or a C. Let's avoid using awk to make things interesting. First, download the sequence with wget, using -q to silence its verbose output:
> wget -q http://mkweb.bcgsc.ca/sars/AY274119.fa > head AY274119.fa gi|30248028|gb|AY274119.3| SARS coronavirus TOR2 ATATTAGGTTTTTACCTACCCAGGA...
The sequence file is in FASTA format consisting of a header line and the sequence, split into fixed-width lines. The following counts the number of Gs and Cs in the sequence and presents the total as a fraction of the total number of bases:
> grep -v "^>" AY274119.fa | fold -w 1 | tr "ATGC" "..xx" | sort | uniq -c | sed 's/[^0-9]//g' | t -s "\012" " " | sed 's/\([0-9]*\) \([0-9]*\)/scale = 3; ↪\2 \/ (\1+\2)/' | bc -i scale = 3; 12127 / (17624+12127) .407
Out of the 29,751 bases in our sequence, 12,127 are either G or C, giving a GC content of 41%.
GSC MySQL LIMS
We collected 3,250 sequencing reads containing 2.1 million quality base pairs contributing toward the initial draft assembly. This represented roughly 70X redundant coverage of the genome. WGS is usually done to no more than 10X, but for us, time was of the essence, and we wanted to avoid delays associated with finishing regions that were not fully covered by the first round of sequencing.
SELECT SUM(Sequence_Length) AS bp_tot, AVG(Quality_Length) AS bpq_avg, SUM(Quality_Length) AS bp_qual_tot, COUNT(Well) AS reads, Sequence_DateTime AS date, Equipment_Name AS equip FROM Equipment, Clone_Sequence, Sequence_Batch, Sequence, Plate, Library, Project WHERE FK_Sequence_Batch__ID=Sequence_Batch_ID AND FK_Plate__ID=Plate_ID AND FK_Library__Name=Library_Name AND FK_Equipment__ID=Equipment_ID AND FK_Project__ID=Project_ID AND FK_Sequence__ID=Sequence_ID AND Sequence_Subdirectory like "SARS2%" AND Quality_Length > 100 AND Sequence_DateTime < "20030413" GROUP BY Sequence_ID ORDER BY Sequence_DateTime; bp_tot bpq_avg bp_tot reads date equip 437256 612.6399 205847 336 2003-04-11 21:07:06 SARS212.B21 D3730-3 412366 752.1074 245187 326 2003-04-11 22:15:34 SARS213.B21 D3730-1 269456 639.1926 225635 353 2003-04-11 22:22:34 SARS215.B21 D3700-6 130525 715.5060 118774 166 2003-04-11 22:25:44 SARS216.B21 D3700-5 282490 682.6311 249843 366 2003-04-11 22:27:14 SARS215.BR D3700-4 310119 612.7601 212015 346 2003-04-11 22:31:56 SARS213.BR D3700-1 182573 681.4975 136981 201 2003-04-11 22:36:40 SARS216.BR D3700-3 301471 642.2273 226064 352 2003-04-12 01:58:16 SARS212.BR D3700-2 401595 690.5204 220276 319 2003-04-12 05:13:26 SARS211.BR D3730-3 460100 642.0468 219580 342 2003-04-12 06:20:52 SARS214.BR D3730-2 182360 471.7832 67465 143 2003-04-12 07:14:44 SARS214.B21 D3730-1
Growth of Genbank: www.ncbi.nlm.nih.gov/Genbank/genbankstats.html
How Perl Saved the Human Genome Project: bioperl.org/GetStarted/tpj_ls_bio.html
Image of a Coronavirus: www3.btwebworld.com/vdg/gallery/Coronavirus.jpg
SARS Issue of Science: www.sciencemag.org/feature/data/sars
Timeline of SARS History: www.worldhistory.com/sars.htm
UCSC Assembly of Human Genome: www.cse.ucsc.edu/~learithe/browser/goldenPath/algo.html
Martin Krzywinski (firstname.lastname@example.org) is a bioinformatics research scientist at Canada's Michael Smith Genome Sciences Centre. He spends his time applying Perl to problems in physical mapping and data-processing automation. In his spare time he can be found encouraging his cat to stick to her diet.
Yaron Butterfield (email@example.com) leads the sequencing bioinformatics team at Canada's Michael Smith Genome Sciences Centre. He and his group develop DNA sequence analysis and visualization software and pipelines for various genome and cancer-based research projects.
- Epiq Solutions' Sidekiq M.2
- Android Browser Security--What You Haven't Been Told
- Readers' Choice Awards 2013
- The Many Paths to a Solution
- Nativ Disc
- Download "Linux Management with Red Hat Satellite: Measuring Business Impact and ROI"
- Synopsys' Coverity
- Writing a Simple USB Driver
- Securing the Programmer
- Downloading an Entire Web Site with wget
With all the industry talk about the benefits of Linux on Power and all the performance advantages offered by its open architecture, you may be considering a move in that direction. If you are thinking about analytics, big data and cloud computing, you would be right to evaluate Power. The idea of using commodity x86 hardware and replacing it every three years is an outdated cost model. It doesn’t consider the total cost of ownership, and it doesn’t consider the advantage of real processing power, high-availability and multithreading like a demon.
This ebook takes a look at some of the practical applications of the Linux on Power platform and ways you might bring all the performance power of this open architecture to bear for your organization. There are no smoke and mirrors here—just hard, cold, empirical evidence provided by independent sources. I also consider some innovative ways Linux on Power will be used in the future.Get the Guide