Use Python for Scientific Computing
As computers become more and more powerful, scientific computing is becoming a more important part of fundamental research into how our world works. We can do more now than we could even imagine just a mere decade ago.
Most of this work has been done traditionally in more low-level languages, such as C or FORTRAN. Originally, this was done in order to maximize the efficiency of the code and to squeeze out every last bit of work from the computer. With computers now reaching multi-GHz speeds, this is no longer the bottleneck it once was. Other efficiencies come into play, with programmer efficiency being paramount. With this in mind, other languages are being considered that help make the most of a progammer's time and effort.
This article discusses one of these options: Python. Although Python is an interpreted language and suffers, unjustly, from the stigma that entails, it is growing in popularity among scientists for its clarity of style and the availability of many useful packages. The packages I look at in this article specifically are designed to provide fast, robust mathematical and scientific tools that can run nearly as fast as C or FORTRAN code.
The packages I focus on here are called numpy and scipy. They are both available from the main SciPy site (see Resources). But before we download them, what exactly are numpy and scipy?
numpy is a Python package that provides extended math capabilities. These include new data types, such as long integers of unlimited size and complex numbers. It also provides a new array data type that allows for the construction of vectors and matrices. All the basic operations that can be applied to these new data types also are included. With this we can get quite a bit of scientific work done already.
scipy is a further extension built on top of numpy. This package simplifies a lot of the more-common tasks that need to be handled, including tools such as those used to find the roots of polynomials, doing Fourier transformations, doing numerical integrals and enhanced I/O. With these functions, a user can develop very sophisticated scientific applications in relatively short order.
Now that we know what numpy and scipy are, how do we get them and start using them? Most distributions include both of these packages, making this the easy way to install them. Simply use your distribution's package manager to do the install. For example, in Ubuntu, you would type the following in a terminal window:
sudo apt-get install python-scipy
This installs scipy and all of its dependencies.
If you want to use the latest-and-greatest version and don't want to wait for your distribution to get updated, they are available through Subversion. Simply execute the following:
svn co http://svn.scipy.org/svn/numpy/trunk numpy svn co http://svn.scipy.org/svn/scipy/trunk scipy
Building and installing is handled by a setup.py script in the source directory. For most people, building and installing simply requires:
python setup.py build python setup.py install # done as root
If you don't have root access, or don't want to install into the system package directory, you can install into a different directory using:
python setup.py install --prefix=/path/to/install/dir
Other options also are available, which you can find out about by using:
python setup.py --help-commands
Take time to experiment and see whether you can use any of the extra options in your specific case.
Now that we have scipy and numpy installed, let's begin our tour by looking at some of the basic functions that are often used in scientific calculations. One of the most common tasks is matrix mathematics. This is greatly simplified when you use numpy. The most basic code to do a multiplication of two matrices using numpy would look like this:
import numpy a1=numpy.empty((500,500)) a2=numpy.empty((500,500)) a3=a1*a2
Contrast this to what we would need to write if we did it in C:
#include <stdlib.h>
int main() {
double a1[500][500];
double a2[500][500];
double a3[500][500];
int i, j, k;
for (i=0; i<500; i++) {
for (j=0; j<500; j++) {
a3[i][j] = 0;
for (k=0; k<500; k++) {
a3[i][j] += a1[i][k] * a2[k][j];
}
}
}
}
The Python code is much shorter and cleaner, and the intent of the code is much clearer. This kind of clarity in the code means that the programmer can focus much more on the algorithm rather than the gritty details of the implementation. There are C libraries, such as LAPACK, which help simplify this work in C. But, even these libraries can't match the simplicity of scipy.
“But what about efficiency?”, I hear you ask. Well, let's take a look at it with some timed runs. Taking our above example, we can put some calls around the actual matrix multiplication part and see how long each one takes. See Table 1 for the results.
Although your mileage will vary, because these times depend on your hardware and what other programs also are running on your machine, we can see a general trend. The Python code actually was about eight times faster than the C code compiled with no command-line options. That is actually quite surprising. Once we use the optimization command-line option, we see that the C code is now faster, by a factor of approximately 25. So, we can get faster code using optimized C, but we need to realize that multiplying two matrices with 250,000 elements each in one-quarter of a second is probably fast enough.
As well, we get a certain amount of protection when we use Python. What happens if we try to multiply two matrices where such multiplication doesn't make sense mathematically? When we try to multiply two matrices of different sizes, Python gives us:
ValueError: shape mismatch: objects cannot be broadcast to a single shape
In C, we get no error at all. This due to the fact that when we work with matrices, we actually are using pointer arithmetic. So pretty much anything we do is valid C, even if it makes no sense in the problem domain.
We also can work just as easily with complex numbers. If we wanted to create an array of 64-bit complex numbers, we would write:
a=zeros((500,500), dtype=complex64)
Joey Bernard has a background in both physics and computer science. This serves him well in his day job as a computational research consultant at the University of New Brunswick. He also teaches computational physics and parallel programming.
Today’s modular x86 servers are compute-centric, designed as a least common denominator to support a wide range of IT workloads. Those generic, virtualized IT workloads have much different resource optimization requirements than hyperscale and cloud applications. They have resulted in a “one size fits all” enterprise IT architecture that is not optimized for a specific set of IT workloads, and especially not emerging hyperscale workloads, such as web applications, big data, and object storage. In this report, you will learn how shifting the focus from traditional compute-centric IT architectures to an innovative disaggregated fabric-based architecture can optimize and scale your data center.
Sponsored by AMD
Built-in forensics, incident response, and security with Red Hat Enterprise Linux 6
Every security policy provides guidance and requirements for ensuring adequate protection of information and data, as well as high-level technical and administrative security requirements for a system in a given environment. Traditionally, providing security for a system focuses on the confidentiality of the information on it. However, protecting the data integrity and system and data availability is just as important. For example, when processing United States intelligence information, there are three attributes that require protection: confidentiality, integrity, and availability.
Learn more about catching the bad guy in this free white paper.
Sponsored by DLT Solutions
| Using Salt Stack and Vagrant for Drupal Development | May 20, 2013 |
| Making Linux and Android Get Along (It's Not as Hard as It Sounds) | May 16, 2013 |
| Drupal Is a Framework: Why Everyone Needs to Understand This | May 15, 2013 |
| Home, My Backup Data Center | May 13, 2013 |
| Non-Linux FOSS: Seashore | May 10, 2013 |
| Trying to Tame the Tablet | May 08, 2013 |
- Using Salt Stack and Vagrant for Drupal Development
- Making Linux and Android Get Along (It's Not as Hard as It Sounds)
- New Products
- Validate an E-Mail Address with PHP, the Right Way
- Drupal Is a Framework: Why Everyone Needs to Understand This
- A Topic for Discussion - Open Source Feature-Richness?
- The Pari Package On Linux
- Home, My Backup Data Center
- New Products
- Developer Poll
- This is the easiest tutorial
4 hours 3 min ago - Ahh, the Koolaid.
9 hours 41 min ago - git-annex assistant
15 hours 41 min ago - direct cable connection
16 hours 3 min ago - Agreed on AirDroid. With my
16 hours 14 min ago - I just learned this
16 hours 18 min ago - enterprise
16 hours 48 min ago - not living upto the mobile revolution
19 hours 39 min ago - Deceptive Advertising and
20 hours 15 min ago - Let\'s declare that you have
20 hours 16 min ago
Enter to Win an Adafruit Prototyping Pi Plate Kit for Raspberry Pi

It's Raspberry Pi month at Linux Journal. Each week in May, Adafruit will be giving away a Pi-related prize to a lucky, randomly drawn LJ reader. Winners will be announced weekly.
Fill out the fields below to enter to win this week's prize-- a Prototyping Pi Plate Kit for Raspberry Pi.
Congratulations to our winners so far:
- 5-8-13, Pi Starter Pack: Jack Davis
- 5-15-13, Pi Model B 512MB RAM: Patrick Dunn
- Next winner announced on 5-21-13!
Free Webinar: Linux Backup and Recovery
Most companies incorporate backup procedures for critical data, which can be restored quickly if a loss occurs. However, fewer companies are prepared for catastrophic system failures, in which they lose all data, the entire operating system, applications, settings, patches and more, reducing their system(s) to “bare metal.” After all, before data can be restored to a system, there must be a system to restore it to.
In this one hour webinar, learn how to enhance your existing backup strategies for better disaster recovery preparedness using Storix System Backup Administrator (SBAdmin), a highly flexible bare-metal recovery solution for UNIX and Linux systems.




Comments
The C code
I had the same thought as Rob Hooft about the timing when I read this article, and came here to comment. As I suspected, main() gets optimized away to nothing because the result is never used. I typed in the code and compiled it with gcc 4.3 -O3 -fverbose-asm -S:
objdump -d matmul
00000000004004b0 :
4004b0: 31 c0 xor %eax,%eax
4004b2: c3 retq
4004b3: 90 nop
4004b4: 90 nop
...
(I added a return 0 to main(), hence clearing the return value register with xor.)
0.01 seconds is typical for program that just return true without doing anything at all (other than dynamically linking libc, and other libc initialization, at startup, etc). On my system, (Ubuntu Intrepid with Linux 2.6.27, Ubuntu's "generic" AMD64 kernel image on a C2D: E6600 DDR2-800 g965), the program compiled with gcc -O3 takes 0.001 seconds. The reported time has more to do with kernel time resolution than anything else.
Berthold is also correct, although I didn't know what he meant at first, since I wasn't very familiar with numpy. In numpy, a1 * a2 is an element-wise product: in C:
a3[i][j] = a1[i][j] * a2[i][j];
instead of the inner loop over k. This is not the same as what C = A * B means in usual mathematical notation, hence the confusion. There, it means the matrix product, which is what the C routine calculates, and what dgemm() from BLAS calculates. In numpy, that is numpy.dot(a1, a2).
If you don't want gcc (or any other compiler) to optimize away a calculation, you have to use the result. Either with gcc tricks like __attribute__ ((used)) or by passing the address of the output array to a function that the compiler can't see while it's compiling the function you're testing. Calling a function in a file that you compile separately will defeat any optimizer except e.g. Sun Studio's cross-file optimizer mode that puts extra information in the .o files...
matmul.c:
/* usemem.c contains: void usemem(void*p){} * gcc won't do cross-file inlining/optimizations when you don't * compile both files at the same time. Note that -O implies -funit-at-a-time */ void usemem(void *); #define ASIZE 5000 // macro this so you can redefine it for other compilers. #define DECLARE_ALIGNED( var, n ) var __attribute__((aligned(n))) static double DECLARE_ALIGNED(a1[ASIZE][ASIZE], 128); static double DECLARE_ALIGNED(a2[ASIZE][ASIZE], 128); static __attribute__ ((used, aligned(128))) double a3[ASIZE][ASIZE]; int main(){ int i,j; for(i=0; i< ASIZE ; i++){ for(j=0; j< ASIZE ; j++){ double tmp = 0; // allows more optimization than referencing a3[i][j] repeatedly. Really. C is like that. #ifdef ELEMENTWISE tmp = a1[i][j] * a2[i][j]; #else int k; for(k=0; k< ASIZE ; k++){ tmp += a1[i][k] * a2[k][j]; } #endif a3[i][j] = tmp; } } // usemem(a3); return 0; }matmul.py:
results:
gcc 4.3's auto-vectorized (SSE2) version is twice as fast as gcc 3.4's scalar version on a large array like this that doesn't fit in cache (process RSS ~= 450MB). gcc4 vectorizes the loop to process two columns at once, so it only triggers half the cache misses of a direct interpretation of the C code. going down a colum in a row-major array is slow, because each successive element is in a new cache line. Optimized libraries, like ATLAS's BLAS implementation, which numpy uses, can reduce memory access costs to ~n^2 instead of ~n^3, by working in blocks that fit in the cache. GCC's vectorization illustrates that there's a lot to be gained from improved memory access patterns. (it's also a win when you're FPU limited, on small cached arrays, but trust me, it's the memory access that's giving the factor of 2 speedup with an array of 5000. Use oprofile yourself if you want to see.)
So numpy vs. naive C is a huge win if your arrays are not tiny, since even a few decent-sized naive matmuls will dominate your run time regardless of language.
BTW, I used to work at Dalhousie as a Linux cluster sysadmin and all around geek (with the phylogenetics group), where I had heard of ACEnet, where Joey works. Hi!
Error in exaple
Joey,
Your runtime comparison does compare apple with oranges. The C codes does not compute a1*a2 but numpy.dot(a1, a2) in numpy terms.
Regards
Berthold
C optimization
Joey,
Thanks for a nice article expressing a lot of my feelings regarding the comparison of C/Fortran and Python.
I do have one remark: your 250000 element matrix multiplication requires 2x500^3 = 250M floating point operations. You are suggesting that the C program performed those in 0.01 second. That requires 25Gflops and a similar amount of integer operations, simultaneously. This is more than most of us have available in their desktops. I think this shows that "gcc -O3" optimized your entire block of code out of the program, leaving just the startup and termination of the program.
In fact, the Python program you show here performs very close to the optimum, since the total interpreter overhead is only a handful of lines of code. It is in small arrays, like 3x3 matrix multiplications, that C can be significantly faster than Python.