Use Python for Scientific Computing
which would give us a 500x500 element matrix initialized with zeros. We access the real and imaginary parts of each element using:
a.real[0,0]=1.0 a.imag[0,0]=2.0
which would set the value 1+2j into the [0,0] element.
There also are functions to give us more complicated results. These include dot products, inner products, outer products, inverses, transposes, traces and so forth. Needless to say, we have a great deal of tools at our disposal to do a fair amount of science already. But is that enough? Of course not.
Now that we can do some math, how do we get some “real” science done? This is where we start using the features of our second package of interest, scipy. With this package, we have quite a few more functions available to do some fairly sophisticated computational science. Let's look at an example of simple data analysis to show what kind of work is possible.
Let's assume you've collected some data and want to see what form this data has, whether there is any periodicity. The following code lets us do that:
import scipy
inFile = file('input.txt', r)
inArray = scipy.io.read_array(inFile)
outArray = fft(inArray)
outFile = file('output.txt', w)
scipy.io.write_array(outFile, outArray)
As you can see, reading in the data is a one-liner. In this example, we use the FFT functions to convert the signal to the frequency domain. This lets us see the spread of frequencies in the data. The equivalent C or FORTRAN code is simply too large to include here.
But, what if we want to look at this data to see whether there is anything interesting? Luckily, there is another package, called matplotlib, which can be used to generate graphics for this very purpose. If we generate a sine wave and pass it through an FFT, we can see what form this data has by graphing it (Figures 1 and 2).
We see that the sine wave looks regular, and the FFT confirms this by having a single peak at the frequency of the sine wave. We just did some basic data analysis.
This shows us how easy it is to do fairly sophisticated scientific programming. And, if we use an interactive Python environment, we can do this kind of scientific analysis in an exploratory way, allowing us to experiment on our data in near real time.
Luckily for us, the people at the SciPy Project have thought of this and have given us the program ipython. This also is available at the main SciPy site. ipython has been written to work with scipy, numpy and matplotlib in a very seamless way. To execute it with matplotlib support, type:
ipython -pylab
The interface is a simple ASCII one, as shown in Figure 3.
If we use it to plot the sine wave from above, it simply pops up a display window to draw in the plot (Figure 4).
The plot window allows you to save your brilliant graphs and plots, so you can show the entire world your scientific breakthrough. All of the plots for this article actually were generated this way.
So, we've started to do some real computational science and some basic data analysis. What do we do next? Why, we go bigger, of course.
So far, we have looked at relatively small data sets and relatively straightforward computations. But, what if we have really large amounts of data, or we have a much more complex analysis we would like to run? We can take advantage of parallelism and run our code on a high-performance computing cluster.
The good people at the SciPy site have written another module called mpi4py. This module provides a Python implementation of the MPI standard. With it, we can write message-passing programs. It does require some work to install, however.
The first step is to install an MPI implementation for your machine (such as MPICH, OpenMPI or LAM). Most distributions have packages for MPI, so that's the easiest way to install it. Then, you can build and install mpi4py the usual way with the following:
python setup.py build python setup.py install
To test it, execute:
mpirun -np 5 python tests/helloworld.py
Joey Bernard has a background in both physics and computer science. This serves him well in his day job as a computational research consultant at the University of New Brunswick. He also teaches computational physics and parallel programming.
Realizing the promise of Apache® Hadoop® requires the effective deployment of compute, memory, storage and networking to achieve optimal results. With its flexibility and multitude of options, it is easy to over or under provision the server infrastructure, resulting in poor performance and high TCO. Join us for an in depth, technical discussion with industry experts from leading Hadoop and server companies who will provide insights into the key considerations for designing and deploying an optimal Hadoop cluster.
Sponsored by AMD
If you already use virtualized infrastructure, you are well on your way to leveraging the power of the cloud. Virtualization offers the promise of limitless resources, but how do you manage that scalability when your DevOps team doesn’t scale? In today’s hypercompetitive markets, fast results can make a difference between leading the pack vs. obsolescence. Organizations need more benefits from cloud computing than just raw resources. They need agility, flexibility, convenience, ROI, and control.
Stackato private Platform-as-a-Service technology from ActiveState extends your private cloud infrastructure by creating a private PaaS to provide on-demand availability, flexibility, control, and ultimately, faster time-to-market for your enterprise.
Sponsored by ActiveState
| Non-Linux FOSS: libnotify, OS X Style | Jun 18, 2013 |
| Containers—Not Virtual Machines—Are the Future Cloud | Jun 17, 2013 |
| Lock-Free Multi-Producer Multi-Consumer Queue on Ring Buffer | Jun 12, 2013 |
| Weechat, Irssi's Little Brother | Jun 11, 2013 |
| One Tail Just Isn't Enough | Jun 07, 2013 |
| Introduction to MapReduce with Hadoop on Linux | Jun 05, 2013 |
- Containers—Not Virtual Machines—Are the Future Cloud
- Non-Linux FOSS: libnotify, OS X Style
- Validate an E-Mail Address with PHP, the Right Way
- RSS Feeds
- Lock-Free Multi-Producer Multi-Consumer Queue on Ring Buffer
- Introduction to MapReduce with Hadoop on Linux
- Help with Designing or Debugging CORBA Applications
- New Products
- Returning Values from Bash Functions
- Linux Systems Administrator
- Welcome to 1998
10 min 52 sec ago - notifier shortcomings
34 min 34 sec ago - heroku?
2 hours 11 min ago - Android User
2 hours 13 min ago - Reply to comment | Linux Journal
4 hours 6 min ago - compiling
6 hours 55 min ago - This is a good post. This
12 hours 8 min ago - Great, This is really amazing
12 hours 10 min ago - These posts are really good
12 hours 12 min ago - It’s a really great site you
12 hours 14 min ago
Featured Jobs
| Linux Systems Administrator | Houston and Austin, Texas | Host Gator |
| Senior Perl Developer | Austin, Texas | Host Gator |
| Technical Support Rep | Houston and Austin, Texas | Host Gator |
| UX Designer | Austin, Texas | Host Gator |
| Web & UI Developer (JavaScript & j Query) | Austin, Texas | Host Gator |
Free Webinar: Hadoop
How to Build an Optimal Hadoop Cluster to Store and Maintain Unlimited Amounts of Data Using Microservers
Realizing the promise of Apache® Hadoop® requires the effective deployment of compute, memory, storage and networking to achieve optimal results. With its flexibility and multitude of options, it is easy to over or under provision the server infrastructure, resulting in poor performance and high TCO. Join us for an in depth, technical discussion with industry experts from leading Hadoop and server companies who will provide insights into the key considerations for designing and deploying an optimal Hadoop cluster.
Some of key questions to be discussed are:
- What is the “typical” Hadoop cluster and what should be installed on the different machine types?
- Why should you consider the typical workload patterns when making your hardware decisions?
- Are all microservers created equal for Hadoop deployments?
- How do I plan for expansion if I require more compute, memory, storage or networking?








Comments
The C code
I had the same thought as Rob Hooft about the timing when I read this article, and came here to comment. As I suspected, main() gets optimized away to nothing because the result is never used. I typed in the code and compiled it with gcc 4.3 -O3 -fverbose-asm -S:
objdump -d matmul
00000000004004b0 :
4004b0: 31 c0 xor %eax,%eax
4004b2: c3 retq
4004b3: 90 nop
4004b4: 90 nop
...
(I added a return 0 to main(), hence clearing the return value register with xor.)
0.01 seconds is typical for program that just return true without doing anything at all (other than dynamically linking libc, and other libc initialization, at startup, etc). On my system, (Ubuntu Intrepid with Linux 2.6.27, Ubuntu's "generic" AMD64 kernel image on a C2D: E6600 DDR2-800 g965), the program compiled with gcc -O3 takes 0.001 seconds. The reported time has more to do with kernel time resolution than anything else.
Berthold is also correct, although I didn't know what he meant at first, since I wasn't very familiar with numpy. In numpy, a1 * a2 is an element-wise product: in C:
a3[i][j] = a1[i][j] * a2[i][j];
instead of the inner loop over k. This is not the same as what C = A * B means in usual mathematical notation, hence the confusion. There, it means the matrix product, which is what the C routine calculates, and what dgemm() from BLAS calculates. In numpy, that is numpy.dot(a1, a2).
If you don't want gcc (or any other compiler) to optimize away a calculation, you have to use the result. Either with gcc tricks like __attribute__ ((used)) or by passing the address of the output array to a function that the compiler can't see while it's compiling the function you're testing. Calling a function in a file that you compile separately will defeat any optimizer except e.g. Sun Studio's cross-file optimizer mode that puts extra information in the .o files...
matmul.c:
/* usemem.c contains: void usemem(void*p){} * gcc won't do cross-file inlining/optimizations when you don't * compile both files at the same time. Note that -O implies -funit-at-a-time */ void usemem(void *); #define ASIZE 5000 // macro this so you can redefine it for other compilers. #define DECLARE_ALIGNED( var, n ) var __attribute__((aligned(n))) static double DECLARE_ALIGNED(a1[ASIZE][ASIZE], 128); static double DECLARE_ALIGNED(a2[ASIZE][ASIZE], 128); static __attribute__ ((used, aligned(128))) double a3[ASIZE][ASIZE]; int main(){ int i,j; for(i=0; i< ASIZE ; i++){ for(j=0; j< ASIZE ; j++){ double tmp = 0; // allows more optimization than referencing a3[i][j] repeatedly. Really. C is like that. #ifdef ELEMENTWISE tmp = a1[i][j] * a2[i][j]; #else int k; for(k=0; k< ASIZE ; k++){ tmp += a1[i][k] * a2[k][j]; } #endif a3[i][j] = tmp; } } // usemem(a3); return 0; }matmul.py:
results:
gcc 4.3's auto-vectorized (SSE2) version is twice as fast as gcc 3.4's scalar version on a large array like this that doesn't fit in cache (process RSS ~= 450MB). gcc4 vectorizes the loop to process two columns at once, so it only triggers half the cache misses of a direct interpretation of the C code. going down a colum in a row-major array is slow, because each successive element is in a new cache line. Optimized libraries, like ATLAS's BLAS implementation, which numpy uses, can reduce memory access costs to ~n^2 instead of ~n^3, by working in blocks that fit in the cache. GCC's vectorization illustrates that there's a lot to be gained from improved memory access patterns. (it's also a win when you're FPU limited, on small cached arrays, but trust me, it's the memory access that's giving the factor of 2 speedup with an array of 5000. Use oprofile yourself if you want to see.)
So numpy vs. naive C is a huge win if your arrays are not tiny, since even a few decent-sized naive matmuls will dominate your run time regardless of language.
BTW, I used to work at Dalhousie as a Linux cluster sysadmin and all around geek (with the phylogenetics group), where I had heard of ACEnet, where Joey works. Hi!
Error in exaple
Joey,
Your runtime comparison does compare apple with oranges. The C codes does not compute a1*a2 but numpy.dot(a1, a2) in numpy terms.
Regards
Berthold
C optimization
Joey,
Thanks for a nice article expressing a lot of my feelings regarding the comparison of C/Fortran and Python.
I do have one remark: your 250000 element matrix multiplication requires 2x500^3 = 250M floating point operations. You are suggesting that the C program performed those in 0.01 second. That requires 25Gflops and a similar amount of integer operations, simultaneously. This is more than most of us have available in their desktops. I think this shows that "gcc -O3" optimized your entire block of code out of the program, leaving just the startup and termination of the program.
In fact, the Python program you show here performs very close to the optimum, since the total interpreter overhead is only a handful of lines of code. It is in small arrays, like 3x3 matrix multiplications, that C can be significantly faster than Python.