Use Python for Scientific Computing

 in
As a general-purpose programming language, Python's benefits are well recognized. With the help of some add-on packages, you can use Python for scientific computing tasks as well.

which will run a five-process Python and run the test script.

Now, we can write our program to divide our large data set among the available processors if that is our bottleneck. Or, if we want to do a large simulation, we can divide the simulation space among all the available processors. Unfortunately, a useful discussion of MPI programming would be another article or two on its own. But, I encourage you to get a good textbook on MPI and do some experimenting yourself.

Conclusion

Although any interpreted language will have a hard time matching the speed of a compiled, optimized language, we have seen that this is not as big a deterrent as it once was. Modern machines run fast enough to more than make up for the overhead of interpretation. This opens up the world of complex applications to using languages like Python.

This article has been able to introduce only the most basic available features. Fortunately, many very good tutorials have been written and are available from the main SciPy site. So, go out and do more science, the Python way.

Joey Bernard has a background in both physics and computer science. Finally, his latest job with ACEnet has given him the opportunity to use both degrees at the same time, helping researchers do HPC work.

______________________

Joey Bernard has a background in both physics and computer science. This serves him well in his day job as a computational research consultant at the University of New Brunswick. He also teaches computational physics and parallel programming.

Comments

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.

The C code

Peter Cordes's picture

I had the same thought as Rob Hooft about the timing when I read this article, and came here to comment. As I suspected, main() gets optimized away to nothing because the result is never used. I typed in the code and compiled it with gcc 4.3 -O3 -fverbose-asm -S:

objdump -d matmul
00000000004004b0 :
4004b0: 31 c0 xor %eax,%eax
4004b2: c3 retq
4004b3: 90 nop
4004b4: 90 nop
...
(I added a return 0 to main(), hence clearing the return value register with xor.)

0.01 seconds is typical for program that just return true without doing anything at all (other than dynamically linking libc, and other libc initialization, at startup, etc). On my system, (Ubuntu Intrepid with Linux 2.6.27, Ubuntu's "generic" AMD64 kernel image on a C2D: E6600 DDR2-800 g965), the program compiled with gcc -O3 takes 0.001 seconds. The reported time has more to do with kernel time resolution than anything else.

Berthold is also correct, although I didn't know what he meant at first, since I wasn't very familiar with numpy. In numpy, a1 * a2 is an element-wise product: in C:
a3[i][j] = a1[i][j] * a2[i][j];
instead of the inner loop over k. This is not the same as what C = A * B means in usual mathematical notation, hence the confusion. There, it means the matrix product, which is what the C routine calculates, and what dgemm() from BLAS calculates. In numpy, that is numpy.dot(a1, a2).

If you don't want gcc (or any other compiler) to optimize away a calculation, you have to use the result. Either with gcc tricks like __attribute__ ((used)) or by passing the address of the output array to a function that the compiler can't see while it's compiling the function you're testing. Calling a function in a file that you compile separately will defeat any optimizer except e.g. Sun Studio's cross-file optimizer mode that puts extra information in the .o files...

matmul.c:

/* usemem.c contains: void usemem(void*p){}
 * gcc won't do cross-file inlining/optimizations when you don't
 * compile both files at the same time.  Note that -O implies -funit-at-a-time */
void usemem(void *);

#define ASIZE 5000

// macro this so you can redefine it for other compilers.
#define DECLARE_ALIGNED( var, n ) var __attribute__((aligned(n)))
static double DECLARE_ALIGNED(a1[ASIZE][ASIZE], 128);
static double DECLARE_ALIGNED(a2[ASIZE][ASIZE], 128);
static __attribute__ ((used, aligned(128))) double a3[ASIZE][ASIZE];

int main(){
        int i,j;
        for(i=0; i&lt ASIZE ; i++){
                for(j=0; j&lt ASIZE ; j++){
                        double tmp = 0;  // allows more optimization than referencing a3[i][j] repeatedly.  Really.  C is like that.
#ifdef ELEMENTWISE
                        tmp = a1[i][j] * a2[i][j];
#else
                        int k;
                        for(k=0; k&lt ASIZE ; k++){
                                tmp += a1[i][k] * a2[k][j];
                        }
#endif
                        a3[i][j] = tmp;
                }
        }
//      usemem(a3);
        return 0;
}

matmul.py:

import numpy

a1=numpy.ones((5000,5000))
a2=numpy.ones((5000,5000))
a1[1,1] = 4
a1[1,2] = 4
a1[2,1] = 4
a1[2,2] = 4
a2[1,1] = 4
a2[1,2] = 4
a2[2,1] = 4
a2[2,2] = 4
a2[0,0] = 4

# a3=a1*a2
a3=numpy.dot(a1, a2)
print a3

results:

$ time python matmul.py
real    0m59.322s
user    0m58.256s
sys     0m0.496s

$ time ./matmul-gcc3.4
real    22m11.913s
user    21m57.890s
sys     0m2.552s

gcc -O3 -march=native -frecord-gcc-switches (gcc 4.3)
$ time ./matmul-O3-native-vect
real    11m41.880s
user    11m32.791s
sys     0m1.808s

$ time ./matmul-used-aligned
real    11m38.105s
user    11m28.923s
sys     0m1.536s

gcc -O3 -march=native -DELEMENTWISE
$ time ./matmul-elementwise
real    0m0.034s
user    0m0.004s
sys     0m0.004s

gcc 4.3's auto-vectorized (SSE2) version is twice as fast as gcc 3.4's scalar version on a large array like this that doesn't fit in cache (process RSS ~= 450MB). gcc4 vectorizes the loop to process two columns at once, so it only triggers half the cache misses of a direct interpretation of the C code. going down a colum in a row-major array is slow, because each successive element is in a new cache line. Optimized libraries, like ATLAS's BLAS implementation, which numpy uses, can reduce memory access costs to ~n^2 instead of ~n^3, by working in blocks that fit in the cache. GCC's vectorization illustrates that there's a lot to be gained from improved memory access patterns. (it's also a win when you're FPU limited, on small cached arrays, but trust me, it's the memory access that's giving the factor of 2 speedup with an array of 5000. Use oprofile yourself if you want to see.)

So numpy vs. naive C is a huge win if your arrays are not tiny, since even a few decent-sized naive matmuls will dominate your run time regardless of language.

BTW, I used to work at Dalhousie as a Linux cluster sysadmin and all around geek (with the phylogenetics group), where I had heard of ACEnet, where Joey works. Hi!

Error in exaple

bhoel's picture

Joey,

Your runtime comparison does compare apple with oranges. The C codes does not compute a1*a2 but numpy.dot(a1, a2) in numpy terms.

Regards
Berthold

C optimization

Rob Hooft's picture

Joey,

Thanks for a nice article expressing a lot of my feelings regarding the comparison of C/Fortran and Python.

I do have one remark: your 250000 element matrix multiplication requires 2x500^3 = 250M floating point operations. You are suggesting that the C program performed those in 0.01 second. That requires 25Gflops and a similar amount of integer operations, simultaneously. This is more than most of us have available in their desktops. I think this shows that "gcc -O3" optimized your entire block of code out of the program, leaving just the startup and termination of the program.

In fact, the Python program you show here performs very close to the optimum, since the total interpreter overhead is only a handful of lines of code. It is in small arrays, like 3x3 matrix multiplications, that C can be significantly faster than Python.

White Paper
Linux Management with Red Hat Satellite: Measuring Business Impact and ROI

Linux has become a key foundation for supporting today's rapidly growing IT environments. Linux is being used to deploy business applications and databases, trading on its reputation as a low-cost operating environment. For many IT organizations, Linux is a mainstay for deploying Web servers and has evolved from handling basic file, print, and utility workloads to running mission-critical applications and databases, physically, virtually, and in the cloud. As Linux grows in importance in terms of value to the business, managing Linux environments to high standards of service quality — availability, security, and performance — becomes an essential requirement for business success.

Learn More

Sponsored by Red Hat

White Paper
Private PaaS for the Agile Enterprise

If you already use virtualized infrastructure, you are well on your way to leveraging the power of the cloud. Virtualization offers the promise of limitless resources, but how do you manage that scalability when your DevOps team doesn’t scale? In today’s hypercompetitive markets, fast results can make a difference between leading the pack vs. obsolescence. Organizations need more benefits from cloud computing than just raw resources. They need agility, flexibility, convenience, ROI, and control.

Stackato private Platform-as-a-Service technology from ActiveState extends your private cloud infrastructure by creating a private PaaS to provide on-demand availability, flexibility, control, and ultimately, faster time-to-market for your enterprise.

Learn More

Sponsored by ActiveState