Use Python for Scientific Computing
which will run a five-process Python and run the test script.
Now, we can write our program to divide our large data set among the available processors if that is our bottleneck. Or, if we want to do a large simulation, we can divide the simulation space among all the available processors. Unfortunately, a useful discussion of MPI programming would be another article or two on its own. But, I encourage you to get a good textbook on MPI and do some experimenting yourself.
Although any interpreted language will have a hard time matching the speed of a compiled, optimized language, we have seen that this is not as big a deterrent as it once was. Modern machines run fast enough to more than make up for the overhead of interpretation. This opens up the world of complex applications to using languages like Python.
This article has been able to introduce only the most basic available features. Fortunately, many very good tutorials have been written and are available from the main SciPy site. So, go out and do more science, the Python way.
ScientificPython
numpy and scipy are not the only options available to Python programmers. Another popular package is ScientificPython. It includes geometric types (such as vectors, tensors and quaternions), polynomials, basic statistics, derivatives, interpolation and more. This is the same type of functionality available in scipy. The major difference is that ScientificPython has the ability to do parallel programming built in, whereas scipy requires an extra module. This is done with a partial implementation of MPI and an implementation of the Bulk Synchronous Parallel library (BSPlib).
LAPACK and BLAS
The argument can be made that comparing the complexity of C and FORTRAN to that of Python is unfair, because we actually are using add-on packages in Python. Equivalent libraries can be used in C and FORTRAN, with LAPACK and BLAS being some of the more popular. BLAS provides basic linear algebra functions, while LAPACK builds on these to provide more complex scientific functions. Although these libraries provide optimized routines that will extract every useful cycle from your hardware and are much simpler to write than straight C or FORTRAN, they still are orders of magnitude more complex than the equivalent in Python. If you really do need to squeeze out every last tick from your machine, however, nothing will beat these types of libraries.
Types of Parallel Programming
Parallel programs can, in general, be broken down into two broad categories: shared memory and message passing. In shared-memory parallel programming, the code runs on one physical machine and uses multiple processors. Examples of this type of parallel programming include POSIX threads and OpenMP. This type of parallel code is restricted to the size of the machine that you can build.
To bypass this restriction, you can use message-passing parallel code. In this form, independent execution units communicate by passing messages back and forth. This means they can be on separate machines, as long as they have some means of communication. Examples of this type of parallel programming include MPICH and OpenMPI. Most scientific applications use message passing to achieve parallelism.
Resources
Python Programming Language—Official Web Site: www.python.org
SciPy: www.scipy.org
ScientificPython—Theoretical Biophysics, Molecular Simulation, and Numerically Intensive Computation: dirac.cnrs-orleans.fr/plone/software/scientificpython
Joey Bernard has a background in both physics and computer science. Finally, his latest job with ACEnet has given him the opportunity to use both degrees at the same time, helping researchers do HPC work.
- « first
- ‹ previous
- 1
- 2
- 3
Joey Bernard has a background in both physics and computer science. This serves him well in his day job as a computational research consultant at the University of New Brunswick. He also teaches computational physics and parallel programming.
Realizing the promise of Apache® Hadoop® requires the effective deployment of compute, memory, storage and networking to achieve optimal results. With its flexibility and multitude of options, it is easy to over or under provision the server infrastructure, resulting in poor performance and high TCO. Join us for an in depth, technical discussion with industry experts from leading Hadoop and server companies who will provide insights into the key considerations for designing and deploying an optimal Hadoop cluster.
Sponsored by AMD
Built-in forensics, incident response, and security with Red Hat Enterprise Linux 6
Every security policy provides guidance and requirements for ensuring adequate protection of information and data, as well as high-level technical and administrative security requirements for a system in a given environment. Traditionally, providing security for a system focuses on the confidentiality of the information on it. However, protecting the data integrity and system and data availability is just as important. For example, when processing United States intelligence information, there are three attributes that require protection: confidentiality, integrity, and availability.
Learn more about catching the bad guy in this free white paper.
Sponsored by DLT Solutions
| Dynamic DNS—an Object Lesson in Problem Solving | May 21, 2013 |
| Using Salt Stack and Vagrant for Drupal Development | May 20, 2013 |
| Making Linux and Android Get Along (It's Not as Hard as It Sounds) | May 16, 2013 |
| Drupal Is a Framework: Why Everyone Needs to Understand This | May 15, 2013 |
| Home, My Backup Data Center | May 13, 2013 |
| Non-Linux FOSS: Seashore | May 10, 2013 |
- RSS Feeds
- Making Linux and Android Get Along (It's Not as Hard as It Sounds)
- Using Salt Stack and Vagrant for Drupal Development
- New Products
- Validate an E-Mail Address with PHP, the Right Way
- Drupal Is a Framework: Why Everyone Needs to Understand This
- A Topic for Discussion - Open Source Feature-Richness?
- Download the Free Red Hat White Paper "Using an Open Source Framework to Catch the Bad Guy"
- Home, My Backup Data Center
- Tech Tip: Really Simple HTTP Server with Python
- Please correct the URL for Salt Stack's web site
1 hour 16 min ago - Android is Linux -- why no better inter-operation
3 hours 32 min ago - Connecting Android device to desktop Linux via USB
4 hours 35 sec ago - Find new cell phone and tablet pc
4 hours 58 min ago - Epistle
6 hours 27 min ago - Automatically updating Guest Additions
7 hours 36 min ago - I like your topic on android
8 hours 22 min ago - This is the easiest tutorial
14 hours 58 min ago - Ahh, the Koolaid.
20 hours 36 min ago - git-annex assistant
1 day 2 hours ago
Enter to Win an Adafruit Pi Cobbler Breakout Kit for Raspberry Pi

It's Raspberry Pi month at Linux Journal. Each week in May, Adafruit will be giving away a Pi-related prize to a lucky, randomly drawn LJ reader. Winners will be announced weekly.
Fill out the fields below to enter to win this week's prize-- a Pi Cobbler Breakout Kit for Raspberry Pi.
Congratulations to our winners so far:
- 5-8-13, Pi Starter Pack: Jack Davis
- 5-15-13, Pi Model B 512MB RAM: Patrick Dunn
- 5-21-13, Prototyping Pi Plate Kit: Philip Kirby
- Next winner announced on 5-27-13!
Free Webinar: Hadoop
How to Build an Optimal Hadoop Cluster to Store and Maintain Unlimited Amounts of Data Using Microservers
Realizing the promise of Apache® Hadoop® requires the effective deployment of compute, memory, storage and networking to achieve optimal results. With its flexibility and multitude of options, it is easy to over or under provision the server infrastructure, resulting in poor performance and high TCO. Join us for an in depth, technical discussion with industry experts from leading Hadoop and server companies who will provide insights into the key considerations for designing and deploying an optimal Hadoop cluster.
Some of key questions to be discussed are:
- What is the “typical” Hadoop cluster and what should be installed on the different machine types?
- Why should you consider the typical workload patterns when making your hardware decisions?
- Are all microservers created equal for Hadoop deployments?
- How do I plan for expansion if I require more compute, memory, storage or networking?




Comments
The C code
I had the same thought as Rob Hooft about the timing when I read this article, and came here to comment. As I suspected, main() gets optimized away to nothing because the result is never used. I typed in the code and compiled it with gcc 4.3 -O3 -fverbose-asm -S:
objdump -d matmul
00000000004004b0 :
4004b0: 31 c0 xor %eax,%eax
4004b2: c3 retq
4004b3: 90 nop
4004b4: 90 nop
...
(I added a return 0 to main(), hence clearing the return value register with xor.)
0.01 seconds is typical for program that just return true without doing anything at all (other than dynamically linking libc, and other libc initialization, at startup, etc). On my system, (Ubuntu Intrepid with Linux 2.6.27, Ubuntu's "generic" AMD64 kernel image on a C2D: E6600 DDR2-800 g965), the program compiled with gcc -O3 takes 0.001 seconds. The reported time has more to do with kernel time resolution than anything else.
Berthold is also correct, although I didn't know what he meant at first, since I wasn't very familiar with numpy. In numpy, a1 * a2 is an element-wise product: in C:
a3[i][j] = a1[i][j] * a2[i][j];
instead of the inner loop over k. This is not the same as what C = A * B means in usual mathematical notation, hence the confusion. There, it means the matrix product, which is what the C routine calculates, and what dgemm() from BLAS calculates. In numpy, that is numpy.dot(a1, a2).
If you don't want gcc (or any other compiler) to optimize away a calculation, you have to use the result. Either with gcc tricks like __attribute__ ((used)) or by passing the address of the output array to a function that the compiler can't see while it's compiling the function you're testing. Calling a function in a file that you compile separately will defeat any optimizer except e.g. Sun Studio's cross-file optimizer mode that puts extra information in the .o files...
matmul.c:
/* usemem.c contains: void usemem(void*p){} * gcc won't do cross-file inlining/optimizations when you don't * compile both files at the same time. Note that -O implies -funit-at-a-time */ void usemem(void *); #define ASIZE 5000 // macro this so you can redefine it for other compilers. #define DECLARE_ALIGNED( var, n ) var __attribute__((aligned(n))) static double DECLARE_ALIGNED(a1[ASIZE][ASIZE], 128); static double DECLARE_ALIGNED(a2[ASIZE][ASIZE], 128); static __attribute__ ((used, aligned(128))) double a3[ASIZE][ASIZE]; int main(){ int i,j; for(i=0; i< ASIZE ; i++){ for(j=0; j< ASIZE ; j++){ double tmp = 0; // allows more optimization than referencing a3[i][j] repeatedly. Really. C is like that. #ifdef ELEMENTWISE tmp = a1[i][j] * a2[i][j]; #else int k; for(k=0; k< ASIZE ; k++){ tmp += a1[i][k] * a2[k][j]; } #endif a3[i][j] = tmp; } } // usemem(a3); return 0; }matmul.py:
results:
gcc 4.3's auto-vectorized (SSE2) version is twice as fast as gcc 3.4's scalar version on a large array like this that doesn't fit in cache (process RSS ~= 450MB). gcc4 vectorizes the loop to process two columns at once, so it only triggers half the cache misses of a direct interpretation of the C code. going down a colum in a row-major array is slow, because each successive element is in a new cache line. Optimized libraries, like ATLAS's BLAS implementation, which numpy uses, can reduce memory access costs to ~n^2 instead of ~n^3, by working in blocks that fit in the cache. GCC's vectorization illustrates that there's a lot to be gained from improved memory access patterns. (it's also a win when you're FPU limited, on small cached arrays, but trust me, it's the memory access that's giving the factor of 2 speedup with an array of 5000. Use oprofile yourself if you want to see.)
So numpy vs. naive C is a huge win if your arrays are not tiny, since even a few decent-sized naive matmuls will dominate your run time regardless of language.
BTW, I used to work at Dalhousie as a Linux cluster sysadmin and all around geek (with the phylogenetics group), where I had heard of ACEnet, where Joey works. Hi!
Error in exaple
Joey,
Your runtime comparison does compare apple with oranges. The C codes does not compute a1*a2 but numpy.dot(a1, a2) in numpy terms.
Regards
Berthold
C optimization
Joey,
Thanks for a nice article expressing a lot of my feelings regarding the comparison of C/Fortran and Python.
I do have one remark: your 250000 element matrix multiplication requires 2x500^3 = 250M floating point operations. You are suggesting that the C program performed those in 0.01 second. That requires 25Gflops and a similar amount of integer operations, simultaneously. This is more than most of us have available in their desktops. I think this shows that "gcc -O3" optimized your entire block of code out of the program, leaving just the startup and termination of the program.
In fact, the Python program you show here performs very close to the optimum, since the total interpreter overhead is only a handful of lines of code. It is in small arrays, like 3x3 matrix multiplications, that C can be significantly faster than Python.