Multiprocessing in Python

Python's "multiprocessing" module feels like threads, but actually launches processes.

Many people, when they start to work with Python, are excited to hear that the language supports threading. And, as I've discussed in previous articles, Python does indeed support native-level threads with an easy-to-use and convenient interface.

However, there is a downside to these threads, namely the global interpreter lock (GIL), which ensures that only one thread runs Python code at a time. Because a thread cedes the GIL whenever it waits for I/O, threads are a bad idea in CPU-bound Python programs, but a good one when your program spends most of its time on I/O.
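
To see the GIL's effect on CPU-bound code, here's a minimal sketch of my own (the count function is just hypothetical busy work): it counts down from a large number twice, once sequentially and once in two threads. On CPython, the threaded version typically takes at least as long as the sequential one:


#!/usr/bin/env python3

import threading
import time

def count(n):
    # pure CPU work; the thread never waits for I/O, so it
    # only cedes the GIL when the interpreter forces it to
    while n > 0:
        n -= 1

N = 10000000

start = time.time()
count(N)
count(N)
print("Sequential: {0:.2f} seconds".format(time.time() - start))

start = time.time()
t1 = threading.Thread(target=count, args=(N,))
t2 = threading.Thread(target=count, args=(N,))
t1.start()
t2.start()
t1.join()
t2.join()
print("Two threads: {0:.2f} seconds".format(time.time() - start))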

But even when your program does lots of I/O, you might prefer to take full advantage of a multicore system. And in the world of Python, that means using processes.

In my article "Launching External Processes in Python", I described how you can launch processes from within a Python program, but those examples all launched an external program in a new process. Here, I'm interested in processes that run your own Python code. Such processes work much like threads, but are even more independent (and carry more overhead as well).

So, it's something of a dilemma: do you launch easy-to-use threads, even though they don't really run in parallel? Or, do you launch new processes, over which you have little control?

The answer is somewhere in the middle. The Python standard library comes with "multiprocessing", a module that gives the feeling of working with threads, but that actually works with processes.

So in this article, I look at the "multiprocessing" library and describe some of the basic things it can do.

Multiprocessing Basics

The "multiprocessing" module is designed to look and feel like the "threading" module, and it largely succeeds in doing so. For example, the following is a simple example of a multithreaded program:


#!/usr/bin/env python3

import threading
import time
import random

def hello(n):
    time.sleep(random.randint(1,3))
    print("[{0}] Hello!".format(n))

for i in range(10):
    threading.Thread(target=hello, args=(i,)).start()

print("Done!")

In this example, there is a function (hello) that prints "Hello!" along with whatever argument it was passed. The for loop then runs hello ten times, each time in an independent thread.

But wait: before the function prints its output, it first sleeps for one to three seconds. When you run this program, you end up with output demonstrating that the threads run in parallel, and not necessarily in the order in which they were started:


$ ./thread1.py
Done!
[2] Hello!
[0] Hello!
[3] Hello!
[6] Hello!
[9] Hello!
[1] Hello!
[5] Hello!
[8] Hello!
[4] Hello!
[7] Hello!

If you want to be sure that "Done!" is printed after all the threads have finished running, you can use join. To do that, you need to grab each instance of threading.Thread, put it in a list, and then invoke join on each thread:


#!/usr/bin/env python3

import threading
import time
import random

def hello(n):
    time.sleep(random.randint(1,3))
    print("[{0}] Hello!".format(n))

threads = [ ]
for i in range(10):
    t = threading.Thread(target=hello, args=(i,))
    threads.append(t)
    t.start()

for one_thread in threads:
    one_thread.join()

print("Done!")

The only difference in this version is that it puts each thread object in a list ("threads") and then iterates over that list, joining the threads one by one.

But wait a second—I promised that I'd talk about "multiprocessing", not threading. What gives?

Well, "multiprocessing" was designed to give the feeling of working with threads. This is so true that I basically can do some search-and-replace on the program I just presented:

  • threading → multiprocessing
  • Thread → Process
  • threads → processes
  • thread → process

The result is as follows:


#!/usr/bin/env python3

import multiprocessing
import time
import random

def hello(n):
    time.sleep(random.randint(1,3))
    print("[{0}] Hello!".format(n))

processes = [ ]
for i in range(10):
    t = multiprocessing.Process(target=hello, args=(i,))
    processes.append(t)
    t.start()

for one_process in processes:
    one_process.join()

print("Done!")

In other words, multiprocessing.Process lets you run a function in a new process, with true concurrency that can take advantage of multiple cores. It works very much like a thread, including the use of join on the Process objects you create. Each instance of Process represents a process running on the computer, which you can see using ps, and which you can (in theory) stop with kill.
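
To make that correspondence concrete, here's a small sketch of my own (snooze is just a stand-in for real work): it prints the child's PID, the same one ps would show, and then terminates the child, much as kill would:


#!/usr/bin/env python3

import multiprocessing
import time

def snooze():
    time.sleep(30)

p = multiprocessing.Process(target=snooze)
p.start()
print("Child PID: {0}".format(p.pid))       # the PID you'd see in ps
p.terminate()                               # roughly what kill does
p.join()
print("Exit code: {0}".format(p.exitcode))  # negative means killed by a signal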

What's the Difference?

What's amazing to me is that the API is almost identical, and yet two very different things are happening behind the scenes. Let me try to make the distinction clearer with another pair of examples.

Perhaps the biggest difference, at least to anyone programming with threads and processes, is the fact that threads share global variables. By contrast, separate processes are completely separate; one process cannot affect another's variables. (In a future article, I plan to look at how to get around that.)

Here's a simple example of how a function running in a thread can modify a global variable (note that I'm doing this only to prove a point; if you really want to modify global variables from within a thread, you should use a lock):


#!/usr/bin/env python3

import threading
import time
import random

mylist = [ ]

def hello(n):
    time.sleep(random.randint(1,3))
    mylist.append(threading.get_ident())   # bad in real code!
    print("[{0}] Hello!".format(n))

threads = [ ]
for i in range(10):
    t = threading.Thread(target=hello, args=(i,))
    threads.append(t)
    t.start()

for one_thread in threads:
    one_thread.join()

print("Done!")
print(len(mylist))
print(mylist)

The program is basically unchanged, except that it defines a new, empty list (mylist) at the top. The function appends the current thread's ID (from threading.get_ident) to that list and then returns.

Now, the way I'm doing this isn't so wise, because Python's data structures generally aren't thread-safe, and modifying a shared list from multiple threads eventually will catch up with you. But the point here isn't to demonstrate threads; rather, it's to contrast them with processes.

When I run the above code, I get:


$ ./th-update-list.py
[0] Hello!
[2] Hello!
[6] Hello!
[3] Hello!
[1] Hello!
[4] Hello!
[5] Hello!
[7] Hello!
[8] Hello!
[9] Hello!
Done!
10
[123145344081920, 123145354592256, 123145375612928,
 123145359847424, 123145349337088, 123145365102592,
 123145370357760, 123145380868096, 123145386123264,
 123145391378432]

So, you can see that the global variable mylist is shared by the threads, and that when one thread modifies the list, that change is visible to all the other threads.
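
And if you really do need to modify a shared data structure from multiple threads, here's a sketch of the lock-based approach I mentioned earlier; the only change from the previous program is a shared threading.Lock guarding every append to mylist:


#!/usr/bin/env python3

import threading
import time
import random

mylist = [ ]
lock = threading.Lock()

def hello(n):
    time.sleep(random.randint(1,3))
    with lock:                    # only one thread appends at a time
        mylist.append(threading.get_ident())
    print("[{0}] Hello!".format(n))

threads = [ ]
for i in range(10):
    t = threading.Thread(target=hello, args=(i,))
    threads.append(t)
    t.start()

for one_thread in threads:
    one_thread.join()

print("Done!")
print(len(mylist))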

But if you change the program to use "multiprocessing", the output looks a bit different:


#!/usr/bin/env python3

import multiprocessing
import time
import random
import os

mylist = [ ]

def hello(n):
    time.sleep(random.randint(1,3))
    mylist.append(os.getpid())
    print("[{0}] Hello!".format(n))

processes = [ ]
for i in range(10):
    t = multiprocessing.Process(target=hello, args=(i,))
    processes.append(t)
    t.start()

for one_process in processes:
    one_process.join()

print("Done!")
print(len(mylist))
print(mylist)

Aside from the switch to multiprocessing, the biggest change in this version of the program is the use of os.getpid to get the current process ID.

The output from this program is as follows:


$ ./proc-update-list.py
[0] Hello!
[4] Hello!
[7] Hello!
[8] Hello!
[2] Hello!
[5] Hello!
[6] Hello!
[9] Hello!
[1] Hello!
[3] Hello!
Done!
0
[]

Everything seems great until the end when it checks the value of mylist. What happened to it? Didn't the program append to it?

Sort of. The thing is, there is no single "it" in this program. Each time "multiprocessing" creates a new process, that process gets its own copy of the global mylist list. Each process thus appends to its own list, and that copy disappears when the process exits.

This means the call to mylist.append succeeds, but it succeeds in ten different processes. When the function returns from executing in its own process, no trace of that process's list remains. The mylist in the main process stays empty, because no one ever appended to it.
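
You can make those per-process copies visible with a short sketch of my own: each child appends to its copy of mylist and reports a length of 1, while the parent's copy stays empty:


#!/usr/bin/env python3

import multiprocessing
import os

mylist = [ ]

def hello(n):
    mylist.append(os.getpid())
    print("[{0}] my copy has {1} element(s)".format(n, len(mylist)))

processes = [ ]
for i in range(3):
    p = multiprocessing.Process(target=hello, args=(i,))
    processes.append(p)
    p.start()

for one_process in processes:
    one_process.join()

print("parent's copy has {0} element(s)".format(len(mylist)))   # 0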

Queues to the Rescue

In the world of threaded programs, even when you're able to append to the global mylist variable, you shouldn't do it, because Python's data structures generally aren't thread-safe. One data structure, however, is guaranteed to be safe for this kind of sharing: the queue. For threads, that's the Queue class in the standard library's queue module; for processes, it's the similar Queue class in the multiprocessing module.

Queues are FIFOs (that is, "first in, first out"). Whoever wants to add data to a queue invokes the put method on it, and whoever wants to retrieve data invokes the get method.
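
Here's a minimal sketch of that FIFO behavior, using the thread-safe queue.Queue from the standard library:


#!/usr/bin/env python3

from queue import Queue   # the thread-safe queue for threaded programs

q = Queue()
for word in ["first", "second", "third"]:
    q.put(word)

while not q.empty():
    print(q.get())   # prints first, second, third, in that order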

Now, queues in the world of multithreaded programs prevent issues having to do with thread safety. But in the world of multiprocessing, queues do more than that: they bridge the gap between your processes, sending data back to the main process. For example:


#!/usr/bin/env python3

import multiprocessing
import time
import random
import os
from multiprocessing import Queue

q = Queue()

def hello(n):
    time.sleep(random.randint(1,3))
    q.put(os.getpid())
    print("[{0}] Hello!".format(n))

processes = [ ]
for i in range(10):
    t = multiprocessing.Process(target=hello, args=(i,))
    processes.append(t)
    t.start()

for one_process in processes:
    one_process.join()

mylist = [ ]
while not q.empty():
    mylist.append(q.get())

print("Done!")
print(len(mylist))
print(mylist)

In this version of the program, I don't create mylist until late in the game. However, I create an instance of multiprocessing.Queue very early on. That Queue instance is designed to be shared across the different processes. Moreover, it can handle any type of Python data that can be serialized with "pickle", which covers almost any Python data structure.
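
As a quick illustration of that flexibility (a sketch of my own; report is a hypothetical worker), a child process can just as easily put a dict on the queue:


#!/usr/bin/env python3

import multiprocessing
import os
import time
from multiprocessing import Queue

q = Queue()

def report():
    # anything picklable can travel through the queue, not just integers
    q.put({"pid": os.getpid(), "finished_at": time.time()})

p = multiprocessing.Process(target=report)
p.start()
p.join()

print(q.get())   # e.g., {'pid': 12345, 'finished_at': 1514764800.0}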

Back in the hello function of the main example, the call to mylist.append is replaced with a call to q.put, placing the current process' ID number on the queue. Each of the ten processes adds its own PID to the queue.

Note that this program runs in stages: first it launches ten processes, then they all do their work in parallel, and then it waits for them to complete (with join) so that it can process the results. Only then does it pull data off the queue, appending each item to mylist, and finally report the length and contents of the list.
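
The same staged pattern works for real computation. Here's a sketch of my own (assuming, like the rest of this article's listings, a Unix-style fork start method; on Windows you'd wrap the top-level code in an if __name__ == '__main__' block) in which each process computes a square and the main process sums the results:


#!/usr/bin/env python3

import multiprocessing
from multiprocessing import Queue

q = Queue()

def square(n):
    q.put(n * n)

processes = [ ]
for i in range(10):
    p = multiprocessing.Process(target=square, args=(i,))
    processes.append(p)
    p.start()

for one_process in processes:
    one_process.join()

results = [ ]
while not q.empty():
    results.append(q.get())

print("Sum of squares: {0}".format(sum(results)))   # 0+1+4+...+81 = 285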

The implementation of queues is so smooth and easy to work with that it's easy to forget they rely on some serious behind-the-scenes operating-system magic to keep things coordinated. You might think you're working with threads, but that's just the point of multiprocessing: it might feel like threads, but each task runs in a separate process. This gives you true parallelism within your program, something the GIL prevents threads from achieving.

Conclusion

Threading is easy to work with, but threads don't truly execute in parallel. The "multiprocessing" module provides an API that's almost identical to threading's, while running each task in a separate process. That doesn't paper over all of the differences between threads and processes, but it goes a long way toward keeping them manageable.

Reuven Lerner teaches Python, data science and Git to companies around the world. You can subscribe to his free, weekly "better developers" e-mail list, and learn from his books and courses at http://lerner.co.il. Reuven lives with his wife and children in Modi'in, Israel.
