Painless Concurrency: The multiprocessing Module

Author: Roberto Alsina

The author has been around Python for a while, and is finally getting the hang of it.

Blog: http://lateral.netmanagers.com.ar

twitter: @ralsina

identi.ca: @ralsina

Sometimes, when you are working on a program, you run into one of the classic problems: your user interface blocks. The program is performing some long task and the window "freezes": it jams and doesn't update until the operation is over.

Sometimes we can live with it, but in general it makes the application look amateurish or badly written.

The traditional solution for this problem is making your program multithreaded, so that it runs more than one thread in parallel. You pass the expensive operation to a secondary thread, do what it takes so the application looks alive, wait until the thread ends, and move on.

Here is a toy example:

# -*- coding: utf-8 -*-
import threading
import time

def worker():
    print "Starting to work"
    time.sleep(2)
    print "Finished working"

def main():
    print "Starting the main program"
    thread = threading.Thread(target=worker)
    print "Launching thread"
    thread.start()
    print "Thread has been launched"
    
    # is_alive() is False when the thread ends.
    while thread.is_alive():
        # Here you would have the code to make the app
        # look "alive", a progress bar, or maybe just
        # keep on working as usual.
        print "The thread is still running"
        # Wait a little bit, or until the thread ends,
        # whatever's shorter.
        thread.join(.3)
    print "Program ended"

# Important: the modules should not execute code
# when they are imported
if __name__ == '__main__':
    main()

It produces this output:

$ python demo_threading_1.py
Starting the main program
Launching thread
Thread has been launched
The thread is still running
Starting to work
The thread is still running
The thread is still running
The thread is still running
The thread is still running
The thread is still running
The thread is still running
Finished working
Program ended

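It's tempting to say "threading is nice!" but... remember this was a toy example. It turns out that using threads in Python has some caveats. In CPython, the global interpreter lock (GIL) lets only one thread execute Python code at a time, so threads do not spread CPU-bound work across several cores. And because threads share their data, you need locks, which brings the risk of deadlocks.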

So, what can we do? Use processes instead of threads. Let's see an example that's suspiciously similar to the previous one:

# -*- coding: utf-8 -*-
import multiprocessing
import time

def worker():
    print "Starting to work"
    time.sleep(2)
    print "Finished working"

def main():
    print "Starting the main program"
    thread = multiprocessing.Process(target=worker)
    print "Launching thread"
    thread.start()
    print "Thread has been launched"
    
    # is_alive() is False when the process ends.
    while thread.is_alive():
        # Here you would have the code to make the app
        # look "alive", a progress bar, or maybe just
        # keep on working as usual.
        print "The thread is still running"
        # Wait a little bit, or until the thread ends,
        # whatever's shorter.
        thread.join(.3)
    print "Program ended"

# Important: the modules should not execute code
# when they are imported
if __name__ == '__main__':
    main()

Yes, the only change is import multiprocessing instead of import threading and Process instead of Thread. Now the worker function runs in a separate Python interpreter. Since they are separate processes, the operating system can run each one on a different core, so CPU-bound work may be much faster on a multi-core computer.
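One way to see that the worker really lives in another process is to print the process ID on both sides. The following is a minimal sketch, not part of the original example: the whoami helper is a made-up name, and print() is called with a single argument so the script should run on Python 2.6+ and Python 3 alike.

```python
# -*- coding: utf-8 -*-
import multiprocessing
import os

def whoami(label):
    # Hypothetical helper: describe which process runs this code.
    return "%s runs in process %d" % (label, os.getpid())

def worker():
    print(whoami("worker"))

def main():
    print(whoami("main"))
    process = multiprocessing.Process(target=worker)
    process.start()
    # Once started, the parent also knows the child's process ID.
    print("main launched process %d" % process.pid)
    process.join()

# Important: the modules should not execute code
# when they are imported
if __name__ == '__main__':
    main()
```

The two IDs reported by whoami differ. A threading version of the same script would report the same ID twice, because threads live inside a single process.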

One of those caveats is deadlocks. You may believe that with a little care, placing locks around your variables, you can avoid them. Well, no. Let's see two functions, f1 and f2, which use two variables x and y protected by the locks lock_x and lock_y.

# -*- coding: utf-8 -*-
import threading
import time

x = 4
y = 6
lock_x = threading.Lock()
lock_y = threading.Lock()

def f1():
    lock_x.acquire()
    time.sleep(2)
    lock_y.acquire()
    time.sleep(2)
    lock_x.release()
    lock_y.release()
    
def f2():
    lock_y.acquire()
    time.sleep(2)
    lock_x.acquire()
    time.sleep(2)
    lock_y.release()
    lock_x.release()

def main():
    print "Starting main program"
    thread1 = threading.Thread(target=f1)
    thread2 = threading.Thread(target=f2)
    print "Launching threads"
    thread1.start()
    thread2.start()
    thread1.join()
    thread2.join()
    print "Both threads finished"
    print "Ending program"

# Important: modules should not execute code
# when you import them.
if __name__ == '__main__':
    main()

If you run it, it hangs. All variables are protected with locks and it still hangs! What's happening is that while f1 holds lock_x and waits for lock_y, f2 holds lock_y and is waiting for lock_x. Since neither one is going to give the other what it needs, both are stuck.
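In a program this small, the usual remedy is to make every thread acquire the locks in the same order. The sketch below is not from the original program: it assumes f1 and f2 only need to read x and y, drops the sleep calls, and uses a hypothetical run_both helper to collect the results.

```python
# -*- coding: utf-8 -*-
import threading

x = 4
y = 6
lock_x = threading.Lock()
lock_y = threading.Lock()

def f1():
    # Agreed order: lock_x first, then lock_y.
    with lock_x:
        with lock_y:
            return x + y

def f2():
    # Same order as f1: whoever holds lock_x is the only
    # one who can be asking for lock_y.
    with lock_x:
        with lock_y:
            return x - y

def run_both():
    # Hypothetical helper: run f1 and f2 in two threads and
    # return their results as [f1_result, f2_result].
    results = [None, None]

    def call(index, func):
        results[index] = func()

    threads = [threading.Thread(target=call, args=(0, f1)),
               threading.Thread(target=call, args=(1, f2))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

if __name__ == '__main__':
    print("Results: %s" % run_both())
```

The catch is that the order has to be respected by every piece of code that touches both locks, which is hard to guarantee once the program grows.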

Trying to debug this sort of thing in non-trivial programs is awful, because it only happens when things occur in a given order and with a certain timing. It may happen 100% of the time on one computer and never on another which is a bit faster (or slower).

Add to that the fact that compound operations on many Python data structures (like dictionaries) are not thread-safe, so you need to protect many variables, and these scenarios become more common.

How would this work with multiprocessing? Since these are separate processes, they do not share variables, so there is no contention over them and nothing to deadlock on.

When you use multiple processes, one way to handle this example is passing around the values you need. Your functions then have no "side effects", which makes this more like functional programming in Lisp or Erlang. Example:

# -*- coding: utf-8 -*-
import multiprocessing
import time

x = 4
y = 6

def f1(x,y):
    x = x+y
    print 'F1:', x
    
def f2(x,y):
    y = x-y
    print 'F2:', y

def main():
    print "Starting main program"
    hilo1 = multiprocessing.Process(target=f1, args=(x,y))
    hilo2 = multiprocessing.Process(target=f2, args=(x,y))
    print "Launching threads"
    hilo1.start()
    hilo2.start()
    hilo1.join()
    hilo2.join()
    print "Both threads finished"
    print "Ending program"
    print "X:",x,"Y:",y

# Important: modules should not execute
# code when you import them
if __name__ == '__main__':
    main()

Why am I not using any locks? Because the x and y of f1 and f2 are not the same as in the main program. They are copies. Why would I want to lock a copy?

If there is a case where a resource needs to be accessed sequentially, multiprocessing provides locks, semaphores, etc. with the same semantics as threading.
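As a hedged illustration of those locks, the sketch below serializes access to the terminal, which is a resource the processes do share. The report function and the number of processes are invented for the example; multiprocessing.Lock is used the same way as threading.Lock.

```python
# -*- coding: utf-8 -*-
import multiprocessing
import os
import sys

def report(lock, number):
    # Only one process at a time may write, so the two lines
    # of one report are not mixed with another report.
    with lock:
        print("Report %d, part 1 (process %d)" % (number, os.getpid()))
        print("Report %d, part 2 (process %d)" % (number, os.getpid()))
        # Flush while still holding the lock, otherwise buffered
        # output could be written after the lock is released.
        sys.stdout.flush()

def main():
    lock = multiprocessing.Lock()
    processes = [multiprocessing.Process(target=report, args=(lock, n))
                 for n in range(3)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()

# Important: modules should not execute
# code when you import them
if __name__ == '__main__':
    main()
```

The order in which the three reports appear may change from run to run; the lock only ensures each report stays in one piece.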

Or you can create a process to manage that resource and pass it data via a queue or a pipe (multiprocessing.Queue or multiprocessing.Pipe) and voilà, the access is now sequential.
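A minimal sketch of that idea follows, under some assumptions: the "resource" is just a list of log lines, and the names log_keeper, producer and STOP are hypothetical. Only the keeper process ever touches the list; everybody else sends it messages through a multiprocessing.Queue.

```python
# -*- coding: utf-8 -*-
import multiprocessing

STOP = None  # Hypothetical sentinel: tells the keeper to finish.

def log_keeper(inbox, outbox):
    # The only code that touches the resource (a plain list).
    lines = []
    while True:
        item = inbox.get()
        if item is STOP:
            break
        lines.append(item)
    # Hand the collected lines back to whoever is listening.
    outbox.put(lines)

def producer(name, inbox):
    for i in range(3):
        inbox.put("%s: message %d" % (name, i))

def main():
    inbox = multiprocessing.Queue()
    outbox = multiprocessing.Queue()
    keeper = multiprocessing.Process(target=log_keeper,
                                     args=(inbox, outbox))
    keeper.start()
    producers = [multiprocessing.Process(target=producer,
                                         args=(name, inbox))
                 for name in ("p1", "p2")]
    for p in producers:
        p.start()
    for p in producers:
        p.join()
    # All producers are done, so the keeper can stop.
    inbox.put(STOP)
    lines = outbox.get()
    keeper.join()
    for line in lines:
        print(line)

# Important: modules should not execute
# code when you import them
if __name__ == '__main__':
    main()
```

Messages from p1 and p2 may arrive interleaved, but the list is only ever modified by one process, one message at a time, so no lock is needed around it.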

In general, with a little care in your program's design, multiprocessing gives you the benefits of multithreading, with the bonus of taking advantage of all your cores and avoiding some headaches.

Note:
The multiprocessing module is part of the standard library in Python 2.6 and later. For older versions, you can get the processing module via PyPI.


Last change: Thu Sep 9 17:11:29 2010.  -  This magazine is under a Creative Commons license.