Painless Concurrency: The multiprocessing Module
Author: Roberto Alsina
The author has been around Python for a while, and is finally getting the hang of it.
Blog: http://lateral.netmanagers.com.ar
Twitter: @ralsina
identi.ca: @ralsina
Sometimes when you are working on a program you run into one of the classic problems: your user interface blocks. The program is performing some long task and the window "freezes": it jams and doesn't update until the operation is over.
Sometimes we can live with it, but in general it makes the application look amateurish or badly written.
The traditional solution to this problem is to make your program multithreaded, running more than one thread in parallel. You pass the expensive operation to a secondary thread, do what it takes so the application looks alive, wait until the thread ends, and move on.
Here is a toy example:
    # -*- coding: utf-8 -*-
    import threading
    import time

    def trabajador():
        print("Starting to work")
        time.sleep(2)
        print("Finished working")

    def main():
        print("Starting the main program")
        thread = threading.Thread(target=trabajador)
        print("Launching thread")
        thread.start()
        print("Thread has been launched")
        # is_alive() is False when the thread ends.
        while thread.is_alive():
            # Here you would have the code to make the app
            # look "alive", a progress bar, or maybe just
            # keep on working as usual.
            print("The thread is still running")
            # Wait a little bit, or until the thread ends,
            # whatever's shorter.
            thread.join(.3)
        print("Program ended")

    # Important: the modules should not execute code
    # when they are imported
    if __name__ == '__main__':
        main()
It produces this output:
    $ python demo_threading_1.py
    Starting the main program
    Launching thread
    Thread has been launched
    The thread is still running
    Starting to work
    The thread is still running
    The thread is still running
    The thread is still running
    The thread is still running
    The thread is still running
    The thread is still running
    Finished working
    Program ended
It's tempting to say "threading is nice!" but... remember this was a toy example. It turns out that using threads in Python has some caveats.
You are not using multiple cores.
Since there is a global lock in the interpreter, Python instructions are executed one at a time, even when they belong to different threads.
The exception is I/O: while one thread is waiting on I/O, the interpreter can run another.
It's easy to shoot yourself in the foot.
Paraphrasing Jamie Zawinski, if when you see a problem you think "I'll fix it using threads"... now you have two problems.
There is no way to forcibly interrupt a thread! That makes it possible to lock your app in complicated ways.
It's harder to debug multithreaded apps, especially when it comes to race conditions and deadlocks.
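Since a thread can't be killed from outside, the usual workaround is to ask it politely. Here is a minimal sketch of that idea, assuming the worker can check a flag between small steps; the names stop_requested and ticks are made up for this illustration.

```python
import threading
import time

def worker(stop_requested, ticks):
    # Work in small steps and check the flag between steps.
    while not stop_requested.is_set():
        ticks.append(time.time())  # hypothetical stand-in for real work
        # wait() acts as a sleep that ends early once the flag is set.
        stop_requested.wait(0.05)

def main():
    stop_requested = threading.Event()
    ticks = []
    thread = threading.Thread(target=worker, args=(stop_requested, ticks))
    thread.start()
    time.sleep(0.2)
    # We cannot kill the thread; we can only ask it to finish.
    stop_requested.set()
    thread.join(1)
    return thread.is_alive(), len(ticks)

if __name__ == '__main__':
    still_alive, steps = main()
    print("Thread alive after stop request: %s" % still_alive)
    print("Steps completed: %d" % steps)
```

This only works if the worker cooperates. A thread stuck inside one long blocking call never gets to look at the flag.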
So, what can we do? Use processes instead of threads. Let's see an example that's suspiciously similar to the previous one:
    # -*- coding: utf-8 -*-
    import multiprocessing
    import time

    def worker():
        print("Starting to work")
        time.sleep(2)
        print("Finished working")

    def main():
        print("Starting the main program")
        thread = multiprocessing.Process(target=worker)
        print("Launching thread")
        thread.start()
        print("Thread has been launched")
        # is_alive() is False when the process ends.
        while thread.is_alive():
            # Here you would have the code to make the app
            # look "alive", a progress bar, or maybe just
            # keep on working as usual.
            print("The thread is still running")
            # Wait a little bit, or until the process ends,
            # whatever's shorter.
            thread.join(.3)
        print("Program ended")

    # Important: the modules should not execute code
    # when they are imported
    if __name__ == '__main__':
        main()
Yes, the only real change is import multiprocessing instead of import threading, and Process instead of Thread. Now the worker function runs in a separate Python interpreter. Since these are separate processes, the operating system can run each one on its own core, so CPU-heavy work may be much faster on a modern computer.
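Running the work in a separate process also helps with the "no way to forcibly interrupt" complaint: a Process has a terminate() method, which a Thread does not. The sketch below assumes a hypothetical worker, stuck_worker, that never finishes on its own.

```python
import multiprocessing
import time

def stuck_worker():
    # Hypothetical worker that never finishes on its own.
    while True:
        time.sleep(0.1)

def main():
    process = multiprocessing.Process(target=stuck_worker)
    process.start()
    # Wait a little; join() with a timeout returns either way.
    process.join(0.5)
    was_alive = process.is_alive()
    # Unlike a thread, a process can be stopped from the outside.
    process.terminate()
    process.join()
    return was_alive, process.is_alive()

if __name__ == '__main__':
    before, after = main()
    print("Alive before terminate(): %s" % before)
    print("Alive after terminate(): %s" % after)
```

Keep in mind that terminate() is abrupt: the child gets no chance to clean up, so it is a last resort rather than a normal way to finish.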
I mentioned deadlocks earlier. You may believe that with a little care, placing locks around your variables, you can avoid them. Well, no. Let's see two functions, f1 and f2, which use two variables x and y protected by the locks lock_x and lock_y.
    # -*- coding: utf-8 -*-
    import threading
    import time

    x = 4
    y = 6
    lock_x = threading.Lock()
    lock_y = threading.Lock()

    def f1():
        lock_x.acquire()
        time.sleep(2)
        lock_y.acquire()
        time.sleep(2)
        lock_x.release()
        lock_y.release()

    def f2():
        lock_y.acquire()
        time.sleep(2)
        lock_x.acquire()
        time.sleep(2)
        lock_y.release()
        lock_x.release()

    def main():
        print("Starting main program")
        thread1 = threading.Thread(target=f1)
        thread2 = threading.Thread(target=f2)
        print("Launching threads")
        thread1.start()
        thread2.start()
        thread1.join()
        thread2.join()
        print("Both threads finished")
        print("Ending program")

    # Important: modules should not execute code
    # when you import them.
    if __name__ == '__main__':
        main()
If you run it, it hangs. All variables are protected with locks and it still hangs! What's happening is that while f1 holds lock_x and waits for lock_y, f2 has acquired lock_y and is waiting for lock_x. Since neither one is going to give the other what it needs, both are stuck.
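The root of the problem is that the two functions take the locks in opposite orders. As a sketch of one common way out (not the only one), here is the same pair of functions with both taking lock_x first and lock_y second. The sleeps are shortened and the join timeout is my own addition, just so the demo cannot hang.

```python
import threading
import time

lock_x = threading.Lock()
lock_y = threading.Lock()

def f1():
    # Same order in both functions: lock_x first, then lock_y.
    with lock_x:
        time.sleep(0.1)
        with lock_y:
            time.sleep(0.1)

def f2():
    with lock_x:
        time.sleep(0.1)
        with lock_y:
            time.sleep(0.1)

def main():
    threads = [threading.Thread(target=f1), threading.Thread(target=f2)]
    for thread in threads:
        thread.start()
    for thread in threads:
        # The timeout only guards against hanging if the ordering were wrong.
        thread.join(5)
    return [thread.is_alive() for thread in threads]

if __name__ == '__main__':
    print("Threads still running: %s" % main())
```

With a fixed order, whichever thread gets lock_x first can always get lock_y too, so the other one simply waits its turn. The catch is that every piece of code touching those locks has to follow the same discipline.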
Trying to debug this sort of thing in non-trivial programs is awful, because it only happens when things occur in a given order and with a certain timing. It may happen 100% of the time on one computer and never on another which is a bit faster (or slower).
Add to that the fact that many Python data structures (like dictionaries) are not reentrant, so you need to protect many variables, and these scenarios become more common.
How would this work with multiprocessing? Since separate processes do not share variables by default, there is no contention over shared state, and no deadlocks of this kind.
When you use multiple processes, one way to handle this example is passing around the values you need. Your functions will then have no "side effects", making it more like functional programming in Lisp or Erlang. Example:
    # -*- coding: utf-8 -*-
    import multiprocessing
    import time

    x = 4
    y = 6

    def f1(x, y):
        x = x + y
        print("F1: %s" % x)

    def f2(x, y):
        y = x - y
        print("F2: %s" % y)

    def main():
        print("Starting main program")
        hilo1 = multiprocessing.Process(target=f1, args=(x, y))
        hilo2 = multiprocessing.Process(target=f2, args=(x, y))
        print("Launching threads")
        hilo1.start()
        hilo2.start()
        hilo1.join()
        hilo2.join()
        print("Both threads finished")
        print("Ending program")
        print("X: %s Y: %s" % (x, y))

    # Important: modules should not execute
    # code when you import them
    if __name__ == '__main__':
        main()
Why am I not using any locks? Because the x and y of f1 and f2 are not the same as in the main program. They are copies. Why would I want to lock a copy?
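The flip side of working on copies is that the main program never sees what the children computed. If you need the results back, one option is to have each process put them on a multiprocessing.Queue. This is a sketch under that assumption; the results parameter and the 'f1'/'f2' labels are names I made up for the illustration.

```python
import multiprocessing

def f1(x, y, results):
    # Send the result back instead of modifying anything shared.
    results.put(('f1', x + y))

def f2(x, y, results):
    results.put(('f2', x - y))

def main():
    x, y = 4, 6
    results = multiprocessing.Queue()
    processes = [
        multiprocessing.Process(target=f1, args=(x, y, results)),
        multiprocessing.Process(target=f2, args=(x, y, results)),
    ]
    for process in processes:
        process.start()
    # Read one result per process; arrival order is not guaranteed,
    # which is why each result carries a label.
    collected = dict(results.get(timeout=10) for _ in processes)
    for process in processes:
        process.join()
    return collected

if __name__ == '__main__':
    print(main())
```

The functions still have no side effects on the main program's x and y. The only thing that crosses the process boundary is the value explicitly sent through the queue.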
If there is a case where a resource needs to be accessed sequentially, multiprocessing provides locks, semaphores, etc. with the same semantics as threading.
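As a rough illustration, this sketch has four processes incrementing one shared counter, with a multiprocessing.Lock used exactly the way you would use a threading.Lock. The counter (a multiprocessing.Value) and the function add_many are assumptions of mine, chosen only to have something that really is shared.

```python
import multiprocessing

def add_many(counter, lock, times):
    for _ in range(times):
        # Same protocol as threading.Lock: acquire, work, release.
        with lock:
            counter.value += 1

def main():
    # 'i' is a C int in shared memory; lock=False because we do
    # the protecting ourselves with an explicit Lock.
    counter = multiprocessing.Value('i', 0, lock=False)
    lock = multiprocessing.Lock()
    processes = [
        multiprocessing.Process(target=add_many, args=(counter, lock, 1000))
        for _ in range(4)
    ]
    for process in processes:
        process.start()
    for process in processes:
        process.join()
    return counter.value

if __name__ == '__main__':
    print("Final count: %d" % main())
```

Without the lock, the read-add-write in counter.value += 1 could interleave between processes and lose updates, just as it can between threads.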
Alternatively, you can create a process to manage that resource and pass it data via a queue (multiprocessing's Queue or Pipe) and voilà, the access is now sequential.
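Here is a minimal sketch of that second approach, assuming the "resource" is just a list owned by one process. The names keeper, client, requests and replies, and the use of None as an end-of-work marker, are all my own choices for the example.

```python
import multiprocessing

def keeper(requests, replies):
    # The only process that ever touches the resource (a plain list here).
    log = []
    while True:
        message = requests.get()
        if message is None:  # marker meaning "no more work"
            break
        log.append(message)
    replies.put(log)

def client(name, requests):
    for i in range(3):
        requests.put('%s-%d' % (name, i))

def main():
    requests = multiprocessing.Queue()
    replies = multiprocessing.Queue()
    owner = multiprocessing.Process(target=keeper, args=(requests, replies))
    owner.start()
    clients = [
        multiprocessing.Process(target=client, args=(name, requests))
        for name in ('a', 'b')
    ]
    for process in clients:
        process.start()
    for process in clients:
        process.join()
    # All clients are done, so it is safe to tell the keeper to stop.
    requests.put(None)
    log = replies.get(timeout=10)
    owner.join()
    return log

if __name__ == '__main__':
    print(main())
```

Messages from the two clients may interleave in any order, but the keeper handles them one at a time, so the list itself never needs a lock.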
In general, with a little care in your program's design, multiprocessing has all the benefits of multithreading, with the bonus of taking advantage of your hardware and avoiding some headaches.
Note: The multiprocessing module is available as part of the standard library in Python 2.6 or later. For other versions, you can get the processing module via PyPI.