# Parallel computing in Cython - threads

One of the cool things about Cython is that it supports multi-threaded code, via OpenMP. While Python allows for message passing (i.e., multiple processes) shared memory (i.e., multi-threading) is not possible due to the Global Interpreter Lock (see this earlier post).

Relative to message passing, multi-threading is fast (and has lower memory requirements). The catch is that you can run into concurrency problems: where different threads need to access the same memory locations at the same time. As such, multi-threading is best suited to performing large numbers of simple calculations: where the order in which the calculations are executed doesn’t matter.

Recently, this post received some some discussion on stackoverflow. The main thing to remember is to place this line at the top of your Cython code as per this earlier post.

The stackoverflow discussion also demonstrates how performance can improve with different load scheduling options. From my experience it is hard to predict which load scheduling method will achieve best performance and the differences often aren’t that great. However, in the below example it does make a difference, so it is probably a good idea to experiment with this in practice.

# A simple Cython example

The perfect use case is applying a function element wise over a large array. Lets say we want to apply the function $$f$$ to some array $$X$$

$$f(x) = \begin{cases} e^x & \mbox{if } x > 0.5 \\ 0 & \mbox{if } otherwise \end{cases}$$

Below I have a Python version and a Cython version of $$f$$ in the file thread_demo.pyx

Next we compile this file with a standard setup.py (like we did here). Now lets test these functions in IPython:

In this case, the Cython version is not dramatically faster than Python, because our function is simple enough to ‘vectorise’ via numpy.

Now to create a multi-threaded version of c_array_f we just need to replace range() in the first loop with the multi-treaded version prange() from cython.parallel:

Too easy. This tells the compiler to run the loop across multiple CPU cores. In this case, we have no concurrency problems: the order in which the loop is executed doesn’t matter. Now we just need to make a few changes to our setup.py file, in order to compile our code with OpenMP:

Now lets give it a go

So with four cores we get just under a 4 times speed up.

# A few more details

So far I’ve glossed over a few important details. Here are some basic things to consider (for more see the documentation).

Firstly, notice the nogil argument in prange(N, nogil=True). In order to run multi-threaded code we need to turn off the GIL. This means that you can’t have Python code inside your multi-threaded loop or compilation will fail. It also means that any functions called inside the loop need to be defined nogil, for example:

prange() takes a few other arguments including num_threads: which will default to the number of cores on your system and schedule: which has to do with load balancing. The simplest option here is ‘static’ which just breaks the loop into equal chunks. This is fine if all the steps compute in around the same time. If not, one thread may finish before the others leaving resources idle. In this case, you might try ‘dynamic’ (see the docs for detail).

The other key issue is memory management, that is working out which variables should be shared between threads and which should be private or ‘thread local’. The good thing with Cython is that all of this detail is - in true Python style - magically inferred from your code.

The basic idea is that variables that are only read from are shared, while variables assigned to are private. An important special case are ‘reduction’ variables. Any variable with an in-place operator (i.e., +=’) is automatically taken to be a reduction variable, which means that the thread local values are combined after all threads have completed. This is useful if you need to compute a sum:

Note that if we instead put Ysum = Ysum + f(X[i]) we would have got a compiler error.

# Radial Basis Function (RBF) example

Recall the problem of evaluating a RBF approximation scheme from this post. Below we try out a multi-threaded version:

For some reason this achieves a better than 4 times speed up!

# Bottom line

Of course, the real challenge is in identifying / producing code segments that are suited to multi-threading. In many cases, blindly running a loop in parallel will either fail or lead to a decrease in performance if the threads get in each others way. But for simple applications at least, running multiple threads in Cython is as easy as adding prange()` to your loops.