Post

Speed up Python applications With Ctypes Library

There are multiple ways to speed up computations in Python. The cython language compiles Python-like syntax into CPython extensions. Libraries such as numpy provides methods to manipulate large arrays and matrices efficiently with underlying C data structures.

In this post, I will be discussing the ctypes module. It provides C-compatible data types to so that Python functions can use C-compiled shared libraries. Therefore, we can offload computationally intensive modules of a Python application into C where the developers will have more fine-grained control. To my surprise, this comes as part of the Python standard library, so no external dependencies are required!

Code to optimize - prime number checker

I have created a sample program that we can speed up afterwards using the ctypes module. The num_primes() calculates the total number of primes in a list by looping through each item.

1
2
3
4
5
6
7
8
9
10
11
12
# prime.py
def is_prime(num: int):
    for i in range(2, int(num**(0.5))):
        if num % i == 0:
            return 1
    return 0

def num_primes(num_list: List[int]):
    count = 0
    for num in num_list:
        count += is_prime(num)
    return count

Let’s see the number of primes in a list of 1 million integers. Note that we use consecutive numbers for the example but it does not have to be.

1
2
3
4
5
6
7
8
9
from prime import num_primes

MAX_NUM = 1000000
num_list = list(range(MAX_NUM))

def timeit_function():
    return num_primes(num_list)

print(timeit_function())

It takes around 3.4 seconds to run. How can we speed this up?

1
2
3
>>> python -m timeit -n 5 -s 'import test_python as t' 't.timeit_function()'
Primes: 921295
5 loops, best of 5: 3.4 sec per loop

Why Python threading module does not work

As Python has a threading module, one idea is to parallalize calculations across the entire list by using multiple threads. However, this is not possible due to Python’s Global Interpreter Lock (GIL), which prevents multiple threads in a process from executing Python bytecode at the same time. Hence for non-I/O operations, there will not be any speed up.

Rewriting prime checker in C

The prime checker is reimplemented in C as shown below, then compiled it into a shared library prime.so. Note that the program logic is exactly the same.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
// prime.c
#include <stdio.h>
#include <math.h>

int is_prime(int num) {
    for (int i=2; i<(int)sqrt(num); i++) {
        if (num % i == 0)
            return 1;
    }
    return 0;
} 

int num_primes(int arrSize, int *numArr) {
    int count = 0;
    for (int i=0; i<arrSize; i++)
        count += is_prime(numArr[i]);
    return count;
}

Calling the C-compiled prime checker with ctypes

ctypes

The ctypes library provides C-compatible data types in Python. All we need to do is load the shared library with CDLL() API and then declare the parameters/ return types accordingly with argtypes and restype attributes.

1
2
3
4
5
6
7
8
from ctypes import *

# Load the shared library
lib = CDLL("./libprime.so")
# Declare the return data as 32-bit int
lib.num_primes.restype = c_int32
# Declare the arguments as a 32-bit int and a pointer for 32-bit int (for list)
lib.num_primes.argtypes = [c_int32, POINTER(c_int32)]

Afterwards, the num_primes() in the shared library can be called! Note that the num_list has to be converted from Python list into a contiguous array of C with a method provided by ctypes.

1
2
3
4
5
6
7
8
MAX_NUM = 1000000
num_list = list(range(MAX_NUM))

def timeit_function():
    # num_list is converted into an integer array of size MAX_NUM
    return lib.num_primes(MAX_NUM, (c_int32 * MAX_NUM)(*num_list))

print(f"Primes: {timeit_function()}")

For the same input of 1 million integers, the speed up is significant just by offloading the same program logic to C code. It makes sense because contiguous arrays in C can leverage caching mechanisms better than lists in Python.

1
2
3
>>> python -m timeit -n 5 -s 'import test_ctypes as t' 't.timeit_function()'
Primes: 921295
5 loops, best of 5: 482 msec per loop

Multithreading in the C shared library with POSIX pthreads

There is one more benefit of offloading the work to C. Since the shared library is not under Python’s GIL, we can now use multithreading in C to parallelize the computations!

In the code below, the integer array is split evenly into 4 subarrays and 4 threads are spawned with POSIX pthreads to do parallel work. Each thread runs thread_function() to check the numbers in the array without any overlap between threads. The counts of prime numbers are added into countByThreads array which are summed up after the child threads have terminated.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
#define NUM_THREADS 4                       // 4 threads used

// Global variables for spawn threads to access
int *gArrSize = 0;                          // Ptr for array size
int *gNumArr = 0;                           // Ptr for input array
int countByThreads[NUM_THREADS] = { 0 };    // Prime counts of each thread
pthread_t tids[NUM_THREADS] = { 0 };        // IDs of each thread

// Function run by each thread
void *thread_function(void *vargp) {
    // Each thread has a different offset
    int offset = (*(int*) vargp);
    int count = 0;
    // Split the array items evenly across threads
    for (int i=offset; i < *gArrSize; i+=NUM_THREADS)
        count += is_prime(gNumArr[i]);
    countByThreads[offset] += count;
    free(vargp);
}

int num_primes(int arrSize, int *numArr) {
    gArrSize = &arrSize;
    gNumArr = numArr;
    for(int i=0; i < NUM_THREADS; i++) {
        int *offset = (int*) malloc(sizeof(int));
        *offset = i;
        if(pthread_create(&tids[i], NULL, thread_function, (void *) offset) == -1)
            exit(1);
    }
    int count = 0;
    for(int i=0; i < NUM_THREADS; i++) {
        if(pthread_join(tids[i], NULL) == -1)
            exit(1);
        // Combine counts from each thread after termination
        count += countByThreads[i];
        countByThreads[i] = 0;
    }
    return count;
}

We have further sped up the code execution although there is an additional overhead of managing threads.

1
2
3
>>> python -m timeit -n 5 -s 'import test_ctypes_pthread as t' 't.timeit_function()'
Primes: 921295
5 loops, best of 5: 322 msec per loop

Calling the C-compiled prime checker with Python threading

Remember the threading module in Python just now? Another neat thing about ctypes is that the program releases the GIL as long as the execution is inside the C-compiled shared library. So instead of POSIX pthreads in C, we can generate the threads with threading instead!

1
2
3
4
5
6
7
8
from ctypes import *

# Load the shared library
lib = CDLL("./libprime.so")
# Declare the return data as 32-bit integer
lib.num_primes.restype = c_int32
# Declare the arguments as a 32-bit integer & a pointer for 32-bit integer (for list)
lib.num_primes.argtypes = [c_int32, POINTER(c_int32)]

Afterwards, the num_primes() in the shared library can be called! Note that the num_list has to be converted from Python list into a contiguous array with a method provided by ctypes.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
MAX_NUM = 1000000
NUM_THREADS = 4

# Prime counts per thread
count_list = [0 for _ in range(NUM_THREADS)]
# One list of numbers for each thread
num_list_list = []

# Split the list for multiple threads
for i in range(NUM_THREADS):
    num_list = list(range(i, MAX_NUM, NUM_THREADS))
    num_list_list.append(num_list)

# Function run by each thread
def thread_function(i, num_list, count_list):
    len_num_list = len(num_list)
    count_list[i] = lib.num_primes(len_num_list, (c_int32 * len_num_list)(*num_list))

def timeit_function():
    threads = []
    for i in range(NUM_THREADS):
        t = threading.Thread(target=thread_function, args=(i, num_list_list[i], count_list))
        t.start()
        threads.append(t)
    for thread in threads:
        thread.join()
    return sum(count_list) # Combine counts from each thread
    
print(f"Primes: {timeit_function()}")

For this example, the speed up is comparable to using pthreads.

1
2
3
>>> python -m timeit -n 5 -s 'import test_ctypes_threading as t' 't.timeit_function()'
Primes: 921295
5 loops, best of 5: 313 msec per loop

The code demonstrations can be found here.

Resources

  1. Ctypes Made Easy by Dublin User Group
  2. Bypassing GIL with ctypes by Christopher Swenson
  3. Python ctypes documentation
This post is licensed under CC BY 4.0 by the author.