
Boost Your Python Code with CUDA

Target your GPU easily with Numba's CUDA JIT

Image by AI (Dalle-3)

I’ve written about the Python library Numba before. Check out my earlier article via the link below:

Python on Steroids: The Numba Boost

The TL;DR of the above was that I showed how to realise a significant speed-up in your Python code using Numba. Numba is a high-performance Python library designed to optimise your code for speed. At its core, Numba is a Just-In-Time (JIT) compiler that translates a subset of Python and NumPy code into fast machine code. This process is automatic and dynamic, allowing Python developers to gain real performance improvements with minimal changes to their original Python code.

The regular Numba JIT compiler is all about optimising run time on your CPU. If you’re lucky enough to have access to a GPU, though, this article will show you how to use Numba again, this time with its CUDA JIT, to accelerate your Python code even further by running it on the GPU.

Pre-requisites

To use NVIDIA CUDA on your system, you will need a CUDA-capable NVIDIA GPU, an up-to-date NVIDIA driver and the CUDA Toolkit installed.

For comprehensive instructions, your best bet is the official CUDA Installation Guide on NVIDIA’s website.

It would also be useful to get acquainted with some terminology specific to the GPU world (a short code sketch after this list shows a couple of these terms in action). For example:

  • host: a synonym for the CPU
  • device: a synonym for the GPU
  • host memory: your system’s main memory (RAM)
  • device memory: the onboard memory on your GPU card
  • kernel: a GPU function launched by the host and executed on the device
  • device function: a GPU function executed on the device which can only be called from the device (i.e. from a kernel or another device function)
  • Streaming Multiprocessors: These are the fundamental computational units within an NVIDIA GPU architecture. They are responsible for executing the instructions of threads in parallel, making GPUs highly effective at parallel processing tasks.
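
To make the kernel and device function terms concrete, here is a minimal sketch of a kernel that calls a device function. The function names and the simple scaled-add operation are purely illustrative; they are not part of the examples later in this article.

from numba import cuda
import numpy as np

# A device function: runs on the GPU and can only be called from a
# kernel or from another device function.
@cuda.jit(device=True)
def scaled_add(a, b):
    return a + 2.0 * b

# A kernel: launched by the host, executed on the device.
@cuda.jit
def add_kernel(x, y, out):
    i = cuda.grid(1)                 # this thread's global index
    if i < out.size:                 # guard against out-of-range threads
        out[i] = scaled_add(x[i], y[i])

# Host code: move data to device memory, launch the kernel, copy back.
x = np.arange(1_000, dtype=np.float32)
y = np.ones_like(x)
d_x, d_y = cuda.to_device(x), cuda.to_device(y)
d_out = cuda.device_array_like(x)

threads_per_block = 128
blocks_per_grid = (x.size + threads_per_block - 1) // threads_per_block
add_kernel[blocks_per_grid, threads_per_block](d_x, d_y, d_out)
print(d_out.copy_to_host()[:5])      # [2. 3. 4. 5. 6.]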

Understanding the Memory Hierarchy in GPUs

To fully understand how Numba CUDA programming works, it’s worth learning the memory hierarchy and grid layout system as they apply to GPUs (a short sketch after the list below shows shared memory in use). Unlike CPUs, which have a single, unified memory space, GPUs have a hierarchical memory architecture that consists of:

  1. Registers: Small, fast on-chip memory that stores temporary results and variables.
  2. Shared Memory: A small, fast on-chip memory shared among threads within a block.
  3. Global Memory: A large, off-chip memory that stores data and program instructions.
  4. Texture Memory: A read-only memory that stores 2D arrays and is optimized for texture mapping.
  5. Constant Memory: A small, read-only memory that stores constants and is optimized for broadcasting.
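
Most Numba CUDA code only touches global memory (that is what cuda.to_device and device arrays give you), but as a rough sketch of where shared memory fits into this hierarchy, the kernel below stages a block-sized tile of data in shared memory before using it. The block size and the trivial doubling operation are illustrative assumptions, not something the later examples rely on.

from numba import cuda, float32
import numpy as np

TPB = 128  # threads per block; the shared array size must be a compile-time constant

@cuda.jit
def double_via_shared(x, out):
    tx = cuda.threadIdx.x            # index of this thread within its block
    i = cuda.grid(1)                 # global index of this thread

    # Shared memory: a small on-chip buffer visible to all threads in the block.
    tile = cuda.shared.array(shape=TPB, dtype=float32)

    if i < x.size:
        tile[tx] = x[i]              # stage the value from global memory
    cuda.syncthreads()               # wait until the whole block has loaded its tile

    if i < x.size:
        out[i] = 2.0 * tile[tx]      # read from the fast shared copy

x = np.arange(1_000, dtype=np.float32)
d_out = cuda.device_array_like(x)
blocks = (x.size + TPB - 1) // TPB
double_via_shared[blocks, TPB](x, d_out)   # Numba copies x to the device automatically
print(d_out.copy_to_host()[:4])            # [0. 2. 4. 6.]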

Understanding the Grid system in GPUs

Another very important idea to grasp is that of the Grid System. In GPU programming, the grid system is a fundamental concept that allows developers to organize and execute parallel computations on the GPU. The grid system consists of:

  1. The Grid. A 1D, 2D or 3D array of blocks.
  2. The Block. A group of threads that are executed together. Each block can contain a certain number of threads, and all threads within a block can cooperate using shared memory. Threads within a block are typically arranged in 1D, 2D, or 3D structures.
  3. The Thread. The smallest unit of execution. A thread is akin to a single instruction stream running on the GPU. Each thread performs computations on a specific portion of the data.

How the Grid Works

  • The grid can be defined in 1, 2, or 3 dimensions, depending on the problem you’re solving. For example, if you’re processing a 2D image, you might choose a 2D grid to better map the computational tasks to the data structure.
  • Each block in the grid can also be 1D, 2D, or 3D. The block dimensions define the number of threads per block.
  • When you launch a CUDA kernel, you specify the grid and block dimensions. The CUDA runtime distributes the threads across the available Streaming Multiprocessors (SMs) on the GPU. Each block is assigned to an SM, and the threads within the block are distributed among the cores of that SM.

CUDA has several built-in variables that help you determine block and thread positions on the grid. To keep things simple, let’s consider a 2D block arrangement; a short kernel sketch after the listing below shows these built-ins in use.

Image by Author
Block Location
--------------
bx = cuda.blockIdx.x   ---------> 1 in our example diagram
by = cuda.blockIdx.y   ---------> 1

Block Dimensions
----------------
bw = cuda.blockDim.x   ---------> 3
bh = cuda.blockDim.y   ---------> 3

Block thread location
---------------------
tx = cuda.threadIdx.x  ---------> 0
ty = cuda.threadIdx.y  ---------> 0

Grid thread location
--------------------
X = bw * bx + tx       ---------> 3
Y = bh * by + ty       ---------> 3

       or

X, Y = cuda.grid(2)
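
As a small sketch of those built-ins inside an actual kernel (the array size, launch configuration and per-thread operation are illustrative choices of mine), the manual calculation and cuda.grid(2) give the same coordinates:

from numba import cuda
import numpy as np

@cuda.jit
def write_global_index(out):
    # Manual calculation from the built-ins above...
    x = cuda.blockDim.x * cuda.blockIdx.x + cuda.threadIdx.x
    y = cuda.blockDim.y * cuda.blockIdx.y + cuda.threadIdx.y
    # ...which is exactly what the convenience helper returns.
    gx, gy = cuda.grid(2)            # gx == x and gy == y

    if gx < out.shape[0] and gy < out.shape[1]:
        out[gx, gy] = gx * out.shape[1] + gy   # any per-thread work goes here

out = cuda.device_array((48, 48), dtype=np.float32)
threads_per_block = (16, 16)
blocks_per_grid = (3, 3)             # 3 blocks of 16 threads cover 48 in each dimension
write_global_index[blocks_per_grid, threads_per_block](out)
print(out.copy_to_host()[0, :4])     # [0. 1. 2. 3.]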

Setting up our dev environment

Before we get to the coding, let’s set up a separate development environment for our work. I use conda for this, but you can use whatever method you know and suits you best.

# Create our test environment
(base) $ conda create -n numba_cuda python=3.11 -y
# Now activate it
(base) $ conda activate numba_cuda
(numba_cuda) $ 

Now that our environment is set up, we can install the required libraries and software.

According to the Numba requirements for CUDA programming, as I have CUDA 12 installed, I needed the following libraries:

(numba_cuda) $ conda install -c conda-forge cuda-nvcc cuda-nvrtc "cuda-version>=12.0"

I also needed these:

(numba_cuda) $ conda install numba jupyter  -y
(numba_cuda) $ pip install matplotlib

Numba CUDA in use

For our tests, I’ll repeat some of the programming snippets I used in my Numba JIT article, and we’ll see how much of an improvement we can squeeze out of converting them to use Numba CUDA.

Example 1 – Simple for loop test

Numba JIT version

from numba import jit
import time

# Decorate the function with @jit to enable JIT compilation
@jit(nopython=True)  # nopython mode is recommended for best performance
def loop_test_jit():
    result = 0.0
    # Outer loop
    for i in range(10000):
        # Inner loop
        for j in range(10000):
            # Perform a simple operation
            result += i * j * 0.1
    return result

# Call the function to allow Numba to compile it
loop_test_jit()

# Record start time
start_time = time.time()

# Call the JIT-compiled function
for i in range(5):
    result = loop_test_jit()

# Record end time
end_time = time.time()

# Calculate and print the execution time
print(f"CUDA JIT result = {result}")
print(f"Execution time: {(end_time - start_time)/5} seconds")

#
# Output  below
#
NUMBA JIT result = 249950002500000.0
Execution time: 0.09600849151611328 seconds

Recall that the first time Numba encounters a function, it takes some time to compile it before running it. Therefore, I run the function once for the compilation stage, then call it again in a loop 5 times and take the average time per run in the loop. This should give a fair comparison between run times.
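
That warm-up-then-average pattern is repeated for every benchmark in this article, so if you want to avoid copying the boilerplate each time, a small helper along these lines does the same job (this helper is my own convenience, not part of the original code):

import time

def benchmark(fn, repeats=5):
    """Call fn once to trigger JIT compilation, then return its result
    and the average wall-clock time over `repeats` timed calls."""
    fn()                              # warm-up / compilation run
    start = time.time()
    for _ in range(repeats):
        result = fn()
    elapsed = (time.time() - start) / repeats
    return result, elapsed

# Usage: result, avg_seconds = benchmark(loop_test_jit)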

The Numba CUDA version

from numba import cuda
import numpy as np
import time

# Define the number of threads that will run per block
threads_per_block = 256

# Define the CUDA kernel function
@cuda.jit
def loop_test_kernel(results):
    i = cuda.grid(1)
    # Make sure we don't go out of bounds
    if i < results.size:
        result = 0.0
        for j in range(10000):
            result += i * j * 0.1
        results[i] = result

# Main function to manage the computation
def loop_test_cuda():
    num_elements = 10000
    # calculates the number of blocks (blocks_per_grid) needed to 
    # process all num_elements with the given number of threads per block.
    blocks_per_grid = (num_elements + (threads_per_block - 1)) // threads_per_block

    # Allocate space for the results on the device (GPU)
    results = cuda.device_array(num_elements, dtype=np.float64)

    # Launch the kernel on the GPU with the required
    # number of blocks and threads
    loop_test_kernel[blocks_per_grid, threads_per_block](results)

    # Copy the results back to the host (CPU)
    results_host = results.copy_to_host()

    # Aggregate the results
    return results_host.sum()

# Warm up the CUDA kernel to allow JIT compilation
loop_test_cuda()

# Record start time
start_time = time.time()

# Call the CUDA function
for i in range(5):
    result = loop_test_cuda()

# Record end time
end_time = time.time()

# Calculate and print the execution time
print(f"NUMBA CUDA result = {result}")
print(f"Execution time: {(end_time - start_time)/5} seconds")

#
# Output  below
#
NUMBA CUDA result = 249950002500000.0
Execution time: 0.01670536994934082 seconds

Straight away, we see a 6x improvement on a piece of code that was already quick.

The CUDA code is more complex; most of that extra complexity comes from the mapping we must do when allocating the for-loop work to threads on the GPU.

I also received the following warning message when the code ran…

NumbaPerformanceWarning: Grid size 40 will likely result in 
GPU under-utilization due to low occupancy.

So, there’s scope for playing around with some of the numbers to see if the run time can be improved further. For example, the warning message disappeared when I changed the threads_per_block variable from 256 to 64. Counter-intuitively, lowering the threads per block increases the number of blocks per grid, as the quick calculation below shows.
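
To see how the choice of threads_per_block drives the number of blocks in the grid, here is the arithmetic for this example, using the same rounding-up formula as the code above:

num_elements = 10_000

for threads_per_block in (256, 128, 64):
    blocks_per_grid = (num_elements + threads_per_block - 1) // threads_per_block
    print(f"{threads_per_block} threads/block -> {blocks_per_grid} blocks")

# 256 threads/block -> 40 blocks    (the grid size the warning complained about)
# 128 threads/block -> 79 blocks
# 64 threads/block  -> 157 blocks   (more blocks to spread across the SMs)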

Example 2 – recursive functions

Numba can also speed up recursive function calls. Rather than go down the well-trodden Fibonacci road, we’ll try a similar sequence you might not have heard of before: Lucas numbers. Lucas numbers follow the same recursive pattern as Fibonacci numbers but start with different initial values: 2 and 1 instead of 0 and 1. The nth Lucas number is defined recursively as L(n) = L(n−1) + L(n−2), with base cases L(0) = 2 and L(1) = 1, so the sequence begins 2, 1, 3, 4, 7, 11, 18, …
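
For reference, the plain-Python baseline that the summary at the end of this article compares against is just this recursive definition with no decorator at all (your timing will vary, but expect it to take tens of seconds for n = 40):

# Plain Python, no Numba: the recursive definition translated directly.
def lucas(n):
    if n == 0:
        return 2
    elif n == 1:
        return 1
    return lucas(n - 1) + lucas(n - 2)

print(lucas(40))   # 228826127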

Numba JIT Version

from numba import jit
import time

# Apply Numba's JIT decorator
@jit(nopython=True)
def lucas_numba(n):
    if n == 0:
        return 2
    elif n == 1:
        return 1
    else:
        return lucas_numba(n-1) + lucas_numba(n-2)

lucas_result_numba = lucas_numba(40)  # Example input

# Timing the JIT-compiled function
start_time = time.time()
for _ in range(5):
    lucas_result_numba = lucas_numba(40)  # Example input
end_time = time.time()

print(f"Lucas number 40 with Numba: {lucas_result_numba}")
print(f"Execution time with Numba: {(end_time - start_time)/5} seconds")

# 
# Output
#

Lucas number 40 with Numba: 228826127
Execution time with Numba: 0.7562449932098388 seconds

Numba CUDA version

from numba import cuda
import numpy as np
import time

# CUDA kernel to calculate Lucas numbers
@cuda.jit
def lucas_cuda(n, result):
    i = cuda.grid(1)  # 1D grid, i represents the index in the array

    if i <= n:  # Ensure we don't go out of bounds
        if i == 0:
            result[i] = 2
        elif i == 1:
            result[i] = 1
        else:
            a = 2
            b = 1
            for j in range(2, i + 1):
                c = a + b
                a = b
                b = c
            result[i] = b

# Define the target number (40th Lucas number)
n = 40

# Allocate result array on the device
result = np.zeros(n + 1, dtype=np.int32)  # We need an array of size 41 (0-40)
result_device = cuda.to_device(result)

# Define threads per block and blocks per grid
# There's a bit of trial and error to this
threads_per_block = 128  
blocks_per_grid = (n + (threads_per_block - 1)) // threads_per_block

# Launch the CUDA kernel
start_time = time.time()
lucas_cuda[blocks_per_grid, threads_per_block](n, result_device)
# Wait till all threads are done
cuda.synchronize()
end_time = time.time()

# Copy the result back to the host
result_host = result_device.copy_to_host()

# Print the 40th Lucas number (index 40)
print(f"Lucas number for {n} with CUDA: {result_host[n]}")
print(f"Execution time with CUDA: {end_time - start_time} seconds")

#
# Output
#

Lucas number for 40 with CUDA: 228826127
Execution time with CUDA: 0.10776114463806152 seconds

Approximately a 7x speed-up on the original Numba JIT code that time. Note that the CUDA kernel computes each Lucas number with an iterative loop rather than by recursion, so it isn’t a line-for-line translation of the recursive JIT function.

Example 3 – image processing

In this test, we take an image of the Taj Mahal and convert it to greyscale. On my system, the original colour image (PNG format) was 3.7 MB in size.

Numba JIT version

from numba import jit
import time
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.image import imread

# Numba-optimized function to convert RGB to grayscale
@jit(nopython=True)
def rgb_to_grayscale_numba(rgb):
    # Preallocate the output grayscale array
    grayscale = np.zeros((rgb.shape[0], rgb.shape[1]), dtype=np.float64)

    # Loop through each pixel and apply grayscale conversion
    for i in range(rgb.shape[0]):
        for j in range(rgb.shape[1]):
            grayscale[i, j] = (0.299 * rgb[i, j, 0] + 
                               0.587 * rgb[i, j, 1] + 
                               0.114 * rgb[i, j, 2])
    return grayscale

# Load the image 
img = imread("d:/images/enlarged_taj_mahal.png")

grayscale_img_numba = rgb_to_grayscale_numba(img)

# Just timing the numba part
start_time = time.time()
for _ in range(5):
    # Convert to grayscale using Numba
    grayscale_img_numba = rgb_to_grayscale_numba(img)

print(f"Numba Execution Time: {time.time() - start_time} seconds")

# Display the original and grayscale images
plt.figure(figsize=(10, 5))

plt.subplot(1, 2, 1)
plt.imshow(img)
plt.title('Original Image')
plt.axis('off')

plt.subplot(1, 2, 2)
plt.imshow(grayscale_img_numba, cmap='gray')
plt.title('Grayscale Image with Numba JIT')
plt.axis('off')

plt.show()

The output is:-

Original image by Yury Taranik (licensed from Shutterstock)

Numba CUDA version

from numba import cuda
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.image import imread
import time

# CUDA kernel to convert RGB to grayscale
@cuda.jit
def rgb_to_grayscale_cuda(rgb, grayscale):
    i, j = cuda.grid(2)  # Get the 2D grid index for each thread

    if i < rgb.shape[0] and j < rgb.shape[1]:  # Check bounds
        grayscale[i, j] = (0.299 * rgb[i, j, 0] + 
                           0.587 * rgb[i, j, 1] + 
                           0.114 * rgb[i, j, 2])

# Load the image
img = imread("d:/images/enlarged_taj_mahal.png")

# Preallocate the output grayscale array on the host
grayscale_img = np.zeros((img.shape[0], img.shape[1]), dtype=np.float32)

# Allocate device memory for the input and output images
img_device = cuda.to_device(img)
grayscale_img_device = cuda.device_array((img.shape[0], img.shape[1]), dtype=np.float32)

# Define the threads per block and blocks per grid
threads_per_block = (16, 16)  # 16x16 threads per block is a common choice
blocks_per_grid_x = (img.shape[0] + threads_per_block[0] - 1) // threads_per_block[0]
blocks_per_grid_y = (img.shape[1] + threads_per_block[1] - 1) // threads_per_block[1]
blocks_per_grid = (blocks_per_grid_x, blocks_per_grid_y)

rgb_to_grayscale_cuda[blocks_per_grid, threads_per_block](img_device, grayscale_img_device)

# Start timing
start_time = time.time()
for _ in range(5):
    # Launch the CUDA kernel
    rgb_to_grayscale_cuda[blocks_per_grid, threads_per_block](img_device, grayscale_img_device)

# Copy the result back to the host (this also waits for the queued kernel launches to finish)
grayscale_img = grayscale_img_device.copy_to_host()

print(f"CUDA Execution Time: {time.time() - start_time} seconds")

# Display the original and grayscale images
plt.figure(figsize=(10, 5))

plt.subplot(1, 2, 1)
plt.imshow(img)
plt.title('Original Image')
plt.axis('off')

plt.subplot(1, 2, 2)
plt.imshow(grayscale_img, cmap='gray')
plt.title('Grayscale Image with NUMBA CUDA')
plt.axis('off')

plt.show()

And the output?

Original image by Yury Taranik (licensed from Shutterstock)

We only got roughly a 2x speed-up on this occasion, but it’s still pretty impressive. A simple per-pixel conversion like this is likely memory-bound rather than compute-bound, which limits how much the GPU’s parallelism can help.

Summary

In this article, I’ve described how, with little effort, you can squeeze even more performance from your Python code – if you have access to a GPU.

The above timing improvements may not seem that impressive on their own, but bear in mind that the baseline we were starting from was already a big improvement over the initial non-optimised code, thanks to Numba JIT.

For example, look at the progression in the run time of the Lucas number calculation from regular Python -> Numba JIT -> Numba CUDA:

Regular Python: 29 sec
     NUMBA JIT: 0.71 sec
    NUMBA CUDA: 0.1 sec

That’s almost a 300x speed-up on the non-optimised code.

OK, that’s all for me just now. I hope you found this article useful. If you did, please check out my profile page at this link. From there, you can see my other published stories and subscribe to get notified when I post new content.

I know times are tough and wallets constrained, but if you got real value from this article, please consider buying me a wee dram.

If you liked this content, I think you’ll find these articles interesting, too.

Need for Speed: cuDF Pandas vs. Pandas

Python’s Parallel Paradigm Shift

