Speeding Up Your Code with Parallel Computing in Python

Parallel computing is essential for handling large datasets efficiently. In this post, we explore Python's threading, multiprocessing, and joblib libraries to speed up code execution. Learn the differences between threading and multiprocessing, and understand how to use joblib for optimized parallel processing, especially with NumPy arrays.

Introduction

As data sizes continue to grow, parallel computing is becoming increasingly essential. Modern computers typically come with multiple cores, yet many programs use only a single core, leaving much of the available computational power untapped. Parallel computing allows tasks to be divided among all or a selected number of cores, leading to significant speed improvements.

Python, while a versatile language, can be slow for certain tasks. However, there are several libraries in Python designed to help perform parallel computations efficiently and to speed up even single-threaded jobs. In this post, we’ll cover the basics of Python’s parallel computing libraries such as multiprocessing, threading, and joblib. By the end of this article, you should feel more confident in leveraging parallel computing to accelerate your Python code.

Threading Module

Python includes the threading module, which enables different parts of a program to run concurrently by spawning one or more threads within the same process. These threads share memory and appear to run in parallel, though in reality, threading does not always lead to true parallelism in Python.

Many developers confuse threading with multiprocessing. The key distinction is that threading does not necessarily achieve true parallel computation because of Python’s Global Interpreter Lock (GIL). The GIL allows only one thread to execute Python bytecode at a time. Therefore, while threading may speed up certain I/O-bound tasks, it is not suited for CPU-bound tasks, as it does not take full advantage of multiple cores.
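To see the GIL in action, here is a minimal sketch (the function count_down and the loop count are ours, chosen purely for illustration): a CPU-bound counting loop takes roughly as long with four threads as it does single-threaded, because only one thread can execute Python bytecode at a time.

import time
import threading

def count_down(n=10_000_000):
    # Pure-Python CPU-bound work: the GIL serializes this across threads
    while n > 0:
        n -= 1

# Single-threaded baseline: four calls in a row
Stime = time.perf_counter()
for _ in range(4):
    count_down()
print(f"Sequential: {time.perf_counter()-Stime:.2f} seconds")

# Four threads: typically no faster (often slower) for CPU-bound work
Stime = time.perf_counter()
threads = [threading.Thread(target=count_down) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"Threaded:   {time.perf_counter()-Stime:.2f} seconds")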

On the other hand, multiprocessing overcomes this limitation by running tasks in separate processes, each with its own memory space. However, this comes with some overhead, as each process must be initialized separately.

Let’s begin by exploring an example that uses threading to improve performance:

import time
import threading

def my_awesome_function(sleepTime=0.1):
    time.sleep(sleepTime)

# Start timer
Stime = time.perf_counter()

# Create threads for multiple tasks
tasks = []
sleepTimes = [0.1, 0.2, 0.1, 0.5, 0.7, 0.9, 0.5, 0.4, 1.5, 1.3, 1.0, 0.3, 0.7, 0.6, 0.3, 0.8]
print(f"Total time of sleep: {sum(sleepTimes)} for {len(sleepTimes)} tasks")

for sleep in sleepTimes:
    t1 = threading.Thread(target=my_awesome_function, args=[sleep])
    t1.start()
    tasks.append(t1)

for task in tasks:
    task.join()

print(f"-- Finished in {time.perf_counter()-Stime:.2f} seconds --")

Output:

Total time of sleep: 9.9 for 16 tasks
-- Finished in 1.51 seconds --

As we can see, even though the cumulative sleep time is 9.9 seconds, the threaded code finishes in just over 1.5 seconds. Because all the threads sleep concurrently, the overall runtime is governed by the longest single task (1.5 seconds) rather than by the sum.

Another way to run tasks concurrently is by using the concurrent.futures.ThreadPoolExecutor class, which simplifies thread management.

import time
import concurrent.futures

def my_awesome_function(sleepTime=0.1):
    time.sleep(sleepTime)
    return f"Slept for {sleepTime} seconds"

# Start timer
Stime = time.perf_counter()

sleepTimes = [0.1, 0.2, 0.1, 0.5, 0.7, 0.9, 0.5, 0.4, 1.5, 1.3, 1.0, 0.3, 0.7, 0.6, 0.3, 0.8]
print(f"Total time of sleep: {sum(sleepTimes)} for {len(sleepTimes)} tasks")

all_results = []
with concurrent.futures.ThreadPoolExecutor() as executor:
    tasks = [executor.submit(my_awesome_function, sleep) for sleep in sleepTimes]
    for task in concurrent.futures.as_completed(tasks):
        all_results.append(task.result())

print(all_results)
print(f"Finished in {time.perf_counter()-Stime:.2f} seconds")

Output:

Finished in 1.46 seconds

This approach provides a more streamlined way of handling concurrent tasks, while also returning the results of each task.
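As a side note, if you want the results back in the same order as the inputs, executor.map is a convenient alternative to submit/as_completed. A small sketch, reusing my_awesome_function and sleepTimes from the example above:

with concurrent.futures.ThreadPoolExecutor() as executor:
    # map returns results in input order, unlike as_completed,
    # which yields tasks as they finish
    all_results = list(executor.map(my_awesome_function, sleepTimes))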

Multiprocessing Module

The multiprocessing module provides a more powerful alternative to threading by allowing parallelism through multiple processes, each running independently. This side-steps the GIL by using subprocesses, making multiprocessing ideal for CPU-bound tasks that need true parallel execution.

Here’s an example of sequential processing:

import time

def my_awesome_function(sleepTime):
    time.sleep(sleepTime)

# Start timer
Stime = time.perf_counter()

my_awesome_function(0.1)
my_awesome_function(1)
my_awesome_function(0.3)

print(f"Finished in {time.perf_counter()-Stime:.2f} seconds")

Output:

Finished in 1.40 seconds

Now, let’s parallelize the same tasks using multiprocessing:

import multiprocessing
import time

def my_awesome_function(sleepTime=0.1):
    time.sleep(sleepTime)
    print(f"Function with sleep time {sleepTime} seconds finished")

# On Windows and macOS, new processes import this module, so the
# process management must live under a main guard
if __name__ == "__main__":
    # Start timer
    Stime = time.perf_counter()

    # Create processes, passing each sleep time as an argument
    p1 = multiprocessing.Process(target=my_awesome_function, args=(0.1,))
    p2 = multiprocessing.Process(target=my_awesome_function, args=(1,))
    p3 = multiprocessing.Process(target=my_awesome_function, args=(0.3,))

    # Start processes
    p1.start()
    p2.start()
    p3.start()

    # Ensure the main script waits for processes to complete
    p1.join()
    p2.join()
    p3.join()

    print(f"Finished in {time.perf_counter()-Stime:.2f} seconds")

Output:

Function with sleep time 0.1 seconds finished
Function with sleep time 0.3 seconds finished
Function with sleep time 1 seconds finished
Finished in 1.01 seconds

As seen in the output, the three tasks now run concurrently, so the total runtime drops from 1.40 seconds to roughly the duration of the longest task.
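Managing Process objects by hand becomes tedious as the number of tasks grows. The standard library's multiprocessing.Pool distributes a list of inputs across a pool of worker processes. A minimal sketch, using a simplified my_awesome_function that returns its sleep time so the results are visible:

import multiprocessing
import time

def my_awesome_function(sleepTime=0.1):
    time.sleep(sleepTime)
    return sleepTime

if __name__ == "__main__":
    Stime = time.perf_counter()
    # Pool.map splits the inputs among the worker processes and
    # collects the results in input order
    with multiprocessing.Pool(processes=3) as pool:
        results = pool.map(my_awesome_function, [0.1, 1, 0.3])
    print(results)
    print(f"Finished in {time.perf_counter()-Stime:.2f} seconds")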

Using concurrent.futures with ProcessPoolExecutor

For a more convenient interface, we can use concurrent.futures.ProcessPoolExecutor. It manages a pool of worker processes for you and lets you submit tasks and retrieve their results asynchronously.

import time
import concurrent.futures
import numpy as np

def my_awesome_function(sleepTime=0.1):
    time.sleep(sleepTime)
    return sleepTime ** 2

# The main guard is needed here too, since the worker processes
# re-import this module on Windows and macOS
if __name__ == "__main__":
    Stime = time.perf_counter()

    sleepTimes = np.random.rand(20)
    print(f"Total sleep time: {sum(sleepTimes):.2f} for {len(sleepTimes)} tasks")

    all_results = []
    with concurrent.futures.ProcessPoolExecutor() as executor:
        tasks = [executor.submit(my_awesome_function, sleep) for sleep in sleepTimes]
        for task in concurrent.futures.as_completed(tasks):
            all_results.append(task.result())

    print(all_results)
    print(f"Finished in {time.perf_counter()-Stime:.2f} seconds")

Threading vs. Multiprocessing

Both threading and multiprocessing have their uses, but the choice depends on the nature of the task (see the sketch after this list):

  • Threading: Suitable for I/O-bound tasks that spend a lot of time waiting for external events (e.g., file reading, web scraping).
  • Multiprocessing: Ideal for CPU-bound tasks that require intensive computation, as each process runs on a separate core without the limitations of the GIL.
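Here is a minimal sketch of that rule of thumb (the busy-work function cpu_bound_work is ours, for illustration only): the same CPU-bound function run through both executors. Only the process pool should scale with the number of cores, since the thread pool is serialized by the GIL.

import time
import concurrent.futures

def cpu_bound_work(n):
    # Pure-Python arithmetic: holds the GIL for the whole call
    total = 0
    for i in range(n):
        total += i * i
    return total

if __name__ == "__main__":
    for Executor in (concurrent.futures.ThreadPoolExecutor,
                     concurrent.futures.ProcessPoolExecutor):
        Stime = time.perf_counter()
        with Executor() as executor:
            list(executor.map(cpu_bound_work, [2_000_000] * 8))
        print(f"{Executor.__name__}: {time.perf_counter()-Stime:.2f} seconds")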

Joblib Module

joblib is a Python library optimized for parallelizing tasks that involve large datasets, especially NumPy arrays. Beyond parallel execution, it can also avoid repeated computation by caching (memoizing) function results to disk.
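The disk-caching part is handled by joblib's Memory class, which stores the results of expensive function calls in a cache directory and reloads them on repeat calls. A minimal sketch (the cache path and function are arbitrary examples):

from joblib import Memory

memory = Memory("./joblib_cache", verbose=0)

@memory.cache
def expensive_function(x):
    # First call with a given x computes and caches the result;
    # subsequent calls with the same x load it from disk
    return x ** 2

expensive_function(10)  # computed
expensive_function(10)  # loaded from cache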

Here’s an example of using joblib to parallelize a task:

from joblib import Parallel, delayed
import numpy as np
import time
from multiprocessing import cpu_count

numComps = 30
numComps_j = 10

def function_to_test(i, j):
    time.sleep(0.1)  # Simulating computation
    return np.sqrt(i**2) + np.sqrt(j**2)

# Parallel computation with joblib
def joblib_comp():
    Stime = time.perf_counter()
    executor = Parallel(n_jobs=cpu_count(), backend='multiprocessing')
    tasks = (delayed(function_to_test)(i, j) for i in range(numComps) for j in range(numComps_j))
    executor(tasks)
    Etime = time.perf_counter()
    print(f"Joblib finished in {Etime - Stime:.2f} seconds")

# Sequential computation for comparison (a plain loop, no parallelism)
def sequential_comp():
    Stime = time.perf_counter()
    results = [function_to_test(i, j) for i in range(numComps) for j in range(numComps_j)]
    Etime = time.perf_counter()
    print(f"Sequential finished in {Etime - Stime:.2f} seconds")

if __name__ == "__main__":
    joblib_comp()
    sequential_comp()

Output:

Joblib finished in 2.98 seconds
Sequential finished in 30.95 seconds

As seen from the output, joblib significantly reduces computation time by utilizing all available CPU cores: the 300 tasks of roughly 0.1 seconds each take about 31 seconds sequentially but only about 3 seconds in parallel.
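Two small refinements worth knowing: Parallel returns the results as a list, and n_jobs=-1 is shorthand for using all available cores, so the computation above can be written in one expression (reusing function_to_test from the example):

results = Parallel(n_jobs=-1)(
    delayed(function_to_test)(i, j) for i in range(numComps) for j in range(numComps_j)
)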

Multiprocessing vs. mpi4py

While multiprocessing works well on shared memory systems (single machines), it does not support distributed computing across multiple nodes. For this, mpi4py can be used. We’ll cover the advantages and disadvantages of mpi4py in a future post.
