Introduction
As data sizes continue to grow, parallel computing is becoming increasingly essential. Modern computers typically come with multiple cores, yet many programs use only a single core, leaving much of the available computational power untapped. Parallel computing allows tasks to be divided among all or a selected number of cores, leading to significant speed improvements.
Python, while a versatile language, can be slow for certain tasks. However, several Python libraries are designed to perform parallel computations efficiently and to speed up even single-threaded jobs. In this post, we'll cover the basics of Python's parallel computing libraries such as multiprocessing, threading, and joblib. By the end of this article, you should feel more confident in leveraging parallel computing to accelerate your Python code.
Threading Module
Python includes the threading module, which enables different parts of a program to run concurrently by spawning one or more threads within the same process. These threads share memory and appear to run in parallel, though in reality, threading does not always lead to true parallelism in Python.
Many developers confuse threading with multiprocessing. The key distinction is that threading does not necessarily achieve true parallel computation because of Python’s Global Interpreter Lock (GIL). The GIL allows only one thread to execute Python bytecode at a time. Therefore, while threading may speed up certain I/O-bound tasks, it is not suited for CPU-bound tasks, as it does not take full advantage of multiple cores.
On the other hand, multiprocessing overcomes this limitation by running tasks in separate processes, each with its own memory space. However, this comes with some overhead, as each process must be initialized separately.
Let's begin by exploring an example that uses threading to improve performance:
import time
import threading

def my_awesome_function(sleepTime=0.1):
    time.sleep(sleepTime)

# Start timer
Stime = time.perf_counter()

# Create threads for multiple tasks
tasks = []
sleepTimes = [0.1, 0.2, 0.1, 0.5, 0.7, 0.9, 0.5, 0.4, 1.5, 1.3, 1.0, 0.3, 0.7, 0.6, 0.3, 0.8]
print(f"Total time of sleep: {sum(sleepTimes)} for {len(sleepTimes)} tasks")

for sleep in sleepTimes:
    t1 = threading.Thread(target=my_awesome_function, args=[sleep])
    t1.start()
    tasks.append(t1)

for task in tasks:
    task.join()

print(f"-- Finished in {time.perf_counter()-Stime:.2f} seconds --")
Output:
Total time of sleep: 9.9 for 16 tasks
-- Finished in 1.51 seconds --
As we can see, even though the cumulative sleep time is about 9.9 seconds, the threaded code finishes much faster, in just over 1.5 seconds. This is because the tasks are run concurrently, reducing the overall execution time.
Another way to run tasks concurrently is by using the concurrent.futures.ThreadPoolExecutor class, which simplifies thread management.
import time
import concurrent.futures

def my_awesome_function(sleepTime=0.1):
    time.sleep(sleepTime)
    return f"Slept for {sleepTime} seconds"

# Start timer
Stime = time.perf_counter()

sleepTimes = [0.1, 0.2, 0.1, 0.5, 0.7, 0.9, 0.5, 0.4, 1.5, 1.3, 1.0, 0.3, 0.7, 0.6, 0.3, 0.8]
print(f"Total time of sleep: {sum(sleepTimes)} for {len(sleepTimes)} tasks")

all_results = []
with concurrent.futures.ThreadPoolExecutor() as executor:
    tasks = [executor.submit(my_awesome_function, sleep) for sleep in sleepTimes]
    for task in concurrent.futures.as_completed(tasks):
        all_results.append(task.result())

print(all_results)
print(f"Finished in {time.perf_counter()-Stime:.2f} seconds")
Output:
Finished in 1.46 seconds
This approach provides a more streamlined way of handling concurrent tasks, while also returning the results of each task.
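When results are needed in the same order as the inputs, `executor.map` is a convenient alternative to `submit` with `as_completed`. A minimal sketch, reusing the same function as above:

```python
import time
import concurrent.futures

def my_awesome_function(sleepTime=0.1):
    time.sleep(sleepTime)
    return f"Slept for {sleepTime} seconds"

sleepTimes = [0.3, 0.1, 0.2]

with concurrent.futures.ThreadPoolExecutor() as executor:
    # map returns results in input order, even if tasks finish out of order
    results = list(executor.map(my_awesome_function, sleepTimes))

print(results)
# ['Slept for 0.3 seconds', 'Slept for 0.1 seconds', 'Slept for 0.2 seconds']
```

The trade-off is that `map` yields results in submission order, whereas `as_completed` hands you each result as soon as it is ready.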
Multiprocessing Module
The multiprocessing module provides a more powerful alternative to threading by allowing parallelism through multiple processes, each running independently. This sidesteps the GIL by using subprocesses, making multiprocessing ideal for CPU-bound tasks that need true parallel execution.
Here’s an example of sequential processing:
import time

def my_awesome_function(sleepTime):
    time.sleep(sleepTime)

# Start timer
Stime = time.perf_counter()

my_awesome_function(0.1)
my_awesome_function(1)
my_awesome_function(0.3)

print(f"Finished in {time.perf_counter()-Stime:.2f} seconds")
Output:
Finished in 1.40 seconds
Now, let's run tasks in parallel using multiprocessing (here each process uses the function's default sleep time):
import multiprocessing
import time

def my_awesome_function(sleepTime=0.1):
    time.sleep(sleepTime)
    print(f"Function with sleep time {sleepTime} seconds finished")

# The __main__ guard is required on platforms that spawn new interpreters
# (Windows, and macOS by default), so child processes don't re-run this block
if __name__ == "__main__":
    # Start timer
    Stime = time.perf_counter()

    # Create processes
    p1 = multiprocessing.Process(target=my_awesome_function)
    p2 = multiprocessing.Process(target=my_awesome_function)
    p3 = multiprocessing.Process(target=my_awesome_function)

    # Start processes
    p1.start()
    p2.start()
    p3.start()

    # Ensure the main script waits for processes to complete
    p1.join()
    p2.join()
    p3.join()

    print(f"Finished in {time.perf_counter()-Stime:.2f} seconds")
Output:
Function with sleep time 0.1 seconds finished
Function with sleep time 0.1 seconds finished
Function with sleep time 0.1 seconds finished
Finished in 0.11 seconds
As seen in the output, the processes run concurrently: the total runtime is close to the longest individual sleep rather than the sum of all of them.
Using concurrent.futures with ProcessPoolExecutor
For more efficient parallel processing, we can use concurrent.futures.ProcessPoolExecutor. This simplifies managing multiple processes and allows you to submit tasks and retrieve results asynchronously.
import time
import concurrent.futures
import numpy as np

def my_awesome_function(sleepTime=0.1):
    time.sleep(sleepTime)
    return sleepTime ** 2

if __name__ == "__main__":
    Stime = time.perf_counter()

    sleepTimes = np.random.rand(20)
    print(f"Total sleep time: {sum(sleepTimes)} for {len(sleepTimes)} tasks")

    all_results = []
    with concurrent.futures.ProcessPoolExecutor() as executor:
        tasks = [executor.submit(my_awesome_function, sleep) for sleep in sleepTimes]
        for task in concurrent.futures.as_completed(tasks):
            all_results.append(task.result())

    print(all_results)
    print(f"Finished in {time.perf_counter()-Stime:.2f} seconds")
Threading vs. Multiprocessing
Both threading and multiprocessing have their uses, but the choice depends on the nature of the task:
- Threading: Suitable for I/O-bound tasks that spend a lot of time waiting for external events (e.g., file reading, web scraping).
- Multiprocessing: Ideal for CPU-bound tasks that require intensive computation, as each process runs on a separate core without the limitations of the GIL.
Joblib Module
joblib is a Python library optimized for parallelizing tasks involving large datasets, especially NumPy arrays. It can also avoid repeated computation by caching (memoizing) function results to disk.
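The disk persistence mentioned above comes from joblib's Memory class, which memoizes function results in a cache directory. A minimal sketch (the cache path ./joblib_cache is an arbitrary choice for this example):

```python
from joblib import Memory
import numpy as np

# Cache results on disk: repeated calls with the same arguments are
# loaded from the cache instead of being recomputed
memory = Memory("./joblib_cache", verbose=0)

@memory.cache
def expensive_transform(x):
    return np.sqrt(x) + x ** 2

data = np.arange(5, dtype=float)
first = expensive_transform(data)   # computed and written to disk
second = expensive_transform(data)  # loaded from the disk cache
print(np.array_equal(first, second))  # True
```

The cache survives across runs of the script, which is what makes joblib attractive for long pipelines over large arrays.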
Here's an example of using joblib to parallelize a task:
from joblib import Parallel, delayed
import numpy as np
import time
from multiprocessing import cpu_count

numComps = 30
numComps_j = 10

def function_to_test(i, j):
    time.sleep(0.1)  # Simulating computation
    return np.sqrt(i**2) + np.sqrt(j**2)

# Parallel computation with joblib
def joblib_comp():
    Stime = time.perf_counter()
    executor = Parallel(n_jobs=cpu_count(), backend='multiprocessing')
    tasks = (delayed(function_to_test)(i, j) for i in range(numComps) for j in range(numComps_j))
    executor(tasks)
    Etime = time.perf_counter()
    print(f"Joblib finished in {Etime - Stime:.2f} seconds")

# Sequential computation with the same NumPy-based function, for comparison
def numpy_comp():
    Stime = time.perf_counter()
    results = [function_to_test(i, j) for i in range(numComps) for j in range(numComps_j)]
    Etime = time.perf_counter()
    print(f"Numpy finished in {Etime - Stime:.2f} seconds")

if __name__ == "__main__":
    joblib_comp()
    numpy_comp()
Output:
Joblib finished in 2.98 seconds
Numpy finished in 30.95 seconds
As seen from the output, joblib can significantly reduce computation time by utilizing all available CPU cores.
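Note that Parallel also returns the task results as a list in input order, which the timing example above discards. A minimal sketch using n_jobs=-1 (all available cores) and joblib's default backend:

```python
from joblib import Parallel, delayed

def square(x):
    return x * x

# Parallel returns the results as a list, in the same order as the inputs
results = Parallel(n_jobs=-1)(delayed(square)(i) for i in range(6))
print(results)  # [0, 1, 4, 9, 16, 25]
```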
Multiprocessing vs. mpi4py
While multiprocessing works well on shared-memory systems (single machines), it does not support distributed computing across multiple nodes. For that, mpi4py can be used. We'll cover the advantages and disadvantages of mpi4py in a future post.