Graphics Processing Units (GPUs) have revolutionized computational tasks, making them indispensable not only for graphics rendering but also for scientific computing, artificial intelligence, and deep learning. In this article, we delve into the physical architecture of a GPU, its computational structure, and finally, illustrate how to perform computations using the PyTorch library.
First and foremost, I want to clarify that I am not an expert in computer architecture. This article is my attempt to grasp the concept of GPU architecture. If you notice any errors or have suggestions for improvement, please don’t hesitate to leave a comment below.
The Physical Architecture of a GPU
When examining a GPU’s internal structure, the centerpiece is a large chip, the GA102, built on Nvidia’s Ampere architecture. It is used in high-end GPUs such as the RTX 3080, RTX 3090, and their Ti variants. Comprising billions of transistors, this intricate design is what gives the GPU its processing power and efficiency. Let us explore the hierarchical organization of this chip:
- Graphics Processing Clusters (GPCs):
- The GA102 chip is divided into seven graphics processing clusters (GPCs).
- Streaming Multiprocessors (SMs):
- Each GPC contains 12 streaming multiprocessors (SMs).
- An SM consists of smaller units, including processing blocks (each with its own warp scheduler), CUDA cores, tensor cores, and RT cores.
- Core Composition:
- A single SM contains four processing blocks and one RT (ray tracing) core.
- Each processing block includes 32 CUDA cores and one tensor core.
- Core Distribution Across the GPU:
- The entire GPU hosts 10,752 CUDA cores, 336 tensor cores, and 84 ray tracing cores. These cores execute the bulk of the GPU’s computations.
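These totals follow directly from the per-unit counts above. As a quick sanity check, the short sketch below simply multiplies them out; the per-SM figures are taken from the list above, not from any external source.

# Back-of-the-envelope check of the GA102 core counts
gpcs = 7
sms_per_gpc = 12
blocks_per_sm = 4            # processing blocks per SM
cuda_cores_per_block = 32

total_sms = gpcs * sms_per_gpc                                        # 84
total_cuda_cores = total_sms * blocks_per_sm * cuda_cores_per_block   # 10,752
total_tensor_cores = total_sms * blocks_per_sm                        # 336 (one per processing block)
total_rt_cores = total_sms                                            # 84 (one per SM)

print(total_sms, total_cuda_cores, total_tensor_cores, total_rt_cores)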
Core Specialization
- CUDA Cores: These are fundamental processing units, designed for basic arithmetic operations like addition and multiplication. They are extensively used in graphics rendering and general-purpose computation.
- Tensor Cores: Optimized for matrix math (in particular fused matrix multiply-accumulate operations), these cores are critical for deep learning and AI tasks.
- RT Cores: Dedicated to executing ray tracing algorithms, RT cores provide photorealistic rendering by simulating how light interacts with objects.
Memory and Data Transfer
GPUs are data-intensive machines requiring a continuous feed of information. The GPU’s memory subsystem supports this hunger for data:
- The memory chips on the graphics card collectively provide a 384-bit bus with a bandwidth of roughly 1 TB/s.
- For comparison, the DRAM attached to a CPU typically uses a 64-bit bus per channel, with a maximum bandwidth of around 64 GB/s.
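As a rough illustration of where such bandwidth figures come from, peak bandwidth is approximately the bus width (in bytes) multiplied by the per-pin data rate. The 21 Gbps GDDR6X rate below is an assumption for illustration, not a figure stated above.

# Peak bandwidth ≈ (bus width in bytes) × (per-pin data rate)
bus_width_bits = 384
data_rate_gbps = 21                 # assumed GDDR6X per-pin rate, for illustration
bandwidth_gb_s = bus_width_bits / 8 * data_rate_gbps
print(f"~{bandwidth_gb_s:.0f} GB/s")  # ≈ 1008 GB/s, i.e. roughly 1 TB/s
# Applying the same formula to a single 64-bit DRAM channel yields tens of GB/s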
Example: Nvidia GA102-based GPUs
Nvidia’s GA102 architecture exemplifies the cutting-edge design of modern GPUs, as seen in models like the RTX 3080, RTX 3090, RTX 3080 Ti, and RTX 3090 Ti. These GPUs leverage the GA102 chip to deliver exceptional performance across gaming, AI, and computational tasks. For instance, the RTX 3090 Ti, the flagship product, uses a fully enabled GA102 with no defective streaming multiprocessors (all 84 are active), ensuring optimal performance for demanding applications.
To grasp the scale and complexity of this architecture, consider this: a single CUDA core in the GA102 chip comprises approximately 400,000 transistors, underscoring the immense engineering sophistication that enables the GPU’s unparalleled computational prowess.
GPU Architecture on Apple Silicon
Apple’s GPU architecture is structured hierarchically to optimize parallel processing. At the top level, the Graphics Processing Unit (GPU) comprises multiple GPU cores, each analogous to one of Nvidia’s streaming multiprocessors. Each GPU core contains 16 execution units, and within each execution unit are 8 Arithmetic Logic Units (ALUs). For instance, a MacBook Pro with an M2 Max chip featuring 30 GPU cores has a total of 480 execution units (30 cores × 16 execution units per core) and 3,840 ALUs (480 execution units × 8 ALUs per unit). The ALUs are analogous to Nvidia’s CUDA cores.
Computational Architecture: Mapping Physical Components to Functionality
The GPU’s computational model exemplifies an embarrassingly parallel system, where independent tasks are executed simultaneously with minimal need for intercommunication. This design efficiently maps computational tasks onto physical hardware components:
- Threads:
- Each CUDA core executes a single thread, the smallest unit of computation. These threads operate independently, making them ideal for parallel processing.
- Warps:
- Threads are grouped into units of 32, called warps. Each warp executes the same instruction sequence across its threads in lockstep, ensuring high throughput when execution paths are uniform.
- Thread Blocks:
- Warps are further organized into thread blocks, managed by a single streaming multiprocessor (SM). Each thread block operates independently, avoiding interdependencies with other thread blocks.
- Grids:
- Thread blocks are grouped into grids, which span across the entire GPU, enabling the GPU to scale computations across thousands of CUDA cores effortlessly.
- GigaThread Engine:
- This engine dynamically schedules thread blocks across available SMs, ensuring efficient utilization of computational resources and maintaining an embarrassingly parallel workflow.
This hierarchical structure—from threads to grids—is optimized for massive parallelism, enabling thousands of computations to run concurrently. Importantly, GPUs extend the traditional Single Instruction Multiple Data (SIMD) model into the more flexible Single Instruction Multiple Threads (SIMT) architecture. In SIMT, threads within a warp can handle divergent execution paths while maintaining overall efficiency, further enhancing the GPU’s suitability for tasks with complex parallel workloads.
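To make the thread/block/grid mapping concrete, here is a minimal sketch of a CUDA kernel launch written with the Numba library. This is my own illustrative choice: Numba is not used elsewhere in this article and the snippet requires an Nvidia GPU. Each thread handles one array element, threads are grouped into blocks of 32 (one warp), and the grid covers the whole array.

from numba import cuda
import numpy as np

@cuda.jit
def add_one(x):
    i = cuda.grid(1)              # global thread index within the grid
    if i < x.size:                # guard threads that fall past the end of the array
        x[i] += 1.0

data = np.zeros(1024, dtype=np.float32)
d_data = cuda.to_device(data)     # copy the array to GPU memory

threads_per_block = 32            # one warp per block, for illustration
blocks_per_grid = (data.size + threads_per_block - 1) // threads_per_block

add_one[blocks_per_grid, threads_per_block](d_data)   # launch the grid of thread blocks
print(d_data.copy_to_host()[:4])  # [1. 1. 1. 1.]

The GigaThread engine and the SM schedulers decide where those blocks actually run; the programmer only specifies the grid and block dimensions.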
Example: Performing Computations on a GPU Using PyTorch
Let’s explore how to harness the computational power of a GPU using the PyTorch library. PyTorch seamlessly integrates with CUDA-enabled GPUs, enabling users to leverage their computational capabilities for deep learning and numerical tasks.
The example below was run on Apple Silicon, but it should work without any issues on Nvidia GPUs as well.
Step 1: Setting Up the Environment
Ensure that PyTorch is installed with GPU support. You can verify this using:
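For instance, the following prints the installed PyTorch version and which GPU backends are visible; the exact output naturally depends on your machine.

import torch

print("PyTorch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())           # Nvidia GPUs
print("MPS available:", torch.backends.mps.is_available())    # Apple Silicon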
Step 2: Moving Tensors to GPU
To utilize the GPU, tensors must be explicitly moved to GPU memory. Here is an example:
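A minimal sketch, assuming the 10000 × 10000 matrices used later in this article:

import torch

# Pick the best available device
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")

# Create two large tensors on the CPU, then move them to the device
matrix_size = 10000
a = torch.randn(matrix_size, matrix_size)
b = torch.randn(matrix_size, matrix_size)
a_device = a.to(device)
b_device = b.to(device)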
You may notice that this cell takes a significant amount of time to execute. The delay is primarily due to two factors: the creation of large tensors and the transfer of data to the target device. Generating 10000 x 10000 tensors with random values involves allocating significant memory and computing 100 million random values per tensor, which is computationally intensive on the CPU. Once created, transferring these large tensors (approximately 400 MB each at float32 precision) to the GPU incurs additional overhead, either across the PCIe bus (for CUDA) or through the unified memory system (for MPS). These combined steps account for the noticeable time taken during initialization.
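If you want to verify that footprint yourself, a tensor's size in bytes can be computed directly from its element count and element size:

# 10000 × 10000 float32 values, 4 bytes each ≈ 400 MB per tensor
size_mb = a.nelement() * a.element_size() / 1e6
print(f"{size_mb:.0f} MB per tensor")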
Step 3: Performing GPU Computation
Once tensors are on the GPU, operations on them are automatically accelerated:
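For instance, multiplying the two device tensors created in the previous step runs entirely on the GPU:

# Both operands live in device memory, so the matmul executes on the GPU
result_device = torch.matmul(a_device, b_device)
print(result_device.device)   # e.g. cuda:0 or mps:0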
Step 4: Measuring GPU Performance
To assess the GPU’s performance, you can use PyTorch’s built-in tools:
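One way to time the operation is sketched below, reusing the device and tensors from Step 2. Because GPU work is launched asynchronously, the device should be synchronized before reading the clock; otherwise the measured time mostly reflects the kernel launch rather than the computation.

import time

start = time.perf_counter()
result_device = torch.matmul(a_device, b_device)
# Wait for the GPU to finish before stopping the timer
if device.type == "cuda":
    torch.cuda.synchronize()
elif device.type == "mps":
    torch.mps.synchronize()   # available in recent PyTorch versions
end = time.perf_counter()
print(f"{device.type.upper()} time: {end - start:.3f} seconds")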
Complete Code
import time
import torch

# Check if CUDA (Nvidia GPU) is available
if torch.cuda.is_available():
    print("CUDA GPU Available:", torch.cuda.get_device_name(0))
# Check if MPS (Apple Silicon) is available
elif torch.backends.mps.is_available():
    print("MPS (Apple Silicon GPU) Available")
else:
    print("No supported GPU backend available")

# Create tensors on the CPU
matrix_size = 10000
a = torch.randn(matrix_size, matrix_size)
b = torch.randn(matrix_size, matrix_size)

# Detect the device
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")

# Move tensors to the detected device
a_device = a.to(device)
b_device = b.to(device)

# Perform computation on CPU
start = time.time()
result_cpu = torch.matmul(a, b)
end = time.time()
print("CPU Time:", end - start, "seconds")

# Perform computation on the detected device
start = time.time()
result_device = torch.matmul(a_device, b_device)
# GPU kernels are launched asynchronously; wait for them to finish before stopping the clock
if device.type == "cuda":
    torch.cuda.synchronize()
elif device.type == "mps":
    torch.mps.synchronize()
end = time.time()
print(f"{device.type.upper()} Time:", end - start, "seconds")
Conclusion
GPUs are marvels of engineering, combining billions of transistors and an intricately designed architecture to deliver unparalleled computational power. By understanding their physical and computational architecture, we can better appreciate how they achieve such efficiency. Libraries like PyTorch make it accessible for researchers and developers to tap into this power for a wide range of applications, from deep learning to high-performance numerical simulations.