Distributed Data Parallel#
This guide explains how to install the dependencies and run DDP training with TAGI. DDP distributes batches across multiple GPUs and synchronizes the update values for the means and variances (delta_mu_, delta_var_) before the parameters are updated.
Implementation details can be found in src/ddp.
                   +----------+
                   |   MPI    |
                   +----------+
                 /              \
+---------------------+   +---------------------+
|        GPU 0        |   |        GPU 1        |
|       Batch A       |   |       Batch B       |
| Forward + Backward  |   | Forward + Backward  |
|   Compute Updates   |   |   Compute Updates   |
+----------+----------+   +----------+----------+
           \                         /
            \   All-Reduce (NCCL)   /
             +---------------------+
             |     Aggregated      |
             |     delta_mu_,      |
             |     delta_var_      |
             +----------+----------+
                        |
             +----------+----------+
             |   Update Weights    |
             +---------------------+
All-Reduce aggregates the deltas (updates) from all GPUs, either summing or averaging them (average=True), before the model weights and biases are updated. The example below shows how to select summing or averaging:
from pytagi.nn import DDPSequential, DDPConfig, Linear, Sequential, ReLU
config = DDPConfig(device_ids=..., backend="nccl", rank=..., world_size=...)
model = Sequential(Linear(1, 16), ReLU(), Linear(16, 2))
ddp_model = DDPSequential(model, config, average=True)
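When the script is launched with mpirun, each process must fill in its own rank and the total world size in DDPConfig. The sketch below shows one way to obtain them with mpi4py; the GPU mapping in device_ids is an assumption for illustration, not a requirement of the API.
# Minimal sketch, assuming one process per GPU launched via mpirun.
from mpi4py import MPI

from pytagi.nn import DDPConfig, DDPSequential, Linear, ReLU, Sequential

comm = MPI.COMM_WORLD
rank = comm.Get_rank()        # index of this process: 0, 1, ...
world_size = comm.Get_size()  # total number of processes started by mpirun

# Assumption: GPUs 0..world_size-1 are used, one per process.
config = DDPConfig(
    device_ids=list(range(world_size)),
    backend="nccl",
    rank=rank,
    world_size=world_size,
)

model = Sequential(Linear(1, 16), ReLU(), Linear(16, 2))
ddp_model = DDPSequential(model, config, average=True)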
Requirements#
Ubuntu 22.04 or later
Compatible CUDA version for NCCL
Installation Steps#
1. Install MPI#
MPI is required to run and manage multiple parallel processes.
sudo apt update
sudo apt install openmpi-bin libopenmpi-dev
Verify that MPI is installed on your machine:
mpirun --version
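As a quick sanity check that MPI can launch multiple processes, the following should print the hostname twice:
mpirun -np 2 hostname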
2. Install NCCL#
NCCL handles communication between GPUs. To install it, follow the official NCCL guide.
Below is an example for CUDA 12.2. For other CUDA versions, refer to the guide for appropriate instructions.
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt update
sudo apt install libnccl2=2.25.1-1+cuda12.2 libnccl-dev=2.25.1-1+cuda12.2
Verify that NCCL is installed on your machine:
dpkg -l | grep nccl
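You can also confirm that the NCCL shared library is visible to the dynamic linker:
ldconfig -p | grep nccl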
3. Install MPI4PY#
mpi4py provides the Python bindings for MPI and is required to run the Python examples across multiple GPUs.
conda install mpi4py
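To verify that mpi4py can reach the MPI runtime, launch a short check on two processes; each should print its own rank (0 and 1):
mpirun -np 2 python -c "from mpi4py import MPI; print(MPI.COMM_WORLD.Get_rank())"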
How to Use#
Python Example#
Run CIFAR-10 training with ResNet18 on 2 GPUs:
mpirun -np 2 python -m examples.ddp_cifar_resnet
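Each process trains on its own shard of the data (Batch A on GPU 0, Batch B on GPU 1 in the diagram above). Purely as an illustration of the idea, and not the actual code of examples.ddp_cifar_resnet, one simple scheme is to give each rank every world_size-th sample:
# Hypothetical sharding helper, shown only to illustrate how batches
# can be split across ranks.
def shard_indices(num_samples: int, rank: int, world_size: int) -> list:
    """Return the dataset indices owned by this rank."""
    return list(range(rank, num_samples, world_size))

# With 8 samples and 2 GPUs:
# rank 0 -> [0, 2, 4, 6]
# rank 1 -> [1, 3, 5, 7]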
C++ Example#
Run the ResNet18 test in C++:
mpirun -np 2 build/run_tests --gtest_filter=ResNetDDPTest.ResNet_NCCL
Troubleshooting#
1. mpi4py installation issues#
If you cannot install mpi4py with pip install mpi4py, a workaround is to install it using conda:
conda install mpi4py
2. PyTorch Data Loader#
PyTorch’s DataLoader uses multiprocessing. If you stop the script with Ctrl+C, press it only once to avoid leaving zombie worker processes.
To manually kill them:
ps aux | grep ddp_cifar_resnet
Find the process ID (PID), then:
kill -9 <PID>
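Alternatively, pkill can match the script name in one step:
pkill -9 -f ddp_cifar_resnet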