Getting Started
Using Bagua is similar to using other distributed training libraries like PyTorch DDP and Horovod.
Migrate from your existing single GPU training code
To use Bagua, you need to make the following changes to your training code:
First, import bagua:
import bagua.torch_api as bagua
Then initialize Bagua's process group:
# pin this process to the GPU that matches its local rank
torch.cuda.set_device(bagua.get_local_rank())
# initialize the Bagua process group
bagua.init_process_group()
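After initialization, every process knows its position in the job. As a minimal sketch (using the bagua.get_rank() and bagua.get_world_size() helpers that also appear below), you can restrict logging to the first process so messages are not printed once per GPU:

# print job information from rank 0 only to avoid duplicated output
if bagua.get_rank() == 0:
    print(f"Initialized Bagua with {bagua.get_world_size()} processes")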
Then, use PyTorch's distributed sampler for your data loader:
train_dataset = ...
test_dataset = ...

train_sampler = torch.utils.data.distributed.DistributedSampler(
    train_dataset, num_replicas=bagua.get_world_size(), rank=bagua.get_rank()
)
train_loader = torch.utils.data.DataLoader(
    train_dataset,
    batch_size=batch_size,
    shuffle=(train_sampler is None),
    sampler=train_sampler,
)
test_loader = torch.utils.data.DataLoader(test_dataset, ...)
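Note that DistributedSampler produces the same shuffling order in every epoch unless it is told which epoch it is. A small addition (standard PyTorch usage, not specific to Bagua) is to call set_epoch on the sampler at the start of each epoch:

# call set_epoch so the sampler reshuffles the data differently each epoch
for epoch in range(args.epochs):
    train_sampler.set_epoch(epoch)
    ...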
Finally, wrap your model and optimizer with Bagua by adding one line of code to your original script:
def main():
    args = parse_args()
    # define your model and optimizer
    model = MyNet().to(args.device)
    optimizer = torch.optim.SGD(model.parameters(), lr=args.lr)

    # transform to Bagua wrapper
    from bagua.torch_api.algorithms import gradient_allreduce
    model = model.with_bagua(
        [optimizer], gradient_allreduce.GradientAllReduceAlgorithm()
    )

    # train the model over the dataset
    for epoch in range(args.epochs):
        for b_idx, (inputs, targets) in enumerate(train_loader):
            inputs, targets = inputs.to(args.device), targets.to(args.device)
            outputs = model(inputs)
            loss = torch.nn.functional.cross_entropy(outputs, targets)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
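When the job writes checkpoints, it is usually enough to save from a single process. A minimal sketch, assuming a hypothetical args.save_path argument, could be appended after the training loop:

    # save the checkpoint from rank 0 only, so the processes do not all
    # write the same file; `args.save_path` is a hypothetical argument
    if bagua.get_rank() == 0:
        torch.save(model.state_dict(), args.save_path)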
More examples can be found here.
Launch job
Bagua has a built-in tool bagua.distributed.launch to launch jobs, whose usage is similar to PyTorch's torch.distributed.launch.
The following sections show how to start distributed training.
Single node multi-process training
python -m bagua.distributed.launch --nproc_per_node=8 \
your_training_script.py (--arg1 --arg2 ...)
Multi-node multi-process training (e.g. two nodes)
Method 1: run command on each node
Node 1 (IP: 192.168.1.1, with a free port 1234):
python -m bagua.distributed.launch --nproc_per_node=8 --nnodes=2 --node_rank=0 --master_addr="192.168.1.1" --master_port=1234 your_training_script.py (--arg1 --arg2 ...)
Node 2:
python -m bagua.distributed.launch --nproc_per_node=8 --nnodes=2 --node_rank=1 --master_addr="192.168.1.1" --master_port=1234 your_training_script.py (--arg1 --arg2 ...)
Tips:
If you need to do some preprocessing work, you can include it in a bash script and launch the job by adding --no_python to your command:
python -m bagua.distributed.launch --no_python --nproc_per_node=8 bash your_bash_script.sh
Method 2: run command on a single node providing a host list
If the SSH service is available with passwordless login on each node, we can launch a distributed job from a single node with a syntax similar to mpirun.
Bagua provides a program baguarun for this. For the multi-node training example above, the command to start with baguarun looks like:
baguarun --host_list NODE_1_HOST:NODE_1_SSH_PORT,NODE_2_HOST:NODE_2_SSH_PORT \
--nproc_per_node=NUM_GPUS_YOU_HAVE --master_port=1234 \
YOUR_TRAINING_SCRIPT.py (--arg1 --arg2 --arg3 and all other arguments of your training script)
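For instance, for the two-node example above, with hypothetical hostnames node1 and node2 and SSH listening on the default port 22, the command could look like:
baguarun --host_list node1:22,node2:22 --nproc_per_node=8 --master_port=1234 \
    your_training_script.py --arg1 --arg2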