The pure and clear PyTorch Distributed Training Framework.

Introduction

Distribuuuu is a Distributed Classification Training Framework powered by native PyTorch.

Please check the tutorial for detailed distributed training tutorials.

For the complete training framework, please see distribuuuu.

Requirements and Usage

Dependency

  • Install PyTorch >= 1.6 (tested on 1.6, 1.7.1, 1.8, and 1.8.1); a quick sanity check is sketched below
  • Install other dependencies: pip install -r requirements.txt
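
Before going further, you can verify that the installed PyTorch supports distributed training. A minimal sanity check (illustrative, not part of distribuuuu):

# Quick dependency check: PyTorch version, CUDA, and NCCL backend availability.
import torch
import torch.distributed as dist

print(torch.__version__)              # e.g. 1.8.1
print(torch.cuda.is_available())      # True on a GPU machine
print(dist.is_available())            # PyTorch built with distributed support
print(dist.is_nccl_available())       # NCCL backend, used for multi-GPU training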

Dataset

Download the ImageNet dataset and move validation images to labeled subfolders, using the script valprep.sh.

Expected dataset structure for ILSVRC:
ILSVRC
|_ train
|  |_ n01440764
|  |_ ...
|  |_ n15075141
|_ val
|  |_ n01440764
|  |_ ...
|  |_ n15075141
|_ ...

Create a directory containing symlinks:

mkdir -p /path/to/distribuuuu/data

Symlink ILSVRC:

ln -s /path/to/ILSVRC /path/to/distribuuuu/data/ILSVRC
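
The resulting layout is a standard class-per-subfolder tree, so it can be loaded with a torchvision-style ImageFolder. A quick check (paths are placeholders; adjust them to your setup):

# Verify the dataset layout (illustrative; paths are placeholders).
from torchvision import datasets

train_set = datasets.ImageFolder("/path/to/distribuuuu/data/ILSVRC/train")
val_set = datasets.ImageFolder("/path/to/distribuuuu/data/ILSVRC/val")
print(len(train_set.classes), len(train_set))   # 1000 classes, ~1.28M images
print(len(val_set.classes), len(val_set))       # 1000 classes, 50,000 images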

Basic Usage

Single Node with one task

# 1 node, 8 GPUs
python -m torch.distributed.launch \
    --nproc_per_node=8 \
    --nnodes=1 \
    --node_rank=0 \
    --master_addr=localhost \
    --master_port=29500 \
    train_net.py --cfg config/resnet18.yaml

Distribuuuu uses yacs, an elegant and lightweight package, to define and manage system configurations. You can set up the config via a YAML file and override individual options with trailing opts on the command line. If no YAML file is provided, the default configuration is used; please check distribuuuu/config.py.

python -m torch.distributed.launch \
    --nproc_per_node=8 \
    --nnodes=1 \
    --node_rank=0 \
    --master_addr=localhost \
    --master_port=29500 \
    train_net.py --cfg config/resnet18.yaml \
    OUT_DIR /tmp \
    MODEL.SYNCBN True \
    TRAIN.BATCH_SIZE 256

# --cfg config/resnet18.yaml parse config from file
# OUT_DIR /tmp            overwrite OUT_DIR
# MODEL.SYNCBN True       overwrite MODEL.SYNCBN
# TRAIN.BATCH_SIZE 256    overwrite TRAIN.BATCH_SIZE
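
Under the hood, yacs merges the YAML file first and then the trailing opts. A minimal sketch of this pattern (the attribute names and defaults here are illustrative; the real ones live in distribuuuu/config.py):

# Illustrative yacs usage; not the framework's actual config definition.
from yacs.config import CfgNode as CN

_C = CN()
_C.OUT_DIR = "./output"          # hypothetical default
_C.MODEL = CN()
_C.MODEL.SYNCBN = False
_C.TRAIN = CN()
_C.TRAIN.BATCH_SIZE = 32

cfg = _C.clone()
cfg.merge_from_file("config/resnet18.yaml")        # --cfg: merge the YAML file
cfg.merge_from_list(["OUT_DIR", "/tmp",            # trailing opts overwrite it
                     "MODEL.SYNCBN", "True",
                     "TRAIN.BATCH_SIZE", "256"])
cfg.freeze()
print(cfg.TRAIN.BATCH_SIZE)      # 256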

Single Node with two tasks

# 1 node, 2 tasks, 4 GPUs per task (8 GPUs)
# task 1:
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch \
    --nproc_per_node=4 \
    --nnodes=2 \
    --node_rank=0 \
    --master_addr=localhost \
    --master_port=29500 \
    train_net.py --cfg config/resnet18.yaml

# task 2:
CUDA_VISIBLE_DEVICES=4,5,6,7 python -m torch.distributed.launch \
    --nproc_per_node=4 \
    --nnodes=2 \
    --node_rank=1 \
    --master_addr=localhost \
    --master_port=29500 \
    train_net.py --cfg config/resnet18.yaml

Multi-Node Training

# 2 nodes, 8 GPUs per node (16 GPUs)
# node 1:
python -m torch.distributed.launch \
    --nproc_per_node=8 \
    --nnodes=2 \
    --node_rank=0 \
    --master_addr="10.198.189.10" \
    --master_port=29500 \
    train_net.py --cfg config/resnet18.yaml

# node 2:
python -m torch.distributed.launch \
    --nproc_per_node=8 \
    --nnodes=2 \
    --node_rank=1 \
    --master_addr="10.198.189.10" \
    --master_port=29500 \
    train_net.py --cfg config/resnet18.yaml
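
In all of the launch modes above, torch.distributed.launch sets MASTER_ADDR, MASTER_PORT, RANK, and WORLD_SIZE for every process and passes each process its local rank; the training script then initializes the process group from those environment variables. A minimal sketch of that pattern (illustrative; not the actual train_net.py):

# Sketch of a script launched by torch.distributed.launch (PyTorch <= 1.8 style).
import argparse
import torch
import torch.distributed as dist
import torchvision
from torch.nn.parallel import DistributedDataParallel as DDP

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)   # injected by the launcher
args, _ = parser.parse_known_args()

torch.cuda.set_device(args.local_rank)
dist.init_process_group(backend="nccl", init_method="env://")  # reads RANK / WORLD_SIZE / MASTER_*

model = torchvision.models.resnet18().cuda()
model = DDP(model, device_ids=[args.local_rank])
print(f"rank {dist.get_rank()} of {dist.get_world_size()} ready")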

Slurm Cluster Usage

# see srun --help 
# and https://slurm.schedmd.com/ for details

# example: 64 GPUs
# batch size = 64 * 128 = 8192
# iterations per epoch = 1.28M / 8192 ≈ 156
# lr = 64 * 0.1 = 6.4

srun --partition=openai-a100 \
     -n 64 \
     --gres=gpu:8 \
     --ntasks-per-node=8 \
     --job-name=Distribuuuu \
     python -u train_net.py --cfg config/resnet18.yaml \
     TRAIN.BATCH_SIZE 128 \
     OUT_DIR ./resnet18_8192bs \
     OPTIM.BASE_LR 6.4
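
When launched with srun rather than torch.distributed.launch, the process group is typically bootstrapped from Slurm's environment variables (SLURM_PROCID, SLURM_NTASKS, SLURM_LOCALID). A hedged sketch of that pattern; the framework's actual Slurm bootstrap may differ:

# Illustrative Slurm bootstrap; rank and world size come from Slurm's environment.
import os
import torch
import torch.distributed as dist

rank = int(os.environ["SLURM_PROCID"])        # global rank, 0..63 for the 64-GPU example
world_size = int(os.environ["SLURM_NTASKS"])  # 64 in the example above
local_rank = int(os.environ["SLURM_LOCALID"]) # 0..7 within each node

torch.cuda.set_device(local_rank)
dist.init_process_group(
    backend="nccl",
    init_method="tcp://master_hostname:29500",  # placeholder master address
    rank=rank,
    world_size=world_size,
)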

Baselines

Baseline models trained by Distribuuuu:

  • We use SGD with momentum of 0.9, a half-period cosine schedule, and train for 100 epochs.
  • We use a reference learning rate of 0.1 and a weight decay of 5e-5 (1e-5 for EfficientNet).
  • The actual learning rate (Base LR) for each model is computed as (batch size / 128) * reference LR (see the short example below).
  • Only standard data augmentation techniques (RandomResizedCrop and RandomHorizontalFlip) are used.

PS: use additional tricks (more epochs, stronger data augmentation, etc.) to get better performance.
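
For example, the Base LR column in the table below follows the linear scaling rule directly:

# Linear LR scaling rule used for the baselines: (batch size / 128) * reference LR.
def base_lr(total_batch_size, reference_lr=0.1):
    return total_batch_size / 128 * reference_lr

print(base_lr(256))    # 0.2  (resnet18, 32*8 GPUs)
print(base_lr(8192))   # 6.4  (resnet18, 128*64 GPUs)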

| Arch | Params (M) | Total batch | Base LR | Acc@1 | Acc@5 | model / config |
| --- | --- | --- | --- | --- | --- | --- |
| resnet18 | 11.690 | 256 (32*8 GPUs) | 0.2 | 70.902 | 89.894 | Drive / cfg |
| resnet18 | 11.690 | 1024 (128*8 GPUs) | 0.8 | 70.994 | 89.892 | |
| resnet18 | 11.690 | 8192 (128*64 GPUs) | 6.4 | 70.165 | 89.374 | |
| resnet18 | 11.690 | 16384 (256*64 GPUs) | 12.8 | 68.766 | 88.381 | |
| efficientnet_b0 | 5.289 | 512 (64*8 GPUs) | 0.4 | 74.540 | 91.744 | Drive / cfg |
| resnet50 | 25.557 | 256 (32*8 GPUs) | 0.2 | 77.252 | 93.430 | Drive / cfg |
| botnet50 | 20.859 | 256 (32*8 GPUs) | 0.2 | 77.604 | 93.682 | Drive / cfg |
| regnetx_160 | 54.279 | 512 (64*8 GPUs) | 0.4 | 79.992 | 95.118 | Drive / cfg |
| regnety_160 | 83.590 | 512 (64*8 GPUs) | 0.4 | 80.598 | 95.090 | Drive / cfg |
| regnety_320 | 145.047 | 512 (64*8 GPUs) | 0.4 | 80.824 | 95.276 | Drive / cfg |

Zombie processes problem

Before PyTorch 1.8, torch.distributed.launch could leave zombie processes behind after a Ctrl + C. Use the following command to kill them (fairseq/issues/487):

kill $(ps aux | grep YOUR_SCRIPT.py | grep -v grep | awk '{print $2}')

PyTorch >= 1.8 is recommended, as it fixes the zombie-process issue (pytorch/pull/49305).

Acknowledgments

The provided code was adapted from:

I strongly recommend pycls, a brilliant image classification codebase adopted by a number of projects at Facebook AI Research.

Citation

@misc{bigballon2021distribuuuu,
  author = {Wei Li},
  title = {Distribuuuu: The pure and clear PyTorch Distributed Training Framework},
  howpublished = {\url{https://github.com/BIGBALLON/distribuuuu}},
  year = {2021}
}

Feel free to contact me if you have any suggestions or questions. Issues are welcome; create a PR if you find any bugs or want to contribute.
