Official implementation of the ICCV 2021 paper "Conditional DETR for Fast Training Convergence".

Last update: Dec 30, 2022

Overview

Conditional DETR

This repository is an official implementation of the ICCV 2021 paper "Conditional DETR for Fast Training Convergence".

Introduction

The DETR approach applies the transformer encoder and decoder architecture to object detection and achieves promising performance. In this paper, we address a critical issue, slow training convergence, and present a conditional cross-attention mechanism for fast DETR training. Our approach is motivated by the observation that the cross-attention in DETR relies heavily on the content embeddings, while the spatial embeddings make only minor contributions. This increases the need for high-quality content embeddings and thus the training difficulty.

Our conditional DETR learns a conditional spatial query from the decoder embedding for decoder multi-head cross-attention. The benefit is that through the conditional spatial query, each cross-attention head is able to attend to a band containing a distinct region, e.g., one object extremity or a region inside the object box (Figure 1). This narrows down the spatial range for localizing the distinct regions for object classification and box regression, thus relaxing the dependence on the content embeddings and easing the training. Empirical results show that conditional DETR converges 6.7x faster for the backbones R50 and R101 and 10x faster for stronger backbones DC5-R50 and DC5-R101.
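The effect of the conditional spatial query on a single attention logit can be sketched with toy numbers. This is a minimal illustration, not the repository's code. It assumes the formulation described in the paper: the spatial query is the element-wise product of a vector predicted from the decoder embedding and the positional embedding of a reference point, and queries and keys are concatenations of a content part and a spatial part. All names (`lambda_q`, `p_s`, `c_q`, and so on) and values are hypothetical.

```python
# Toy illustration only: the vectors are hand-picked lists, not model outputs.

def dot(a, b):
    """Plain dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def conditional_spatial_query(lambda_q, p_s):
    """Element-wise product of a scaling vector and a positional embedding."""
    return [s * p for s, p in zip(lambda_q, p_s)]


# Hypothetical content and spatial parts of one query and one key.
c_q = [0.5, -1.0]      # content query (derived from the decoder embedding)
c_k = [1.0, 2.0]       # content key (derived from the encoder output)
p_s = [0.0, 1.0]       # positional embedding of the query's reference point
p_k = [0.5, 1.0]       # positional embedding of the key location
lambda_q = [2.0, 0.5]  # assumed to be predicted from the decoder embedding

p_q = conditional_spatial_query(lambda_q, p_s)

# Concatenating [content; spatial] for both query and key ...
logit_concat = dot(c_q + p_q, c_k + p_k)
# ... yields the same logit as adding separate content and spatial dot products.
logit_split = dot(c_q, c_k) + dot(p_q, p_k)
```

Because the logit separates into a content term and a spatial term, the spatial term can localize a region on its own, which is one way to read the reduced dependence on content embeddings described above.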

Model Zoo

We provide conditional DETR and conditional DETR-DC5 models. AP is computed on COCO 2017 val.

Method	Epochs	Params (M)	FLOPs (G)	AP	AP_S	AP_M	AP_L	URL
DETR-R50	500	41	86	42.0	20.5	45.8	61.1	model log
DETR-R50	50	41	86	34.8	13.9	37.3	54.4	model log
DETR-DC5-R50	500	41	187	43.3	22.5	47.3	61.1	model log
DETR-R101	500	60	152	43.5	21.0	48.0	61.8	model log
DETR-R101	50	60	152	36.9	15.5	40.6	55.6	model log
DETR-DC5-R101	500	60	253	44.9	23.7	49.5	62.3	model log
Conditional DETR-R50	50	44	90	41.0	20.6	44.3	59.3	model log
Conditional DETR-DC5-R50	50	44	195	43.7	23.9	47.6	60.1	model log
Conditional DETR-R101	50	63	156	42.8	21.7	46.6	60.9	model log
Conditional DETR-DC5-R101	50	63	262	45.0	26.1	48.9	62.8	model log

Note:

The numbers in the table are slightly different from the numbers in the paper. We re-ran some experiments when releasing the code.
"DC5" means removing the stride in the C5 stage of ResNet and adding a dilation of 2 instead.
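To see why DC5 models are more expensive, the following sketch estimates the size of the backbone feature map with and without the C5 stride. It is a rough calculation, assuming the usual ResNet layout in which every stride-2 stage halves the resolution and rounds up, and using an 800 x 1333 input as an example. The helper `feature_map_size` is hypothetical and not part of the repository.

```python
import math


def feature_map_size(height, width, output_stride):
    """Approximate backbone output size for a power-of-two output stride.

    Each stride-2 stage is assumed to halve the size, rounding up.
    """
    halvings = int(math.log2(output_stride))
    for _ in range(halvings):
        height = math.ceil(height / 2)
        width = math.ceil(width / 2)
    return height, width


# A standard ResNet C5 output has an overall stride of 32.
h32, w32 = feature_map_size(800, 1333, 32)
# Removing the stride in C5 (the DC5 variant) leaves an overall stride of 16.
h16, w16 = feature_map_size(800, 1333, 16)

tokens_r50 = h32 * w32  # number of positions fed to the transformer encoder
tokens_dc5 = h16 * w16
```

Under these assumptions the DC5 feature map has four times as many positions, which is consistent with the higher FLOPs listed for the DC5 rows in the table.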

Installation

Requirements

Python >= 3.7, CUDA >= 10.1
PyTorch >= 1.7.0, torchvision >= 0.6.1
Cython, COCOAPI, scipy, termcolor

The code is developed using Python 3.8 with PyTorch 1.7.0. First, clone the repository locally:

git clone https://github.com/Atten4Vis/ConditionalDETR.git

Then, install PyTorch and torchvision:

conda install pytorch=1.7.0 torchvision=0.8.0 cudatoolkit=10.1 -c pytorch

Install other requirements:

cd ConditionalDETR
pip install -r requirements.txt

Usage

Data preparation

Download and extract COCO 2017 train and val images with annotations from http://cocodataset.org. We expect the directory structure to be the following:

path/to/coco/
├── annotations/  # annotation json files
└── images/
    ├── train2017/    # train images
    ├── val2017/      # val images
    └── test2017/     # test images
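Before starting a run, it can help to confirm that the dataset root matches this layout. The snippet below is a small sketch using only the standard library. The function name `missing_coco_dirs` is hypothetical, and the list of directories is simply copied from the tree above; whether `test2017` is needed for training is not stated here.

```python
from pathlib import Path

# Sub-directories from the layout above, relative to the dataset root.
EXPECTED_DIRS = (
    "annotations",
    "images/train2017",
    "images/val2017",
    "images/test2017",
)


def missing_coco_dirs(coco_root):
    """Return the expected sub-directories that do not exist under coco_root."""
    root = Path(coco_root)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]


if __name__ == "__main__":
    # Replace the placeholder with the value passed to --coco_path.
    missing = missing_coco_dirs("path/to/coco")
    print("missing:", missing if missing else "none")
```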

Training

To train conditional DETR-R50 on a single node with 8 GPUs for 50 epochs, run:

bash scripts/conddetr_r50_epoch50.sh

or call the launcher directly:

python -m torch.distributed.launch \
    --nproc_per_node=8 \
    --use_env \
    main.py \
    --resume auto \
    --coco_path /path/to/coco \
    --output_dir output/conddetr_r50_epoch50

The training process takes around 30 hours on a single machine with 8 V100 cards.

Following the DETR training setting, we train conditional DETR with AdamW, setting the learning rate to 1e-4 in the transformer and 1e-5 in the backbone. Horizontal flips, scales and crops are used for augmentation. Images are rescaled to have a minimum size of 800 and a maximum size of 1333. The transformer is trained with a dropout of 0.1, and the whole model is trained with gradient clipping of 0.1.
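The two learning rates can be expressed as optimizer parameter groups. The sketch below shows one way to build them. It is not taken from `main.py`: the helper `build_param_groups` is hypothetical, and it assumes that backbone parameters can be recognized by the substring "backbone" in their names. The toy parameters are plain strings so that the example runs without PyTorch.

```python
def build_param_groups(named_params, lr_backbone=1e-5):
    """Split parameters into a default group and a backbone group.

    Assumption: backbone parameter names contain the substring "backbone".
    """
    named_params = list(named_params)
    rest = [p for n, p in named_params if "backbone" not in n]
    backbone = [p for n, p in named_params if "backbone" in n]
    return [
        {"params": rest},                         # uses the optimizer's default lr
        {"params": backbone, "lr": lr_backbone},  # overrides lr for the backbone
    ]


# Stand-in (name, parameter) pairs; with a real model, pass
# model.named_parameters() instead.
toy_params = [
    ("transformer.decoder.layers.0.weight", "w_dec"),
    ("backbone.0.body.layer4.weight", "w_bb"),
    ("class_embed.weight", "w_cls"),
]
groups = build_param_groups(toy_params)

# With PyTorch installed, the groups could be used roughly as follows
# (assuming the gradient clipping mentioned above is norm clipping):
#   optimizer = torch.optim.AdamW(groups, lr=1e-4)
#   ... after loss.backward():
#   torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.1)
```

Note that in this sketch every non-backbone parameter, including the prediction heads, falls into the 1e-4 group.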

Evaluation

To evaluate conditional DETR-R50 on COCO val with 8 GPUs, run:

python -m torch.distributed.launch \
    --nproc_per_node=8 \
    --use_env \
    main.py \
    --batch_size 2 \
    --eval \
    --resume <checkpoint.pth> \
    --coco_path /path/to/coco \
    --output_dir output/<output_path>

Note that numbers vary depending on batch size (number of images) per GPU. Non-DC5 models were trained with batch size 2, and DC5 with 1, so DC5 models show a significant drop in AP if evaluated with more than 1 image per GPU.

License

Conditional DETR is released under the Apache 2.0 license. Please see the LICENSE file for more information.

Citation

@inproceedings{meng2021-CondDETR,
  title       = {Conditional DETR for Fast Training Convergence},
  author      = {Meng, Depu and Chen, Xiaokang and Fan, Zejia and Zeng, Gang and Li, Houqiang and Yuan, Yuhui and Sun, Lei and Wang, Jingdong},
  booktitle   = {Proceedings of the IEEE International Conference on Computer Vision (ICCV)},
  year        = {2021}
}
