This is an official implementation for "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows" on Object Detection and Instance Segmentation.

Last update: Dec 30, 2022

Overview

Swin Transformer for Object Detection

This repo contains the supported code and configuration files to reproduce object detection results of Swin Transformer. It is based on mmdetection.

Updates

04/12/2021 Initial commits

Results and Models

Mask R-CNN

Backbone	Pretrain	Lr Schd	box mAP	mask mAP	#params	FLOPs	config	log	model
Swin-T	ImageNet-1K	3x	46.0	41.6	48M	267G	config	github/baidu	github/baidu
Swin-S	ImageNet-1K	3x	48.5	43.3	69M	359G	config	github/baidu	github/baidu

Cascade Mask R-CNN

Backbone	Pretrain	Lr Schd	box mAP	mask mAP	#params	FLOPs	config	log	model
Swin-T	ImageNet-1K	3x	50.4	43.7	86M	745G	config	github/baidu	github/baidu
Swin-S	ImageNet-1K	3x	51.9	45.0	107M	838G	config	github/baidu	github/baidu
Swin-B	ImageNet-1K	3x	51.9	45.0	145M	982G	config	github/baidu	github/baidu

RepPoints V2

Backbone	Pretrain	Lr Schd	box mAP	mask mAP	#params	FLOPs
Swin-T	ImageNet-1K	3x	50.0	-	45M	283G

Mask RepPoints V2

Backbone	Pretrain	Lr Schd	box mAP	mask mAP	#params	FLOPs
Swin-T	ImageNet-1K	3x	50.3	43.6	47M	292G

Notes:

Pre-trained models can be downloaded from Swin Transformer for ImageNet Classification.
Access code for baidu is swin.

Usage

Installation

Please refer to get_started.md for installation and dataset preparation.

Inference

# single-gpu testing
python tools/test.py <CONFIG_FILE> <DET_CHECKPOINT_FILE> --eval bbox segm

# multi-gpu testing
tools/dist_test.sh <CONFIG_FILE> <DET_CHECKPOINT_FILE> <GPU_NUM> --eval bbox segm

Training

To train a detector with pre-trained models, run:

# single-gpu training
python tools/train.py <CONFIG_FILE> --cfg-options model.pretrained=<PRETRAIN_MODEL> [model.backbone.use_checkpoint=True] [other optional arguments]

# multi-gpu training
tools/dist_train.sh <CONFIG_FILE> <GPU_NUM> --cfg-options model.pretrained=<PRETRAIN_MODEL> [model.backbone.use_checkpoint=True] [other optional arguments]

For example, to train a Cascade Mask R-CNN model with a Swin-T backbone and 8 gpus, run:

tools/dist_train.sh configs/swin/cascade_mask_rcnn_swin_tiny_patch4_window7_mstrain_480-800_giou_4conv1f_adamw_3x_coco.py 8 --cfg-options model.pretrained=<PRETRAIN_MODEL>

Note: use_checkpoint is used to save GPU memory. Please refer to this page for more details.

Apex (optional):

We use apex for mixed precision training by default. To install apex, run:

git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./

If you would like to disable apex, modify the type of runner as EpochBasedRunner and comment out the following code block in the configuration files:

# do not use mmdet version fp16
fp16 = None
optimizer_config = dict(
    type="DistOptimizerHook",
    update_interval=1,
    grad_clip=None,
    coalesce=True,
    bucket_size_mb=-1,
    use_fp16=True,
)

Citing Swin Transformer

@article{liu2021Swin,
  title={Swin Transformer: Hierarchical Vision Transformer using Shifted Windows},
  author={Liu, Ze and Lin, Yutong and Cao, Yue and Hu, Han and Wei, Yixuan and Zhang, Zheng and Lin, Stephen and Guo, Baining},
  journal={arXiv preprint arXiv:2103.14030},
  year={2021}
}

This is an official implementation for "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows" on Object Detection and Instance Segmentation.

Related tags

Overview

Swin Transformer for Object Detection

Updates

Results and Models

Mask R-CNN

Cascade Mask R-CNN

RepPoints V2

Mask RepPoints V2

Usage

Installation

Inference

Training

Apex (optional):

Citing Swin Transformer

Other Links

Owner

Swin Transformer

Tools to create pixel-wise object masks, bounding box labels (2D and 3D) and 3D object model (PLY triangle mesh) for object sequences filmed with an RGB-D camera.

Prototypical python implementation of the trust-region algorithm presented in Sequential Linearization Method for Bound-Constrained Mathematical Programs with Complementarity Constraints by Larson, Leyffer, Kirches, and Manns.

Nonnegative spatial factorization for multivariate count data

Official Pytorch implementation of Scene Representation Networks: Continuous 3D-Structure-Aware Neural Scene Representations

Relative Uncertainty Learning for Facial Expression Recognition

Multiple paper open-source codes of the Microsoft Research Asia DKI group

Official PyTorch implementation of MAAD: A Model and Dataset for Attended Awareness

Self-Supervised Learning with Data Augmentations Provably Isolates Content from Style

A TikTok-like recommender system for GitHub repositories based on Gorse

Shape-Adaptive Selection and Measurement for Oriented Object Detection

Code for EMNLP'21 paper "Types of Out-of-Distribution Texts and How to Detect Them"

Code for "Layered Neural Rendering for Retiming People in Video."

Implementation of "Generalizable Neural Performer: Learning Robust Radiance Fields for Human Novel View Synthesis"

Final project code: Implementing MAE with downscaled encoders and datasets, for ESE546 FA21 at University of Pennsylvania

Official implementation for the paper "Attentive Prototypes for Source-free Unsupervised Domain Adaptive 3D Object Detection"

Pytorch implementation of Learning Rate Dropout.

Tensorflow Tutorials using Jupyter Notebook

[CVPR'21] Projecting Your View Attentively: Monocular Road Scene Layout Estimation via Cross-view Transformation

Occlusion robust 3D face reconstruction model in CFR-GAN (WACV 2022)

Source code for Zalo AI 2021 submission