ConvMAE: Masked Convolution Meets Masked Autoencoders

Last update: Jan 08, 2023

ConvMAE

Peng Gao¹, Teli Ma¹, Hongsheng Li², Jifeng Dai³, Yu Qiao¹

¹ Shanghai AI Laboratory, ² MMLab, CUHK, ³ SenseTime Research

This repo is the official implementation of ConvMAE: Masked Convolution Meets Masked Autoencoders. It currently includes code and models for the following tasks:

ImageNet Pretrain: See PRETRAIN.md.
ImageNet Finetune: See FINETUNE.md.
Object Detection: See DETECTION.md.
Semantic Segmentation: See SEGMENTATION.md.

Updates

16/May/2022

Code and models for COCO object detection and instance segmentation are available.

11/May/2022

Pretrained ConvMAE models on ImageNet-1K are available.
Code and models for ImageNet-1K finetuning and linear probing are provided.

08/May/2022

The preprint is publicly available on arXiv.

Introduction

The ConvMAE framework demonstrates that a multi-scale hybrid convolution-transformer can learn more discriminative representations via the masked auto-encoding scheme.

We present ConvMAE, a strong and efficient self-supervised framework that is easy to implement and shows outstanding performance on downstream tasks.
ConvMAE naturally generates hierarchical representations and exhibits promising performance on object detection and segmentation.
ConvMAE-Base improves the ImageNet finetuning accuracy by 1.4% compared with MAE-Base. On object detection with Mask R-CNN, ConvMAE-Base achieves 53.2 box AP and 47.1 mask AP with a 25-epoch training schedule, while MAE-Base attains 50.3 box AP and 44.9 mask AP with 100 training epochs. On ADE20K with UperNet, ConvMAE-Base surpasses MAE-Base by 3.6 mIoU (51.7 vs. 48.1).
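
The masked auto-encoding idea mentioned above can be illustrated with a small, generic sketch. It is not taken from this repository. It assumes a 224×224 input split into 16×16-pixel patches (a 14×14 token grid) and keeps 25% of the tokens visible, matching the 25% encoder ratio listed in the ImageNet-1K results table. The function names `random_patch_mask` and `upsample_mask` are hypothetical, and upsampling the token mask to finer grids is only one plausible way to keep a mask aligned across multi-scale feature maps.

```python
import random


def random_patch_mask(grid_h, grid_w, keep_ratio=0.25, seed=0):
    """Return a grid_h x grid_w mask: 1 = masked patch, 0 = visible patch."""
    rng = random.Random(seed)
    num_patches = grid_h * grid_w
    num_keep = int(num_patches * keep_ratio)
    # Pick the visible patches uniformly at random, without replacement.
    visible = set(rng.sample(range(num_patches), num_keep))
    return [
        [0 if r * grid_w + c in visible else 1 for c in range(grid_w)]
        for r in range(grid_h)
    ]


def upsample_mask(mask, factor):
    """Nearest-neighbour upsampling so a coarse mask matches a finer grid."""
    out = []
    for row in mask:
        wide = [v for v in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


# Assumed setup: 224 / 16 = 14 tokens per side at the coarsest stage.
mask_coarse = random_patch_mask(14, 14)
mask_mid = upsample_mask(mask_coarse, 2)   # 28 x 28 grid
mask_fine = upsample_mask(mask_coarse, 4)  # 56 x 56 grid

visible = sum(v == 0 for row in mask_coarse for v in row)
print(f"visible tokens: {visible} of {14 * 14}")
```

Because the finer masks are derived from the coarse one, a position that is hidden at the token level stays hidden at every resolution in this sketch.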

Pretrain on ImageNet-1K

The following table provides pretrained checkpoints and logs used in the paper.

	ConvMAE-Base
pretrained checkpoints	download
logs	download

Main Results on ImageNet-1K

Models	#Params(M)	Supervision	Encoder Ratio	Pretrain Epochs	FT acc@1(%)	LIN acc@1(%)	FT logs/weights	LIN logs/weights
BEiT	88	DALLE	100%	300	83.0	37.6	-	-
MAE	88	RGB	25%	1600	83.6	67.8	-	-
SimMIM	88	RGB	100%	800	84.0	56.7	-	-
MaskFeat	88	HOG	100%	300	83.6	N/A	-	-
data2vec	88	RGB	100%	800	84.2	N/A	-	-
ConvMAE-B	88	RGB	25%	1600	85.0	70.9	log/weight
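
As a quick check of the gains quoted in the Introduction, the snippet below recomputes the difference between ConvMAE-B and MAE from the table above. The numbers are copied from the table; the `results` dictionary and the `gain` helper are illustrative names, not part of this repository.

```python
# Top-1 accuracies (%) copied from the ImageNet-1K table above.
results = {
    "MAE": {"ft": 83.6, "lin": 67.8},
    "ConvMAE-B": {"ft": 85.0, "lin": 70.9},
}


def gain(metric, model="ConvMAE-B", baseline="MAE"):
    """Absolute accuracy difference in percentage points, rounded to 0.1."""
    return round(results[model][metric] - results[baseline][metric], 1)


print(gain("ft"))   # 1.4
print(gain("lin"))  # 3.1
```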

Main Results on COCO

Mask R-CNN

Models	Pretrain	Pretrain Epochs	Finetune Epochs	#Params(M)	FLOPs(T)	box AP	mask AP	logs/weights
Swin-B	IN21K w/ labels	300	36	109	0.7	51.4	45.4	-
Swin-L	IN21K w/ labels	300	36	218	1.1	52.4	46.2	-
MViTv2-B	IN21K w/ labels	300	36	73	0.6	53.1	47.4	-
MViTv2-L	IN21K w/ labels	300	36	239	1.3	53.6	47.5	-
Benchmarking-ViT-B	IN1K w/o labels	1600	100	118	0.9	50.4	44.9	-
Benchmarking-ViT-L	IN1K w/o labels	1600	100	340	1.9	53.3	47.2	-
ViTDet	IN1K w/o labels	1600	100	111	0.8	51.2	45.5	-
MIMDet-ViT-B	IN1K w/o labels	1600	36	127	1.1	51.5	46.0	-
MIMDet-ViT-L	IN1K w/o labels	1600	36	345	2.6	53.3	47.5	-
ConvMAE-B	IN1K w/o labels	1600	25	104	0.9	53.2	47.1	log/weight

Main Results on ADE20K

UperNet

Models	Pretrain	Pretrain Epochs	Finetune Iters	#Params(M)	FLOPs(T)	mIoU	logs/weights
DeiT-B	IN1K w/ labels	300	16K	163	0.6	45.6	-
Swin-B	IN1K w/ labels	300	16K	121	0.3	48.1	-
MoCo V3	IN1K	300	16K	163	0.6	47.3	-
DINO	IN1K	400	16K	163	0.6	47.2	-
BEiT	IN1K+DALLE	1600	16K	163	0.6	47.1	-
PeCo	IN1K	300	16K	163	0.6	46.7	-
CAE	IN1K+DALLE	800	16K	163	0.6	48.8	-
MAE	IN1K	1600	16K	163	0.6	48.1	-
ConvMAE-B	IN1K	1600	16K	153	0.6	51.7	soon

Main Results on Kinetics-400

Models	Pretrain Epochs	Finetune Epochs	#Params(M)	Top1	Top5	logs/weights
VideoMAE-B	200	100	87	77.8	-	-
VideoMAE-B	800	100	87	79.4	-	-
VideoMAE-B	1600	100	87	79.8	-	-
VideoMAE-B	1600	100 (w/ Repeated Aug)	87	80.7	94.7	-
SpatioTemporalLearner-B	800	150 (w/ Repeated Aug)	87	81.3	94.9	-
VideoConvMAE-B	200	100	86	80.1	94.3	Soon
VideoConvMAE-B	800	100	86	81.7	95.1	Soon
VideoConvMAE-B-MSD	800	100	86	82.7	95.5	Soon

Main Results on Something-Something V2

Models	Pretrain Epochs	Finetune Epochs	#Params(M)	Top1	Top5	logs/weights
VideoMAE-B	200	40	87	66.1	-	-
VideoMAE-B	800	40	87	69.3	-	-
VideoMAE-B	2400	40	87	70.3	-	-
VideoConvMAE-B	200	40	86	67.7	91.2	Soon
VideoConvMAE-B	800	40	86	69.9	92.4	Soon
VideoConvMAE-B-MSD	800	40	86	70.7	93.0	Soon

Getting Started

Prerequisites

Linux
Python 3.7+
CUDA 10.2+
GCC 5+
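
A rough way to check these prerequisites before installing anything is sketched below. It is not part of this repository, and `check_prerequisites` is a hypothetical helper. It verifies only the operating system, the Python version, and whether `gcc` and `nvcc` are on the `PATH`. It does not parse the GCC or CUDA version numbers, so those still need to be confirmed by hand.

```python
import platform
import shutil
import sys


def check_prerequisites():
    """Return a dict mapping each coarse prerequisite check to True/False."""
    return {
        "linux": platform.system() == "Linux",
        "python>=3.7": sys.version_info >= (3, 7),
        # Presence on PATH only; the version is NOT checked here.
        "gcc found": shutil.which("gcc") is not None,
        "nvcc found": shutil.which("nvcc") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'ok' if ok else 'missing'}")
```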

Training and evaluation

See PRETRAIN.md for pretraining.
See FINETUNE.md for finetuning and linear probing of pretrained models.
See DETECTION.md for using the pretrained backbone with Mask R-CNN.
See SEGMENTATION.md for using the pretrained backbone with UperNet.

Acknowledgement

The pretraining and finetuning code in this project is based on DeiT and MAE. The object detection and semantic segmentation parts are based on MIMDet and MMSegmentation, respectively. We thank the authors for their wonderful work.

License

ConvMAE is released under the MIT License.

Citation

@article{gao2022convmae,
  title={ConvMAE: Masked Convolution Meets Masked Autoencoders},
  author={Gao, Peng and Ma, Teli and Li, Hongsheng and Dai, Jifeng and Qiao, Yu},
  journal={arXiv preprint arXiv:2205.03892},
  year={2022}
}

Owner

Alpha VL Team of Shanghai AI Lab
