PoolFormer: MetaFormer is Actually What You Need for Vision

Overview

PoolFormer: MetaFormer is Actually What You Need for Vision (arXiv)

This is a PyTorch implementation of PoolFormer proposed by our paper "MetaFormer is Actually What You Need for Vision".

MetaFormer

Figure 1: MetaFormer and performance of MetaFormer-based models on ImageNet-1K validation set. We argue that the competence of transformer/MLP-like models primarily stems from the general architecture MetaFormer instead of the equipped specific token mixers. To demonstrate this, we exploit an embarrassingly simple non-parametric operator, pooling, to conduct extremely basic token mixing. Surprisingly, the resulted model PoolFormer consistently outperforms the DeiT and ResMLP as shown in (b), which well supports that MetaFormer is actually what we need to achieve competitive performance.

PoolFormer Figure 2: (a) The overall framework of PoolFormer. (b) The architecture of PoolFormer block. Compared with transformer block, it replaces attention with an extremely simple non-parametric operator, pooling, to conduct only basic token mixing.

Bibtex

@article{yu2021metaformer,
  title={MetaFormer is Actually What You Need for Vision},
  author={Yu, Weihao and Luo, Mi and Zhou, Pan and Si, Chenyang and Zhou, Yichen and Wang, Xinchao and Feng, Jiashi and Yan, Shuicheng},
  journal={arXiv preprint arXiv:2111.11418},
  year={2021}
}

1. Requirements

For Image Classification (Configs of detection and segmentation will be available soon)

torch>=1.7.0; torchvision>=0.8.0; pyyaml; apex-amp (if you want to use fp16); timm (pip install git+https://github.com/rwightman/p[email protected])

data prepare: ImageNet with the following folder structure, you can extract ImageNet by this script.

│imagenet/
├──train/
│  ├── n01440764
│  │   ├── n01440764_10026.JPEG
│  │   ├── n01440764_10027.JPEG
│  │   ├── ......
│  ├── ......
├──val/
│  ├── n01440764
│  │   ├── ILSVRC2012_val_00000293.JPEG
│  │   ├── ILSVRC2012_val_00002138.JPEG
│  │   ├── ......
│  ├── ......

Directory structure in this repo:

│poolformer/
├──misc/
├──models/
│  ├── __init__.py
│  ├── poolformer.py
├──LICENSE
├──README.md
├──distributed_train.sh
├──train.py
├──validate.py

2. PoolFormer Models

Model #params Image resolution Top1 Acc Download
poolformer_s12 12M 224 77.2 here
poolformer_s24 21M 224 80.3 here
poolformer_s36 31M 224 81.4 here
poolformer_m36 56M 224 82.1 here
poolformer_m48 73M 224 82.5 here

All the pretrained models can also be downloaded by BaiDu Yun (password: esac).

Update ResNet Scores in the paper

Updated_ResNet_Scores

[1] He et al., "Deep Residual Learning for Image Recognition", CVPR 2016.

[2] Wightman et al., "Resnet strikes back: An improved training procedure in timm", arXiv preprint arXiv:2110.00476. 2021 Oct 1.

Usage

We also provide a Colab notebook which run the steps to perform inference with poolformer.

3. Validation

To evaluate our PoolFormer models, run:

MODEL=poolformer_s12 # poolformer_{s12, s24, s36, m36, m48}
python3 validate.py /path/to/imagenet  --model $MODEL \
  --checkpoint /path/to/checkpoint -b 128

4. Train

We show how to train PoolFormers on 8 GPUs. The relation between learning rate and batch size is lr=bs/1024*1e-3. For convenience, assuming the batch size is 1024, then the learning rate is set as 1e-3 (for batch size of 1024, setting the learning rate as 2e-3 sometimes sees better performance).

MODEL=poolformer_s12 # poolformer_{s12, s24, s36, m36, m48}
DROP_PATH=0.1 # drop path rates [0.1, 0.1, 0.2, 0.3, 0.4] responding to model [s12, s24, s36, m36, m48]
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 ./distributed_train.sh 8 /path/to/imagenet \
  --model $MODEL -b 128 --lr 1e-3 --drop-path $DROP_PATH --apex-amp

5. Acknowledgment

Our implementation is mainly based on the following codebases. We gratefully thank the authors for their wonderful works.

pytorch-image-models, mmdetection, mmsegmentation.

Besides, Weihao Yu would like to thank TPU Research Cloud (TRC) program for the support of partial computational resources.

LICENSE

This repo is under the Apache-2.0 license. For commercial use, please contact the authors.

Comments
  • About Normalization

    About Normalization

    Hi, thanks for your excellent work. In your ablation studies (section 4.4), you compared Group Normalization (group number is set as 1 for simplicity), Layer Normalization, and Batch Normalization. The conclusion is that Group Normalization is 0.7% or 0.8% higher than Layer Normalization or Batch Normalization. But when the number of groups is 1, Group Normalization is equivalent to Layer Normalization, right?

    opened by tinyalpha 10
  • Addition of the Organization on HuggingFace Transformers

    Addition of the Organization on HuggingFace Transformers

    Hello PoolFormer team!

    I have been working on porting the implementation of PoolFormer to HuggingFace Transformers library (you can see my PR here) and I was wondering if I can go ahead and add Sea AI labs as an organization to the HuggingFace models hub.

    This will allow all model checkpoints to be uploaded onto the hub as well as model cards, etc.

    Kind regards, Tanay Mehta

    opened by heytanay 7
  • How to measure MACs?

    How to measure MACs?

    Hi, thanks for your nice work :) I also watched your presentation record through this conference.

    I want to apply the poolformer for my work, can I ask how did you measure the MACs of the architecture introduced in your paper? Or if you were not bothered, I want to ask if I could be shared your measurement code.

    opened by DoranLyong 5
  • why use use_layer_scale

    why use use_layer_scale

    thanks for your great contribution! in the implement for poolformerblock ,there is a layer_scale after token_mixer. What is the impact of this operation?

    opened by rtfgithub 5
  • Invitation of making PR for OpenMMLab / MMSegmentation.

    Invitation of making PR for OpenMMLab / MMSegmentation.

    Hi, first congrats for acceptance of CVPR'2022. This work deserves because it is very great.

    I am a member of OpenMMLab and mainly work for developing MMSegmentation. I think if it supported officially, many more people would use it for benchmark, which would promote research in computer vision area.

    Would you like to make PR for openmmlab? We could discuss together to refactor your code and use our own GPUs to train & re-implement.

    I think it is pretty cool because it would make more reseachers and community members use this excellent work! Here is our re-implementing work: ConvNeXt.

    We do hope PoolFormer could also be added as backbones in our codebase so that many researchers could use directly it for downstream tasks.

    Looking forward to your reply!

    Best,

    opened by MengzhangLI 5
  • why the speed slower than pvtv2-b1?

    why the speed slower than pvtv2-b1?

    Recently I trained a transformer based instance seg model, tested with different backbone, here is the result and speed test:

    image

    batchsize is training batchsize. Why the speed of poolformer is the slowest one? is that normal?

    Slower than pvtv2-b1 and precision less than it...

    opened by jinfagang 5
  • Checkpoints of the Ablation study

    Checkpoints of the Ablation study

    Hi, thanks for your amazing work. I am reading the Tab 6, and I am surprised because the method is so simple and very effective, especially when the Pooling is replaced with Identity Mapping. Top1 74.3 on ImageNet-1k with only Conv1x1 and Norm layer. I am thrilled... Can you release this checkpoint so that we can verify. Thanks again. image

    opened by chuong98 5
  • Design on positional embedding?

    Design on positional embedding?

    Hello authors,

    I appreciate a lot your current work, which inspired the community. I am here to raise a very simple and quick question after checking the code and architecture design.

    I observed that network using pooling, MLP or identical as token mixer, you do not include positional embedding, while you consider this component only when you use MHA. What is the concern of this design and why other models do not rely on this embedding?

    Best,

    discussion 
    opened by jizongFox 4
  • Error: About self.pool(x)

    Error: About self.pool(x)

    Hello, I am more interested in the poolformer you proposed, but an error occurred during the use of PoolFormerBlock, as follows: Traceback (most recent call last): File "train.py", line 545, in train(hyp, opt, device, tb_writer) File "train.py", line 89, in train model = Model(opt.cfg or ckpt['model'].yaml, ch=3, nc=nc, anchors=hyp.get('anchors')).to(device) # create File "E:\Work\yolov5\models\yolo.py", line 106, in init m.stride = torch.tensor([s / x.shape[-2] for x in self.forward(torch.zeros(1, ch, s, s))]) # forward File "E:\Work\yolov5\models\yolo.py", line 138, in forward return self.forward_once(x, profile) # single-scale inference, train File "E:\Work\yolov5\models\yolo.py", line 157, in forward_once x = m(x) # run # 执行网络组件操作 File "C:\conda\conda\envs\torch17\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl result = self.forward(*input, **kwargs) File "E:\Work\yolov5_T23\models\common.py", line 194, in forward n = self.token_mixer(m) File "C:\conda\conda\envs\torch17\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl result = self.forward(*input, **kwargs) File "E:\Work\yolov5_T23\models\Confor_VC.py", line 93, in forward x1 = self.pool(x) - x # x1 = self.pool(x) - x File "C:\conda\conda\envs\torch17\lib\site-packages\torch\nn\modules\module.py", line 727, in _call_impl result = self.forward(*input, **kwargs) File "C:\conda\conda\envs\torch17\lib\site-packages\torch\nn\modules\pooling.py", line 594, in forward return F.avg_pool2d(input, self.kernel_size, self.stride, TypeError: avg_pool2d(): argument 'kernel_size' (position 2) must be tuple of ints, not bool

    I want to put the poolformer behind a ConvBlock and the above problem occurred。 thank you!

    opened by QY1994-0919 4
  • About MLN(Modified Layer Normalization)

    About MLN(Modified Layer Normalization)

    This paper provides new perspectives about Transformer block, but I have some questions about one of the details. As far as I know, the LayerNorm officially provided by Pytorch implements the same function as the MLN, which computes the mean and variance along token and channel dimensions. So where is the improvement? image The official example : #Image Example N, C, H, W = 20, 5, 10, 10 input = torch.randn(N, C, H, W) #Normalize over the last three dimensions (i.e. the channel and spatial dimensions) #as shown in the image below layer_norm = nn.LayerNorm([C, H, W]) output = layer_norm(input)

    opened by youngtboy 3
  • How to achieve the grad-CAM visualization?

    How to achieve the grad-CAM visualization?

    Thanks for your awesome work and for sharing them all.

    I found out that the pictures in the supplement paper are beautiful, and I want to follow this.

    Could you share the code for this? or can tell me how to achieve the grad-CAM activation map?

    opened by DoranLyong 3
  • CVE-2007-4559 Patch

    CVE-2007-4559 Patch

    Patching CVE-2007-4559

    Hi, we are security researchers from the Advanced Research Center at Trellix. We have began a campaign to patch a widespread bug named CVE-2007-4559. CVE-2007-4559 is a 15 year old bug in the Python tarfile package. By using extract() or extractall() on a tarfile object without sanitizing input, a maliciously crafted .tar file could perform a directory path traversal attack. We found at least one unsantized extractall() in your codebase and are providing a patch for you via pull request. The patch essentially checks to see if all tarfile members will be extracted safely and throws an exception otherwise. We encourage you to use this patch or your own solution to secure against CVE-2007-4559. Further technical information about the vulnerability can be found in this blog.

    If you have further questions you may contact us through this projects lead researcher Kasimir Schulz.

    opened by TrellixVulnTeam 0
  • On the use of Apex AMP and hybrid stages

    On the use of Apex AMP and hybrid stages

    Is there a specific reason why you used Apex AMP instead of the native AMP provided by PyTorch? Have you tried native AMP?

    I tried to train poolformer_s12 and poolformer_s24 with solo-learn; with native fp16 the loss goes to nan after a few epochs, while with fp32 it works fine. Did you experience similar behavior?

    On a side note, can you provide the implementation and the hyperparameters for the hybrid stage [Pool, Pool, Attention, Attention]? It seems very interesting!

    discussion 
    opened by DonkeyShot21 6
  • Can I say PoolFormer is just a non-trainable MLP-like module?

    Can I say PoolFormer is just a non-trainable MLP-like module?

    Hi! Thanks for sharing the great work! I have some questions about PoolFormer. If I explain PoolFormer like the following attachments, can I say PoolFormer is just a non-trainable MLP-like model?

    image image

    discussion 
    opened by 072jiajia 8
  • About subtract in pooling

    About subtract in pooling

    Hi, thank you for publishing such a nice paper. I just have one question. I do not understand the subtraction of the input in eqn.4. Is it necessary? What will happen if we just do the average pooling without substrating the input?

    discussion 
    opened by Dong-Huo 16
Owner
Sea AI Lab
Sea AI Lab
GNPy: Optical Route Planning and DWDM Network Optimization

GNPy is an open-source, community-developed library for building route planning and optimization tools in real-world mesh optical networks

Telecom Infra Project 140 Dec 19, 2022
Scene-Text-Detection-and-Recognition (Pytorch)

Scene-Text-Detection-and-Recognition (Pytorch) Competition URL: https://tbrain.t

Gi-Luen Huang 9 Jan 02, 2023
PyTorch implementation of Memory-based semantic segmentation for off-road unstructured natural environments.

MemSeg: Memory-based semantic segmentation for off-road unstructured natural environments Introduction This repository is a PyTorch implementation of

11 Nov 28, 2022
Large-scale Hyperspectral Image Clustering Using Contrastive Learning, CIKM 21 Workshop

Spectral-spatial contrastive clustering (SSCC) Yaoming Cai, Yan Liu, Zijia Zhang, Zhihua Cai, and Xiaobo Liu, Large-scale Hyperspectral Image Clusteri

Yaoming Cai 4 Nov 02, 2022
A Web API for automatic background removal using Deep Learning. App is made using Flask and deployed on Heroku.

Automatic_Background_Remover A Web API for automatic background removal using Deep Learning. App is made using Flask and deployed on Heroku. 👉 https:

Gaurav 16 Oct 29, 2022
[NeurIPS 2021] ORL: Unsupervised Object-Level Representation Learning from Scene Images

Unsupervised Object-Level Representation Learning from Scene Images This repository contains the official PyTorch implementation of the ORL algorithm

Jiahao Xie 55 Dec 03, 2022
Happywhale - Whale and Dolphin Identification Silver🥈 Solution (26/1588)

Kaggle-Happywhale Happywhale - Whale and Dolphin Identification Silver 🥈 Solution (26/1588) 竞赛方案思路 图像数据预处理-标志性特征图片裁剪:首先根据开源的标注数据训练YOLOv5x6目标检测模型,将训练集

Franxx 20 Nov 14, 2022
Exploration-Exploitation Dilemma Solving Methods

Exploration-Exploitation Dilemma Solving Methods Medium article for this repo - HERE In ths repo I implemented two techniques for tackling mentioned t

Aman Mishra 6 Jan 25, 2022
Inflated i3d network with inception backbone, weights transfered from tensorflow

I3D models transfered from Tensorflow to PyTorch This repo contains several scripts that allow to transfer the weights from the tensorflow implementat

Yana 479 Dec 08, 2022
Official implementation for the paper: Multi-label Classification with Partial Annotations using Class-aware Selective Loss

Multi-label Classification with Partial Annotations using Class-aware Selective Loss Paper | Pretrained models Official PyTorch Implementation Emanuel

99 Dec 27, 2022
Radar-to-Lidar: Heterogeneous Place Recognition via Joint Learning

radar-to-lidar-place-recognition This page is the coder of a pre-print, implemented by PyTorch. If you have some questions on this project, please fee

Huan Yin 37 Oct 09, 2022
Object-aware Contrastive Learning for Debiased Scene Representation

Object-aware Contrastive Learning Official PyTorch implementation of "Object-aware Contrastive Learning for Debiased Scene Representation" by Sangwoo

43 Dec 14, 2022
Code for the SIGGRAPH 2021 paper "Consistent Depth of Moving Objects in Video".

Consistent Depth of Moving Objects in Video This repository contains training code for the SIGGRAPH 2021 paper "Consistent Depth of Moving Objects in

Google 203 Jan 05, 2023
Super Pix Adv - Offical implemention of Robust Superpixel-Guided Attentional Adversarial Attack (CVPR2020)

Super_Pix_Adv Offical implemention of Robust Superpixel-Guided Attentional Adver

DLight 8 Oct 26, 2022
A Python package for faster, safer, and simpler ML processes

Bender 🤖 A Python package for faster, safer, and simpler ML processes. Why use bender? Bender will make your machine learning processes, faster, safe

Otovo 6 Dec 13, 2022
Official implementation of FCL-taco2: Fast, Controllable and Lightweight version of Tacotron2 @ ICASSP 2021

FCL-Taco2: Towards Fast, Controllable and Lightweight Text-to-Speech synthesis (ICASSP 2021) Paper | Demo Block diagram of FCL-taco2, where the decode

Disong Wang 39 Sep 28, 2022
Code for EMNLP'21 paper "Types of Out-of-Distribution Texts and How to Detect Them"

ood-text-emnlp Code for EMNLP'21 paper "Types of Out-of-Distribution Texts and How to Detect Them" Files fine_tune.py is used to finetune the GPT-2 mo

Udit Arora 19 Oct 28, 2022
Low Complexity Channel estimation with Neural Network Solutions

Interpolation-ResNet Invited paper for WSA 2021, called 'Low Complexity Channel estimation with Neural Network Solutions'. Low complexity residual con

Dianxin 10 Dec 10, 2022
天勤量化开发包, 期货量化, 实时行情/历史数据/实盘交易

TqSdk 天勤量化交易策略程序开发包 TqSdk 是一个由信易科技发起并贡献主要代码的开源 python 库. 依托快期多年积累成熟的交易及行情服务器体系, TqSdk 支持用户使用极少的代码量构建各种类型的量化交易策略程序, 并提供包含期货、期权、股票的 历史数据-实时数据-开发调试-策略回测-

信易科技 2.8k Dec 30, 2022
git《Pseudo-ISP: Learning Pseudo In-camera Signal Processing Pipeline from A Color Image Denoiser》(2021) GitHub: [fig5]

Pseudo-ISP: Learning Pseudo In-camera Signal Processing Pipeline from A Color Image Denoiser Abstract The success of deep denoisers on real-world colo

Yue Cao 51 Nov 22, 2022