Implementation of DocFormer: End-to-End Transformer for Document Understanding, a multi-modal transformer based architecture for the task of Visual Document Understanding (VDU)

Last update: Jan 06, 2023

Overview

DocFormer - PyTorch

DocFormer is a multi-modal, transformer-based architecture for Visual Document Understanding (VDU). It is pre-trained in an unsupervised fashion on tasks designed to encourage multi-modal interaction. DocFormer combines text, vision and spatial features using a novel multi-modal self-attention layer. It also shares learned spatial embeddings across modalities, which makes it easy for the model to correlate text tokens with visual tokens and vice versa. In the paper, DocFormer is evaluated on four datasets, each with strong baselines, and achieves state-of-the-art results on all of them, sometimes beating models four times its size (by parameter count).
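
As a rough illustration of the idea described above, the toy sketch below adds a spatial term to each modality's attention scores and feeds the same spatial vectors to both the text and the visual stream. It is pure Python with made-up inputs and hypothetical helper names (`attention_weights`, `attend`). It skips the learned projections and is not the layer from the paper or from this repository.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def attention_weights(content, spatial):
    """Attention weights from a content term plus a spatial term.

    Assumption: query/key projections are the identity, to keep the toy small.
    """
    scale = math.sqrt(len(content[0]))
    return [
        softmax([(dot(ci, cj) + dot(si, sj)) / scale
                 for cj, sj in zip(content, spatial)])
        for ci, si in zip(content, spatial)
    ]


def attend(content, spatial):
    """Mix the content vectors using the weights computed above."""
    weights = attention_weights(content, spatial)
    dim = len(content[0])
    return [
        [sum(w * vec[k] for w, vec in zip(row, content)) for k in range(dim)]
        for row in weights
    ]


# Made-up toy inputs: 3 token positions, 2-dimensional features.
text = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
visual = [[0.5, 0.5], [0.2, 0.8], [0.9, 0.1]]
spatial = [[0.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # shared by both modalities

text_out = attend(text, spatial)
visual_out = attend(visual, spatial)

# Assumption: the two streams are fused by a simple element-wise sum.
fused = [[t + v for t, v in zip(tr, vr)] for tr, vr in zip(text_out, visual_out)]
print(fused)
```

Because both streams receive the same spatial vectors, the spatial part of the score between positions i and j is identical in the text stream and the visual stream, which is the intuition behind sharing spatial information across modalities.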

The official implementation was not released by the authors.

Install

Importing pytesseract may fail if it is not installed. To fix that, run:

pip install pytesseract
sudo apt install tesseract-ocr

Then install DocFormer:

pip install git+https://github.com/shabie/docformer
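
If the import still fails, a quick check like the one below can show which piece is missing. This is a minimal sketch using only the Python standard library. The helper name `check_ocr_setup` is hypothetical, and the check assumes the Tesseract binary is on `PATH` under the name `tesseract`.

```python
import importlib.util
import shutil


def check_ocr_setup():
    """Report whether the OCR pieces appear to be installed."""
    return {
        # The Python wrapper installed by `pip install pytesseract`.
        "pytesseract": importlib.util.find_spec("pytesseract") is not None,
        # The OCR engine installed by `sudo apt install tesseract-ocr`
        # (assumption: the binary is named `tesseract` and is on PATH).
        "tesseract": shutil.which("tesseract") is not None,
    }


status = check_ocr_setup()
for name, found in status.items():
    print(f"{name}: {'found' if found else 'MISSING'}")
```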

Usage

from docformer import modeling, dataset
from transformers import BertTokenizerFast


config = {
  "coordinate_size": 96,
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "image_feature_pool_shape": [7, 7, 256],
  "intermediate_ff_size_factor": 4,
  "max_2d_position_embeddings": 1000,
  "max_position_embeddings": 512,
  "max_relative_positions": 8,
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "shape_size": 96,
  "vocab_size": 30522,
  "layer_norm_eps": 1e-12,
}

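# Path to the document image to process.
fp = "filepath/to/the/image.tif"

# Build the model inputs for the image using a BERT tokenizer.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
encoding = dataset.create_features(fp, tokenizer)

# Extract visual (v_bar), text (t_bar) and spatial (v_bar_s, t_bar_s) features,
# then pass them through the encoder.
feature_extractor = modeling.ExtractFeatures(config)
docformer = modeling.DocFormerEncoder(config)
v_bar, t_bar, v_bar_s, t_bar_s = feature_extractor(encoding)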
output = docformer(v_bar, t_bar, v_bar_s, t_bar_s)  # shape (1, 512, 768)
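
The numbers in the config can be sanity-checked without loading the model. The sketch below uses a hypothetical helper, `summarize`, on a subset of the config keys. It assumes that the hidden size is split evenly across attention heads, that `intermediate_ff_size_factor` multiplies the hidden size to give the feed-forward width, and that the output shape is (batch, `max_position_embeddings`, `hidden_size`), which is consistent with the shape comment above.

```python
# Subset of the config shown above; only the keys used by this check.
config = {
    "hidden_size": 768,
    "num_attention_heads": 12,
    "max_position_embeddings": 512,
    "intermediate_ff_size_factor": 4,
}


def summarize(config, batch_size=1):
    """Derive a few sizes implied by the config (hypothetical helper)."""
    hidden = config["hidden_size"]
    heads = config["num_attention_heads"]
    if hidden % heads != 0:
        raise ValueError("hidden_size must be divisible by num_attention_heads")
    return {
        # Per-head width, assuming an even split across heads.
        "head_dim": hidden // heads,
        # Assumption: the factor scales hidden_size to the feed-forward width.
        "ff_size": hidden * config["intermediate_ff_size_factor"],
        # Assumption: one output vector per position, each of width hidden_size.
        "output_shape": (batch_size, config["max_position_embeddings"], hidden),
    }


info = summarize(config)
print(info)
```

With the values above, this gives a per-head width of 64, a feed-forward width of 3072, and an output shape of (1, 512, 768).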

License

MIT

Maintainers

Contribute

Citations

@InProceedings{Appalaraju_2021_ICCV,
    author    = {Appalaraju, Srikar and Jasani, Bhavan and Kota, Bhargava Urala and Xie, Yusheng and Manmatha, R.},
    title     = {DocFormer: End-to-End Transformer for Document Understanding},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2021},
    pages     = {993-1003}
}
