Simple and understandable swin-transformer OCR project

Last update: Dec 31, 2022

swin-transformer-ocr

Overview

Simple and understandable swin-transformer OCR project. The model in this repository relies heavily on high-level open-source projects such as timm and x_transformers. The training procedure is also easy to follow thanks to the legibility of pytorch-lightning.

The model in this repository encodes the input image into a context vector with the 'shifted-window' mechanism, which is how swin-transformer encodes images. It then decodes that vector with a standard auto-regressive transformer.
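
As a rough illustration of the shifted-window idea, the sketch below partitions a tiny 4x4 grid into 2x2 windows, once directly and once after a cyclic shift of half a window. It is a plain-Python toy, not the code used in this repository (which takes its encoder from timm); the function names `cyclic_shift` and `window_partition` are hypothetical, and a real swin-transformer additionally masks attention between cells that only became neighbours through the wrap-around.

```python
def cyclic_shift(grid, shift):
    """Roll a 2D grid up and to the left by `shift` cells, wrapping around."""
    rows = grid[shift:] + grid[:shift]
    return [row[shift:] + row[:shift] for row in rows]


def window_partition(grid, window_size):
    """Split a 2D grid into non-overlapping window_size x window_size windows."""
    height, width = len(grid), len(grid[0])
    if height % window_size or width % window_size:
        raise ValueError("grid size must be a multiple of window_size")
    windows = []
    for top in range(0, height, window_size):
        for left in range(0, width, window_size):
            rows = grid[top:top + window_size]
            windows.append([row[left:left + window_size] for row in rows])
    return windows


# A 4x4 "feature map" whose cells are numbered 0..15 in row-major order.
grid = [[r * 4 + c for c in range(4)] for r in range(4)]
window_size = 2

# Regular windows: attention would stay inside each 2x2 block.
regular = window_partition(grid, window_size)

# Shifted windows: shift by half a window, then partition again.
shifted = window_partition(cyclic_shift(grid, window_size // 2), window_size)

print(regular[0])  # [[0, 1], [4, 5]]
print(shifted[0])  # [[5, 6], [9, 10]]
```

The first shifted window holds cells 5, 6, 9 and 10, one from each of the four regular windows, which is how alternating the two layouts lets information flow across window borders.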

If you are not familiar with the transformer OCR structure, transformer-ocr may be easier to understand because it uses a traditional convolutional network (ResNet-v2) for the encoder.

Performance

On a private Korean handwritten text dataset, the exact-match accuracy is 97.6%.
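
For readers unfamiliar with the metric, here is a minimal sketch of exact-match accuracy. It assumes a prediction counts as correct only when the whole predicted string equals the label, with no case or whitespace normalization; the repository's own evaluation code is not shown here, so treat the helper `exact_match_accuracy` as hypothetical.

```python
def exact_match_accuracy(predictions, references):
    """Fraction of predictions that equal their reference string exactly."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    matches = sum(pred == ref for pred, ref in zip(predictions, references))
    return matches / len(references)


# Made-up strings for illustration only: the third pair differs in case.
preds = ["Hello World.", "vision-transformer-ocr", "swin"]
refs = ["Hello World.", "vision-transformer-ocr", "Swin"]
accuracy = exact_match_accuracy(preds, refs)
print(f"{accuracy:.3f}")  # 0.667
```

Because a single wrong character makes the whole sample count as a miss, this metric is stricter than character-level measures.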

Data

./dataset/
├─ preprocessed_image/
│  ├─ cropped_image_0.jpg
│  ├─ cropped_image_1.jpg
│  ├─ ...
├─ train.txt
└─ val.txt

# in train.txt
cropped_image_0.jpg\tHello World.
cropped_image_1.jpg\tvision-transformer-ocr
...

You should preprocess the data first. Crop each image to a word- or sentence-level region. Put all the image data in one directory. Provide the ground truth in a txt file: on each line, write the image file name and its label, separated by \t.
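
The label file format described above could be read with something like the following sketch. It assumes the `\t` in the example stands for a real tab character and that the files are UTF-8 encoded; `parse_label_lines` and `load_labels` are hypothetical helpers, not functions from this repository.

```python
from pathlib import Path


def parse_label_lines(lines):
    """Turn 'file_name<TAB>label' lines into (file_name, label) pairs."""
    pairs = []
    for line_number, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        # Split on the first tab only, so labels may contain spaces.
        file_name, separator, label = line.partition("\t")
        if not separator:
            raise ValueError(f"line {line_number}: missing tab separator")
        pairs.append((file_name, label))
    return pairs


def load_labels(txt_path):
    """Read a label file such as train.txt or val.txt."""
    with Path(txt_path).open(encoding="utf-8") as handle:
        return parse_label_lines(handle)


# The two sample lines from the train.txt example.
sample = [
    "cropped_image_0.jpg\tHello World.\n",
    "cropped_image_1.jpg\tvision-transformer-ocr\n",
]
pairs = parse_label_lines(sample)
print(pairs[0])  # ('cropped_image_0.jpg', 'Hello World.')
```

Running a check like this over train.txt and val.txt before training is a cheap way to catch lines that use spaces instead of a tab.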

Configuration

In the settings/ directory, you can find default.yaml. You can set almost every hyper-parameter in that file. Copy it and edit the copy for each experiment version. I recommend running with the default settings first, before you change anything.

Train

python run.py --version 0 --setting settings/default.yaml --num_workers 16 --batch_size 128

You can check your training log with tensorboard.

tensorboard --logdir tb_logs --bind_all

Predict

Once training finishes, you can use the model for prediction.

python predict.py --setting <your_setting.yaml> --target <image_or_directory> --tokenizer <your_tokenizer_pkl> --checkpoint <saved_checkpoint>

Exporting to ONNX

You can export your model to ONNX format. It's very easy thanks to pytorch-lightning. See the related pytorch-lightning documentation.

Citations

@misc{liu-2021,
    title         = {Swin Transformer: Hierarchical Vision Transformer using Shifted Windows},
    author        = {Ze Liu and Yutong Lin and Yue Cao and Han Hu and Yixuan Wei and Zheng Zhang and Stephen Lin and Baining Guo},
    year          = {2021},
    eprint        = {2103.14030},
    archivePrefix = {arXiv}
}

Owner

Ha YongWook
