Convert BART models to ONNX with quantization: a 3X reduction in size and up to a 3X boost in inference speed.

Last update: Dec 09, 2022

fast-Bart

Reduces BART model size by 3X and boosts inference speed by up to 3X.

BART implementation of the fastT5 library (https://github.com/Ki6an/fastT5)

PyTorch model -> ONNX model -> Quantized ONNX model
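To give a feel for where the size reduction in the last step comes from, here is a minimal sketch of 8-bit affine quantization in plain Python. It assumes the quantization step stores 32-bit float weights as 8-bit integers, which the text above does not spell out. The names `quantize_int8` and `dequantize` are illustrative and are not part of fastBart.

```python
import struct

def quantize_int8(weights):
    """Map floats to unsigned 8-bit integers with one shared scale and zero point.

    Illustrative only: an assumed scheme, not fastBart's actual implementation.
    """
    # Include 0.0 in the range so that zero is exactly representable.
    lo = min(min(weights), 0.0)
    hi = max(max(weights), 0.0)
    scale = (hi - lo) / 255 or 1.0  # fall back to 1.0 for all-zero input
    zero_point = round(-lo / scale)
    q = [min(255, max(0, round(w / scale) + zero_point)) for w in weights]
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Recover approximate float values from the 8-bit representation."""
    return [(v - zero_point) * scale for v in q]

weights = [-0.42, 0.0, 0.13, 0.97, -1.0, 0.5]
q, scale, zero_point = quantize_int8(weights)
restored = dequantize(q, scale, zero_point)

fp32_bytes = len(struct.pack(f"{len(weights)}f", *weights))  # 4 bytes per weight
int8_bytes = len(bytes(q))                                   # 1 byte per weight
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

In this toy case the raw weights shrink by 4X, at the cost of a rounding error bounded by roughly one quantization step. The 3X figure quoted above is for the exported model as a whole, so the two numbers are not expected to match exactly.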

Install

Clone the repository and install the dependencies from requirements.txt:

git clone https://github.com/siddharth-sharma7/fast-Bart
cd fast-Bart
pip install -r requirements.txt

Usage

The export_and_get_onnx_model() function exports the given pretrained BART model to ONNX, quantizes it, and runs it on ONNX Runtime with default settings. The model it returns supports the Hugging Face generate() method.

If you don't want to quantize the model, pass quantized=False to the function.

from fastBart import export_and_get_onnx_model
from transformers import AutoTokenizer

model_name = 'facebook/bart-base'
model = export_and_get_onnx_model(model_name)

tokenizer = AutoTokenizer.from_pretrained(model_name)
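# Named `text` rather than `input` to avoid shadowing the Python built-in
text = "This is a very long sentence and needs to be summarized."
inputs = tokenizer(text, return_tensors='pt')

# Beam search with 3 beams
tokens = model.generate(input_ids=inputs['input_ids'],
                        attention_mask=inputs['attention_mask'],
                        num_beams=3)

# A single input yields a batch of one sequence, so squeeze() drops the batch dimension
output = tokenizer.decode(tokens.squeeze(), skip_special_tokens=True)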
print(output)

To run an already exported model, use get_onnx_model().

You can also customize the whole pipeline, as shown in the code example below:

from fastBart import (OnnxBart, get_onnx_runtime_sessions,
                    generate_onnx_representation, quantize)
from transformers import AutoTokenizer

model_or_model_path = 'facebook/bart-base'

# Step 1. Convert the Hugging Face BART model to ONNX
onnx_model_paths = generate_onnx_representation(model_or_model_path)

# Step 2. (Recommended) Quantize the converted model for faster inference and a smaller model size.
# This is slow for the decoder and init-decoder ONNX files (it can take up to 15 minutes).
quant_model_paths = quantize(onnx_model_paths)

# Step 3. Set up the ONNX Runtime sessions
model_sessions = get_onnx_runtime_sessions(quant_model_paths)

# Step 4. Get the ONNX model
model = OnnxBart(model_or_model_path, model_sessions)

                      ...

Custom output paths

By default, fastBart creates a models-bart folder in the current directory and stores all exported models there. You can instead provide a custom folder path for the exported models. To run already exported models that are stored in a custom folder, use get_onnx_model(onnx_models_path="/path/to/custom/folder/").

from fastBart import export_and_get_onnx_model, get_onnx_model

model_name = "facebook/bart-base"
custom_output_path = "/path/to/custom/folder/"

# 1. Export the model and store the ONNX files in custom_output_path
model = export_and_get_onnx_model(model_name, custom_output_path)

# 2. Run already exported models that are stored in the custom path
# model = get_onnx_model(model_name, custom_output_path)
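Since exporting is slow and loading is fast, it can be convenient to pick between the two calls automatically. The following is a minimal sketch of that idea, not part of fastBart: `load_or_export` is a hypothetical helper, and it assumes that the exported files carry a `.onnx` extension and sit directly in the custom folder. The two fastBart functions are passed in as arguments so the helper itself has no dependencies.

```python
from pathlib import Path

def load_or_export(model_name, folder, export_fn, load_fn):
    """Reuse exported ONNX files in `folder` if any exist, otherwise export.

    Hypothetical helper. With fastBart, `export_fn` would be
    export_and_get_onnx_model and `load_fn` would be get_onnx_model,
    both called as fn(model_name, folder_path).
    """
    folder = Path(folder)
    # Assumption: exported models are stored as *.onnx files in this folder.
    has_onnx_files = folder.is_dir() and any(folder.glob("*.onnx"))
    if has_onnx_files:
        return load_fn(model_name, str(folder))
    folder.mkdir(parents=True, exist_ok=True)
    return export_fn(model_name, str(folder))

# Intended use (requires fastBart to be installed):
# from fastBart import export_and_get_onnx_model, get_onnx_model
# model = load_or_export("facebook/bart-base", "/path/to/custom/folder/",
#                        export_and_get_onnx_model, get_onnx_model)
```

The check only looks for the presence of `.onnx` files; it does not verify that the folder holds a complete or quantized export, so a partially written folder would still need to be cleaned up by hand.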

Functionalities

Export any pretrained Bart model to ONNX easily.
The exported model supports beam search, greedy search, and more via the generate() method.
Reduce the model size by 3X using quantization.
Up to 3X speedup compared to PyTorch execution for greedy search and 2-3X for beam search.
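Speedup figures like these depend on hardware, input length, and generation settings, so it is worth measuring them on your own setup. Below is a minimal timing sketch using only the standard library. The helpers `median_latency` and `speedup` are illustrative names, not part of fastBart, and the commented usage assumes you have both a PyTorch BART model and the exported ONNX model loaded.

```python
import statistics
import time

def median_latency(fn, warmup=2, runs=10):
    """Return the median wall-clock time, in seconds, of calling fn()."""
    for _ in range(warmup):
        fn()  # warm-up calls are not timed
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)

def speedup(baseline_fn, candidate_fn, warmup=2, runs=10):
    """Ratio of baseline latency to candidate latency (above 1 means faster)."""
    baseline = median_latency(baseline_fn, warmup, runs)
    candidate = median_latency(candidate_fn, warmup, runs)
    return baseline / candidate

# Intended use (torch_model and onnx_model are hypothetical, already loaded):
# ratio = speedup(lambda: torch_model.generate(**inputs, num_beams=3),
#                 lambda: onnx_model.generate(**inputs, num_beams=3))
```

The median is used instead of the mean so that a single slow run, for example from a background process, does not skew the comparison.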

Owner

Siddharth Sharma
