LightSeq: A High-Performance Inference Library for Sequence Processing and Generation

Overview



Chinese version of this introduction

LightSeq is a high-performance inference library for sequence processing and generation implemented in CUDA. It enables highly efficient computation of modern NLP models such as BERT, GPT2, and Transformer. It is therefore well suited for machine translation, text generation, dialog, language modeling, and other tasks built on these models.

The library is built on top of the official CUDA libraries (cuBLAS, Thrust, CUB) and custom kernel functions that are specially fused and optimized for these widely used models. In addition to the model components, we also provide code to manage model weights trained with deep learning frameworks, as well as a server that acts as a custom backend for the TensorRT Inference Server (referred to as TRTIS in the following). With LightSeq, you can easily deploy efficient model services or develop your own model architectures with only a few code changes.

Features

  • Comprehensive sequence modeling support, including BERT, GPT, Transformer and their VAE variants.
  • Various search methods, such as beam search, diverse beam search, and top-k/top-p sampling (see the sketch after this list).
  • Out-of-the-box rich middleware for model serving based on TRTIS, such as dynamic batching and multi-model on a single GPU.
  • State-of-the-art inference performance compared with deep learning frameworks and other inference libraries.
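
To illustrate the sampling-based search methods listed above, here is a minimal NumPy sketch of top-k/top-p (nucleus) filtering. It only shows the selection logic on a single logits vector; LightSeq implements these methods inside fused CUDA kernels, so this is not the library's actual code.

import numpy as np

def top_k_top_p_sample(logits, k=50, p=0.9, rng=None):
    """Sample one token id from a logits vector after top-k and top-p filtering."""
    rng = rng or np.random.default_rng()
    # keep only the k highest-scoring tokens
    kth_largest = np.sort(logits)[-k] if k < len(logits) else logits.min()
    logits = np.where(logits < kth_largest, -np.inf, logits)
    # softmax over the remaining tokens
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # keep the smallest set of tokens whose cumulative probability exceeds p
    order = np.argsort(-probs)
    cutoff = np.searchsorted(np.cumsum(probs[order]), p) + 1
    keep = order[:cutoff]
    return rng.choice(keep, p=probs[keep] / probs[keep].sum())

print(top_k_top_p_sample(np.array([2.0, 1.0, 0.5, -1.0]), k=3, p=0.8))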

The following is a support matrix of LightSeq compared with TurboTransformers and FasterTransformer.

Performance

Here we show our experimental results on neural machine translation and text generation. Both tasks use a Transformer-base model, but with beam search and sampling as the search methods, respectively. We choose TensorFlow and FasterTransformer for comparison; the implementation from tensor2tensor was used as the TensorFlow benchmark.

More results are available here.

  • Neural machine translation

  • Text generation

Code Structure

├── CMakeLists.txt # cmake build file
├── CONTRIBUTING.md 
├── example
│   ├── CMakeLists.txt
│   ├── decoder_example.cc.cu # transformer decoder only example
│   ├── gpt_generation.cc.cu # GPT generation example
│   ├── gptlm_example.cc.cu # GPT language model example
│   ├── transformer_example.cc.cu # Transformer translation example
│   └── transformer_generate_example.cc.cu # Transformer generation example
├── kernels
│   ├── CMakeLists.txt
│   ├── common.h # common kernel functions 
│   ├── gptKernels.cc.cu # GPT kernel functions
│   ├── gptKernels.h
│   ├── transformerKernels.cc.cu # Transformer kernel functions
│   └── transformerKernels.h
├── LICENSE
├── model
│   ├── CMakeLists.txt
│   ├── decoder.cc.cu # Transformer decoder
│   ├── decoder.h
│   ├── encoder.cc.cu # Transformer encoder
│   ├── encoder.h
│   ├── gpt_encoder.cc.cu # GPT encoder
│   └── gpt_encoder.h
├── NOTICE
├── proto
│   ├── CMakeLists.txt
│   ├── gpt.proto # proto file to save GPT model
│   ├── gpt_weight.cc # GPT weight class
│   ├── gpt_weight.h
│   ├── transformer.proto # proto file to save Transformer model
│   ├── transformer_weight.cc # Transformer weight class
│   └── transformer_weight.h
├── pywrapper
│   ├── CMakeLists.txt
│   ├── transformer.cc.cu # python wrapper for Transformer
│   ├── transformer_decoder.cc.cu # python wrapper for Transformer decoder
│   └── wrapper.cc # pybind registeration
├── README.md
├── server # custom engine for Triton
│   ├── CMakeLists.txt
│   ├── custom.h # Triton dependency
│   ├── decoder_generate_server.cc.cu
│   ├── generate_server.cc.cu
│   ├── gpt_generate_server.cc.cu
│   ├── gptlm_server.cc.cu
│   ├── libserver.ldscript # Triton dependency
│   ├── model_config_cuda.h # Triton dependency
│   ├── model_config.h # Triton dependency
│   ├── model_config.proto # Triton dependency
│   └── transformer_server.cc.cu 
└── tools
    ├── CMakeLists.txt
    ├── util.cc.cu
    └── util.h

Quick Start

Run from HuggingFace bart

We provide an end2end bart-base example to show how fast LightSeq is compared to HuggingFace. First, install the requirements.

pip install torch tensorflow transformers lightseq
cd example/python

Then you can check the performance by simply running the following commands. hf_bart_export.py is used to convert the PyTorch weights to the LightSeq protobuf format.

python hf_bart_export.py
python ls_bart.py

On our Tesla V100 we get the following output; a 47x speedup is obtained by running LightSeq rather than HuggingFace.

=========================lightseq=========================
lightseq generating...
lightseq time: 0.034502994269132614s
lightseq results:
I love that girl, but she does not love me.
She is so beautiful that I can not help glance at her.
Nothing's gonna change my love for you.
Drop everything now. Meet me in the pouring rain. Kiss me on the sidewalk.
=========================huggingface=========================
huggingface generating...
huggingface time: 1.6297104470431805s
huggingface results:
I love that girl, but she does not love me.
She is so beautiful that I can not help glance at her.
Nothing's gonna change my love for you.
Drop everything now. Meet me in the pouring rain. Kiss me on the sidewalk.
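
Under the hood, ls_bart.py roughly does the following (a simplified sketch using the lightseq.inference API; see example/python/ls_bart.py for the full script, including the HuggingFace baseline and result decoding):

import lightseq.inference as lsi
from transformers import BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
# load the weight file produced by hf_bart_export.py (file name and format may differ by version)
ls_model = lsi.Transformer("lightseq_bart_base.hdf5", 128)  # 128 is the max batch size

ids = tokenizer(["I love that girl, but <mask> does not love me."], return_tensors="np")["input_ids"]
outputs = ls_model.infer(ids)  # generated token ids (and their scores in recent versions)
print(outputs)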

LightSeq installation from PyPI only supports Python 3.6 to 3.8 on Linux for now. Consider compiling from source if you have a different environment.

Run python wrapper

We provide a Python API to call LightSeq. All you need to do is install lightseq with pip and make sure your GPU driver is not older than 418.40.04.

Then check the proto/*.proto files to prepare your model weights. We provide an example weight file for you to test.

curl -OL https://github.com/bytedance/lightseq/releases/download/v0.0.1/transformer_weight.tar.gz
tar -zxvf transformer_weight.tar.gz

Finally, you can run LightSeq in only a few lines!

import lightseq
import numpy as np

test_input = np.array([[5001, 2, 36, 5002]])
transformer = lightseq.Transformer("transformer.pb", 32) # 32 is the max batch size; it determines GPU memory occupancy
result = transformer.infer(test_input)
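
For batched requests, pad all sequences in the batch to the same length and keep the batch size within the limit passed to the constructor. Continuing the example above (the pad id 0 is an assumption; use the pad token of your own vocabulary):

batch = np.array([
    [5001, 2, 36, 5002],
    [5001, 9, 5002, 0],  # padded with 0 here; the real pad id depends on your vocabulary
])
assert batch.shape[0] <= 32  # must not exceed the max batch size given to lightseq.Transformer
results = transformer.infer(batch)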

The Python API doesn't support GPT for now; we will get it ready as soon as possible.

Run inference server

Requirements

To avoid problems caused by inconsistent environments, you can use the pre-built TRTIS container from NVIDIA GPU Cloud (NGC). To start the container, you need to install nvidia-docker and make sure your GPU driver version is >= 410.48.

docker pull nvcr.io/nvidia/tensorrtserver:19.05-py3
# start the container, mapping the TRTIS ports and mounting the quick start directory
docker run --gpus '"device=0"' -it --rm -p8000:8000 -p8001:8001 -p8002:8002 \
    -v /${current}/${path}:/quick_start nvcr.io/nvidia/tensorrtserver:19.05-py3 /bin/bash
# inside the container
cd /quick_start

Use our pre-built lib

To quickly deploy a model currently supported by LightSeq, you can download the pre-built libraries from the GitHub release page of the version you are interested in. In each release, we upload binary executable examples and dynamic link libraries of the models, which serve as custom backends for TRTIS.

wget https://github.com/bytedance/lightseq/releases/download/${VERSION}/${VERSION}_libs.tar.gz
tar -zxvf ${VERSION}_libs.tar.gz

Run local inference demo

To run the local inference demo, you need model weights saved in the custom proto defined by LightSeq, as well as input token ids. We provide a GPT-LM model and its corresponding input token ids:

wget https://github.com/bytedance/lightseq/releases/download/v0.0.1/v0.0.1_gptlm.pkg.tar.gz
tar -zxvf v0.0.1_gptlm.pkg.tar.gz
# fp32 example
./{VERSION}_libs/gptlm_example.fp32 ./v0.0.1_gptlm.pkg/gpt.pb ./v0.0.1_gptlm.pkg/test_case
# fp16 example
./{VERSION}_libs/gptlm_example.fp16 ./v0.0.1_gptlm.pkg/gpt.pb ./v0.0.1_gptlm.pkg/test_case

To run the end-to-end model server based on TRTIS, you need to prepare a custom backend model repository like this:

models/
  <model-name>/
    config.pbtxt # configuration
    xxx # model weights
    1/
      libyyy.so # custom dynamic link library

With the pre-built libraries and example weights mentioned above, you can easily run a server:

mkdir -p ./model_zoo/gptlm/1
wget https://github.com/bytedance/lightseq/releases/download/v0.0.1/v0.0.1_gptlm.config.pbtxt
mv v0.0.1_gptlm.config.pbtxt model_zoo/gptlm/config.pbtxt
cp ./v0.0.1_gptlm.pkg/gpt.pb model_zoo/gptlm/gpt.pb
cp ./{VERSION}_libs/libgptlm.so.fp32 model_zoo/gptlm/1/libgptlm.so
# or fp16 server
# cp ./{VERSION}_libs/libgptlm.so.fp16 model_zoo/gptlm/1/libgptlm.so
export MODEL_ZOO="/quick_start/model_zoo"
trtserver --model-store=${MODEL_ZOO}

After starting the server, invoking the TRTIS client will return the inference results.
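
For example, you can first check that the model has been loaded through the HTTP status endpoint exposed by TRTIS 1.x on the port mapped in the docker command above. A minimal probe (the TRTIS client libraries are the usual way to send actual inference requests):

import requests

# query the status of the gptlm model served above (TRTIS 1.x HTTP status API)
resp = requests.get("http://localhost:8000/api/status/gptlm")
print(resp.status_code)  # 200 once the server is up
print(resp.text)         # human-readable status; the model should be reported as ready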

Serve your own model

To serve your own model, you need to export a model trained with a deep learning framework (e.g. TensorFlow, PyTorch) to the custom model proto defined by LightSeq. Furthermore, you may need to build from source if you want to modify the model architectures or serve a new model not currently supported by LightSeq.
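
The export step follows the same pattern as the provided scripts (e.g. hf_bart_export.py): build the proto message, copy each trained tensor into the matching field, and serialize it to disk. Below is a minimal sketch, assuming the top-level message of transformer.proto is named Transformer and has been compiled with protoc into transformer_pb2; the field path shown is illustrative, not the exact schema:

import torch
from transformer_pb2 import Transformer  # generated by protoc from proto/transformer.proto

state_dict = torch.load("checkpoint.pt", map_location="cpu")
model = Transformer()

# For each trained tensor, flatten it and copy it into the corresponding repeated-float
# field of the proto. The field path below is hypothetical; the real layout is defined
# in proto/transformer.proto.
# model.src_embedding.token_embedding.extend(
#     state_dict["encoder.embed_tokens.weight"].flatten().tolist())

with open("transformer.pb", "wb") as fout:
    fout.write(model.SerializeToString())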

Limitations and Future Plans

LightSeq does not support CPU inference for now, and its compilation relies heavily on TRTIS; we will try to solve these problems in the future. Furthermore, the following will be the focus of our future work:

  • Support more model architectures and decoding search algorithms.
  • Int8 inference.
  • Device deployment.

Cite us

Our paper has been accepted by NAACL 2021 (Industry Track). If you use LightSeq in your research publication, please cite this paper.

@article{wang2021lightseq,
      title={LightSeq: A High Performance Inference Library for Transformers}, 
      author={Xiaohui Wang and Ying Xiong and Yang Wei and Mingxuan Wang and Lei Li},
      journal={arXiv preprint arXiv:2010.13887},
      year={2021}
}

Contact

Join the lark group in the blog to reach us instantly (lark registration required).

For any questions or suggestions, please feel free to contact us: [email protected], [email protected], [email protected]

Comments
  • RuntimeError: Parse weights from [lightseq_bart_base.hdf5] failed

    RuntimeError: Parse weights from [lightseq_bart_base.hdf5] failed

    When I tried to run the example case like this

    python hf_bart_export.py
    python ls_bart.py
    

    It has some errors

    initializing bart tokenizer...
    creating lightseq model...
    Traceback (most recent call last):
      File "ls_bart.py", line 102, in <module>
        main()
      File "ls_bart.py", line 69, in main
        ls_model = lsi.Transformer("lightseq_bart_base.hdf5", 128)
    RuntimeError: Parse weights from [lightseq_bart_base.hdf5] failed.
    

    Alright,I tried to run other case , huggingface gpt2 in examples:

    python hf_gpt2_export.py
    python ls_gpt.py
    

    It had some error again:

    initializing gpt tokenizer...
    Downloading: 100%|███████████████████████████████████████████████████████| 1.04M/1.04M [00:00<00:00, 1.81MB/s]
    Downloading: 100%|█████████████████████████████████████████████████████████| 456k/456k [00:00<00:00, 1.36MB/s]
    Downloading: 100%|███████████████████████████████████████████████████████| 1.36M/1.36M [00:00<00:00, 2.29MB/s]
    lightseq tokenizer pad token id: 50257
    huggingface tokenizer pad token id: 50256
    creating lightseq model...
    Traceback (most recent call last):
      File "ls_gpt.py", line 119, in <module>
        main()
      File "ls_gpt.py", line 79, in main
        ls_model = lsi.Gpt("lightseq_gpt2_base.hdf5", max_batch_size=16)
    TypeError: __init__(): incompatible constructor arguments. The following argument types are supported:
        1. lightseq.inference.Gpt(weight_path: str, max_batch_size: int, max_step: int)
    
    Invoked with: 'lightseq_gpt2_base.hdf5'; kwargs: max_batch_size=16
    

    I don't know how to fix them. Can you give me some advice? Thank you very much.

    opened by juha0 21
  • lightseq inference abnormal using ls_fs_transformer_export.py exported model

    lightseq inference abnormal using ls_fs_transformer_export.py exported model

    Hi, I used python export/ls_fs_transformer_export.py to export a LightSeq-trained NMT model for inference, but I found the result is quite abnormal. These are some details output in the test part of ls_fs_transformer_export.py.

    generator config beam size: 4 extra decode length(max decode length - src input length): 50 length penalty: 0.6 diverse lambda: 0 sampling method: beam_search topk: 1 topp: 0.75 Allocated 882MB GPU buffer for transformer decoder buffer init start decoder buffer init succeed pb results: (array([[[ 4, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 6]]], dtype=int32), array([[0.]], dtype=float32)) hdf5 results: (array([[[ 4, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 164, 6]]], dtype=int32), array([[0.]], dtype=float32))

    I also tested more examples, and it continued to generate some repeated logits, and when I decoded the array with my tgt_dict, it generated something like this:

    thesamesamesamesamesamesamesamesamesamesamesamesamesamesamesamesamesamesamesamesamesamesamesamesamesamesamesamesamesamesamesamesamesamesamesamesamesamesamesamesamesamesamesamesamesamesamesamesamesamesamesamesamesamesamesame.

    I used fairseq 0.10.2 and lightseq 2.1.4, and the lightseq-generate result seems normal. I think maybe something went wrong in the export procedure. Looking forward to your reply.

    opened by dearchill 18
  • fix pos embedding index bug

    fix pos embedding index bug

    Fixed the implementation of position embedding

    • the size of the position matrix is determined by the max_positions parameter
    • ignore all the padding tokens when calculating the token position
    • the position index begins from padding_idx + 1, consistent with the fairseq implementation
    opened by nomadlx 13
  • No acceleration compared with timm vit block

    No acceleration compared with timm vit block

    I use the code below to test the ViT block speed. The output shows the speed is almost the same between PyTorch and LightSeq.

    Did I miss something?

    Output for forward only:

    timm finished 500 running, avg_time: 76.379987 ms
    light_seq finished 500 running, avg_time: 75.543549 ms

    The output for forward + backward:

    timm finished 500 running, avg_time: 228.803998 ms
    light_seq finished 500 running, avg_time: 227.007331 ms

    from timm.models.vision_transformer import Block
    from lightseq.training.ops.pytorch.transformer_encoder_layer import LSTransformerEncoderLayer
    from easydict import EasyDict as edict
    import torch.nn as nn
    import torch
    import time
    import sys
    sys.path.append('./')
    
    
    torch.backends.cudnn.benchmark = True
    
    
    def generate_dummy_data(args):
        inputs = torch.randn([args.bs, args.num_token, args.dim]).cuda()
        return (inputs, )
    
    
    def get_timm_block(args):
        return Block(
            dim=args.dim,
            num_heads=args.num_heads,
            mlp_ratio=args.mlp_ratio,
            qkv_bias=False,
            drop=False,
            attn_drop=False,
            init_values=None,
            drop_path=0,
            act_layer=nn.GELU,
            norm_layer=nn.LayerNorm
        )
    
    class LSBlockWrapper(LSTransformerEncoderLayer):
        def forward(self, x):
            B, N, C = x.shape
            mask = torch.zeros([B, N, N], device=x.device, dtype=x.dtype)
            return super().forward(x, mask)
    
    def get_ls_block(args):
        config = LSBlockWrapper.get_config(
            max_batch_tokens=args.num_token * args.bs,
            max_seq_len=args.num_token,
            hidden_size=args.dim,
            intermediate_size=int(args.mlp_ratio * args.dim),
            nhead=args.num_heads,
            attn_prob_dropout_ratio=0,
            hidden_dropout_ratio=0,
            activation_dropout_ratio=0,
            pre_layer_norm=True,
            fp16=False,
            local_rank=0,
            activation_fn='gelu')
        return LSBlockWrapper(
                config=config,
                initial_weights=None,
                initial_biases=None
            )
    
    
    def run(module, args, name='Unknown'):
        inputs = generate_dummy_data(args)
    
        # cudnn warmup
        for _ in range(50):
            if args.backward:
                module(*inputs).sum().backward()
            else:
                module(*inputs)
    
        torch.cuda.synchronize()
        t0 = time.time()
    
        for _ in range(args.num_iter):
            if args.backward:
                module(*inputs).sum().backward()
            else:
                module(*inputs)
    
        torch.cuda.synchronize()
        t1 = time.time()
    
        avg_time = (t1 - t0) * 1000 / args.num_iter
        print(
            f'>>> {name} finished {args.num_iter} running, avg_time: {avg_time:.6f} ms')
        return avg_time
    
    
    def main():
        args = edict()
        args.num_iter = 500
        args.backward = False
    
        args.bs = 128
        args.dim = 1280
        args.num_heads = 16
        args.mlp_ratio = 4.0
        args.num_token = 256
    
        timm_block = get_timm_block(args).cuda()
        ls_block = get_ls_block(args).cuda()
    
        run(timm_block, args, name='timm')
        run(ls_block, args, name='light_seq')
    
        print('Finished.')
    
    if __name__ == '__main__':
        main()
    
    opened by woolpeeker 11
  • Gpt exceeds maximum protobuf size of 2GB: 3096122166

    Gpt exceeds maximum protobuf size of 2GB: 3096122166

    When I use lightseq (2.0) to export gpt2-large, it raises an error: ValueError: Message Gpt exceeds maximum protobuf size of 2GB: 3096122166

    hf_gpt2_export.py is as follows

    
    if __name__ == "__main__":
        output_lightseq_model_name = "lightseq_gpt2_large.pb"
        input_huggingface_gpt_model = "gpt2-large"
        head_number = 36
        # generation_method should be "topk" or "topp"
        generation_method = "topk"
        topk = 1
        topp = 0.75
        # default eos_id from https://huggingface.co/transformers/model_doc/gpt2.html#gpt2lmheadmodel
        eos_id = 50256
        pad_id = 50257
        extract_gpt_weights(
            output_lightseq_model_name,
            input_huggingface_gpt_model,
            head_num=head_number,  # layer number
            generation_method=generation_method,
            topk=topk,
            topp=topp,
            eos_id=eos_id,
            pad_id=pad_id,
        )
    
    
    ['transformer.h.34.mlp.c_proj.bias'] -> ffn_second_bias, shape: (1280,), convert finished.
    ['transformer.h.35.ln_1.weight'] -> multihead_norm_scale, shape: (1280,), convert finished.
    ['transformer.h.35.ln_1.bias'] -> multihead_norm_bias, shape: (1280,), convert finished.
    ['transformer.h.35.attn.c_attn.weight'] -> multihead_project_kernel_qkv, shape: (1280, 3840), convert finished.
    ['transformer.h.35.attn.c_attn.bias'] -> multihead_project_bias_qkv, shape: (3840,), convert finished.
    ['transformer.h.35.attn.c_proj.weight'] -> multihead_project_kernel_output, shape: (1280, 1280), convert finished.
    ['transformer.h.35.attn.c_proj.bias'] -> multihead_project_bias_output, shape: (1280,), convert finished.
    ['transformer.h.35.ln_2.weight'] -> ffn_norm_scale, shape: (1280,), convert finished.
    ['transformer.h.35.ln_2.bias'] -> ffn_norm_bias, shape: (1280,), convert finished.
    ['transformer.h.35.mlp.c_fc.weight'] -> ffn_first_kernel, shape: (1280, 5120), convert finished.
    ['transformer.h.35.mlp.c_fc.bias'] -> ffn_first_bias, shape: (5120,), convert finished.
    ['transformer.h.35.mlp.c_proj.weight'] -> ffn_second_kernel, shape: (5120, 1280), convert finished.
    ['transformer.h.35.mlp.c_proj.bias'] -> ffn_second_bias, shape: (1280,), convert finished.
    ['transformer.ln_f.weight'] -> norm_scale, shape: (1280,), convert finished.
    ['transformer.ln_f.bias'] -> norm_bias, shape: (1280,), convert finished.
    ['transformer.wte.weight'] -> token_embedding, shape: (50257, 1280), convert finished.
    ['transformer.wpe.weight'] -> position_embedding, shape: (1024, 1280), convert finished.
    Wrting to lightseq_gpt2_large.pb
    Traceback (most recent call last):
      File "hf_gpt2_export.py", line 127, in <module>
        pad_id=pad_id,
      File "hf_gpt2_export.py", line 100, in extract_gpt_weights
        fout.write(gpt.SerializeToString())
    ValueError: Message Gpt exceeds maximum protobuf size of 2GB: 3096122166
    
    opened by zmingshi 8
  • [CUDA][ERROR]: misaligned address

    [CUDA][ERROR]: misaligned address

    Hi, I have one question. When a large amount of text is sent to the model, the model starts out running properly. After the model runs for a period of time, the program reports an error: [CUDA][ERROR] /tmp/build-via-sdist-uagdfpbf/lightseq-2.2.1/lightseq/inference/pywrapper/gpt.cc.cu(160): misaligned address.

    opened by fc20567 6
  • Questions about beam search

    Questions about beam search

    Hi guys,

    Two questions related to beam search confused me and I am looking forward to your reply 😊.

    1. Is your beam search the same as T2T's?
    2. Does length_penalty == 1.0 mean no length penalty?

    Thx

    opened by gongel 6
  • Can you provide a Dockerfile for testing the LightSeq training and inference code?

    Can you provide a Dockerfile for testing the LightSeq training and inference code?

    I tried to set up LightSeq on a docker system (RTX 2080 Ti 4-way or A100 2-way) but failed after 8 hours.

    Therefore, please upload a Dockerfile or image for testing the LightSeq system.

    (I tested based on the images nvcr.io/nvidia/pytorch:21.08, 20.12, 20.10, taka23/lightseq, etc., but didn't succeed.)

    opened by pdh930105 6
  • Support for VIT-small (hidden_dim=384)

    Support for VIT-small (hidden_dim=384)

    Hello, thank you for your contribution. I want to replace the encoders in ViT-small with LSHFTransformerEncoderLayer. For each encoder, num_attention_heads = 6 and hidden_dim = 384. However, there is an error saying that hidden_dim must be an integer multiple of 256. Why does LSHFTransformerEncoderLayer have this restriction? Is there any way to use LSHFTransformerEncoderLayer in ViT-small? Correct me if I am wrong. Thanks!

    opened by woskii 6
  • Lightseq model inference for fairseq task after training

    Lightseq model inference for fairseq task after training

    Hi, I could not find any details about LightSeq model inference for a fairseq task after training; did I miss something? I mean, after training the model arch is ls_transformer, but I can't use the native fairseq-generate command for inference, and I don't find something like lightseq-generate. The inference examples I found are HuggingFace models such as BART and GPT-2, and no documents are provided for inference with fairseq models after training. Could someone tell me how to do this?

    opened by dearchill 6
  • Example/Support of converting Fairseq Model to run in LightSeq

    Example/Support of converting Fairseq Model to run in LightSeq

    I am curious about trying LightSeq to speed up inference for a vanilla Transformer encoder-decoder (Vaswani 17) model. My original model was trained with fairseq (or OpenNMT-py). Is there any example or reference that can help me convert my Transformer model to a format compatible with LightSeq?

    opened by pttzty 6
  • [Question]: How to compile lightseq

    [Question]: How to compile lightseq

    I tried to compile lightseq using build.sh, but ran into the following problem:

    lightseq/csrc/proto/bert_weight.cc:451:15: error: ‘class Bert’ has no member named ‘ParseFromIstream’; did you mean ‘ParseFromString’?
         if (!bert.ParseFromIstream(&raw_input)) {
                   ^~~~~~~~~~~~~~~~
                   ParseFromString
    
    lightseq/csrc/proto/bert_crf_weight.cc:38:37: error: no match for ‘operator[]’ (operand types are ‘const google::protobuf::RepeatedPtrField<BertCrfEncoderLayer>’ and ‘int’)
       _inner_size = bert.encoder_stack()[0].ffn_first_kernel_size() / _hidden_size;
    

    Both the master branch and the v3.0.1 tag failed.

    Did I miss something? How can I manage to compile this project?

    opened by FrostML 0
  • Possible memory leak in DecSelfAttentionLayer

    Possible memory leak in DecSelfAttentionLayer

    The constructor creates new objects without shared_ptrs, but the destructor is empty.

    In cpp:

    DecSelfAttentionLayer<T1, T2>::DecSelfAttentionLayer(
        int layer_id, int max_batch_tokens, int max_seq_len, int hidden_size,
        int num_heads, float attn_prob_dropout_ratio,
        float hidden_output_dropout_ratio, bool pre_or_postLayerNorm,
        bool is_post_ln, bool is_continuous_cache)
        : Layer("DecSelfAttentionLayer"),  // necessary
          _layer_id(layer_id),
          _max_batch_tokens(max_batch_tokens),
    
         ..............................
          // operators
          _attn_ln(
              new LayerNormalizeOp<T1, T2>(max_batch_tokens, hidden_size, false)),
    

    In header: virtual ~DecSelfAttentionLayer() {}

    Not sure if this is by design or whether delete calls are missing in the destructor.

    opened by Kangmo 1
  • Question : About construction of total_cache_k, total_cache_v in Transformer

    Question : About construction of total_cache_k, total_cache_v in Transformer

    In lightseq/csrc/models/transformer.cu, should cache_k_out and cache_v_out call set_ancestor? Otherwise, why not remove the unused variables cache_k_out and cache_v_out?

    Transformer::Transformer {
      ...
      for (auto iter : dec_layer_vec) {
        Variable *cache_k = new Variable("cache_k");
        Variable *cache_v = new Variable("cache_v");
        std::tuple<Variable *, Variable *, Variable *> dec_outs =
            (*iter)(dec_emb, total_enc_kv, pad_mask, cache_k, cache_v);
        dec_emb = std::get<0>(dec_outs);
        Variable *cache_k_out = std::get<1>(dec_outs);
        Variable *cache_v_out = std::get<2>(dec_outs);
    
        cache_k->set_ancestor(total_cache_k, cache_size * dec_layer_idx);
        cache_v->set_ancestor(total_cache_v, cache_size * dec_layer_idx);
        dec_layer_idx++;
      }
    

    https://github.com/bytedance/lightseq/blob/2b5592fa658a39a914a5036e665647084d777903/lightseq/csrc/models/transformer.cu#L135

    opened by Kangmo 3
  • LinearOp::forward is getting cublashandle before checking if the context is built.

    LinearOp::forward is getting cublashandle before checking if the context is built.

    LinearOp::forward is getting cublashandle before checking if the context is built.

    Problem: LinearOp::forward gets the cuBLAS handle without checking whether the context is built, while LinearOp::backward checks whether the context is built before getting the cuBLAS handle.

    Solution: modify LinearOp::forward to check whether the context is built before getting the cuBLAS handle.

    opened by Kangmo 0
  • How to ensemble lightseq models? & the memory usage is too big when generating

    How to ensemble lightseq models? & the memory usage is too big when generating

    I ran into the following two problems when using lightseq3.0.

    1. I pass --path model1:model2 to ensemble model1 and model2 for generation just like fairseq-generate:
    lightseq-generate $DATA_PATH \
        --path part_1/checkpoint_4_267500.pt:part_1/checkpoint_4_265000.pt \
        --batch-size 4 --beam 4 --remove-bpe \
        --gen-subset ${name} \
        --source-lang en \
        --target-lang zh \
        --max-len-a 1 \
        --max-len-b 50 \
        --lenpen 0.6 --fp16
    

    but the operation fails in the middle with the following error (the checkpoints are from the same model): (screenshot)

    Could you please suggest an example of ensemble?

    2. When I use lightseq-generate for generation, I found that 10GB of memory is required to load a transformer_big model with lightseq, while only 2GB is required to load the same model with fairseq. I wonder if this is expected?

    This is loading a lightseq transformer_big model: (screenshot)

    This is loading a fairseq transformer_big model: (screenshot)

    Environment

    • Python 3.7
    • pytorch 1.12
    • fairseq 0.10.2
    • lightseq 3.0
    opened by baoguo1995 1
Releases(v2.2.1)
  • v2.2.1(Dec 6, 2022)

    In the hip_dev branch, LightSeq supports both a CUDA backend and a HIP backend (currently training only). The LightSeq transformer has a speedup of about 7% compared with the Fairseq transformer under the HIP backend. LightSeq HIP supports multiple NLP models, such as Transformer, BERT, GPT, etc. Users need no modification to their Python training code. More information about LightSeq HIP can be found here: https://github.com/bytedance/lightseq/blob/hip_dev/README_HIP.md

    Source code(tar.gz)
    Source code(zip)
  • v3.0.1(Nov 2, 2022)

    What's Changed

    • compatible gcq params by @HandH1998 in https://github.com/bytedance/lightseq/pull/409
    • Fix gpu name by @godweiyang in https://github.com/bytedance/lightseq/pull/415

    Full Changelog: https://github.com/bytedance/lightseq/compare/v3.0.0...v3.0.1

    Source code(tar.gz)
    Source code(zip)
  • v3.0.0(Oct 25, 2022)

    It's been a long time since our last release (v2.2.0). For the past year, we have focused on int8 quantization.

    In this release, LightSeq supports int8 quantized training and inference. Compared with PyTorch QAT, LightSeq int8 training has a speedup of 3x without any performance loss. Compared with previous LightSeq fp16 inference, the int8 engine has a speedup of up to 1.7x.

    LightSeq int8 engine supports multiple models, such as Transformer, BERT, GPT, etc. For int8 training, the users only need to apply quantization mode to the model using model.apply(enable_quant). For int8 inference, the users only need to use QuantTransformer instead of fp16 Transformer.

    Other updates include support for models like MoE, bug fixes, performance improvements, etc.

    Source code(tar.gz)
    Source code(zip)
  • v2.2.0(Oct 26, 2021)

    Inference

    Support more multi-language models #209

    Fixes

    Fix inference error on HDF5 #208
    Fix training error when batch_size=1 #192
    Other minor fixes: #205 #202 #193

    Source code(tar.gz)
    Source code(zip)
  • v2.1.3(Aug 19, 2021)

    This version contains several features and bug fixes.

    Training

    relax restriction of layer norm hidden size #137 #161
    support inference during training for transformer #141 #146 #147

    Inference

    Add inference support and examples for BERT #145

    Fixes

    fix save/load for training with pytorch #139
    fix pos embedding index bug #144

    Source code(tar.gz)
    Source code(zip)
  • v2.1.0(Jul 19, 2021)

    This version contains several features and bug fixes.

    Training

    support BertEncoder #116
    support torch amp and apex amp #100

    Inference

    support big models like gpt2-large and bart-large #82

    Fixes

    fix adam bug when param size < 1024 #98
    fix training compiling fail in cuda < 11 #80

    Source code(tar.gz)
    Source code(zip)
  • v2.0.2(Jun 25, 2021)

  • v2.0.1(Jun 24, 2021)

  • v2.0.0(Jun 20, 2021)

    It's been a long time since our last release (v1.2.0). For the past six months, we have focused on training efficiency.

    In this release, LightSeq supports fast training for models in the Transformer family!

    We provide highly optimized custom operators for PyTorch and TensorFlow, which cover the entire training process for Transformer-based models. Users of LightSeq can use these operators to build their own models with efficient computation.

    In addition, we integrate our custom operators into popular training libraries like Fairseq, Hugging Face, and NeurST, which enables a 1.5x-3x end-to-end speedup compared to the native versions.

    With only a small amount of code, you can enjoy the excellent performance provided by LightSeq. Try it now!

    Training

    • support lightseq-train to accelerate fairseq training, including optimized transformer model, adam, and label smoothed loss
    • huggingface bert training example
    • neurst transformer training example for Tensorflow users

    Inference

    • support GPT python wrapper
    • inference APIs are moved to lightseq.inference

    This release has an API change for inference: all inference APIs have moved to lightseq.inference. For example, use import lightseq.inference and model = lightseq.inference.Transformer("$PB_PATH", max_batch_size), as in the sketch below.
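
    For example (a minimal sketch of the new namespace; the weight file path stands in for $PB_PATH):

    import numpy as np
    import lightseq.inference as lsi

    model = lsi.Transformer("transformer.pb", 8)  # exported model path ($PB_PATH) and max batch size
    output = model.infer(np.array([[5001, 2, 36, 5002]]))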

    Source code(tar.gz)
    Source code(zip)
  • v1.2.0(Dec 24, 2020)

  • v1.1.0(Oct 29, 2020)

  • v1.0.0(Dec 6, 2019)

Owner
Bytedance Inc.