Espresso: A Fast End-to-End Neural Speech Recognition Toolkit

Last update: Jan 03, 2023

Overview

Espresso

Espresso is an open-source, modular, extensible end-to-end neural automatic speech recognition (ASR) toolkit based on the deep learning library PyTorch and the popular neural machine translation toolkit fairseq. Espresso supports distributed training across GPUs and computing nodes, and features various decoding approaches commonly employed in ASR, including look-ahead word-based language model fusion, for which a fast, parallelized decoder is implemented.

We provide state-of-the-art training recipes for the following speech datasets:

What's New:

April 2021: On-the-fly feature extraction from raw waveforms with torchaudio is supported. A LibriSpeech recipe has been released that has no dependency on Kaldi and uses YAML files (via Hydra) to configure experiments.
June 2020: Transformer recipes released.
April 2020: Both E2E LF-MMI (using PyChain) and Cross-Entropy training for hybrid ASR are now supported. WSJ recipes are provided as examples of each.
March 2020: SpecAugment is supported and relevant recipes are released.
September 2019: We are working to isolate Espresso from fairseq, so that it becomes a standalone package that can be installed directly with pip.

Requirements and Installation

PyTorch version >= 1.5.0
Python version >= 3.6
For training new models, you'll also need an NVIDIA GPU and NCCL
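
As a quick way to check the requirements above before installing, the following is a minimal sketch rather than part of Espresso itself. The helper names `parse_version` and `check_environment` are hypothetical; the minimum versions are the ones listed above, and the GPU and NCCL checks rely on PyTorch's own `torch.cuda.is_available()` and `torch.distributed.is_nccl_available()`.

```python
import re
import sys

# Minimum versions taken from the requirements list above.
MIN_PYTHON = (3, 6)
MIN_TORCH = (1, 5, 0)


def parse_version(text):
    """Extract the leading numeric components, e.g. '1.13.1+cu117' -> (1, 13, 1)."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", text)
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    # A missing patch component is treated as 0.
    return tuple(int(part) for part in match.groups(default="0"))


def check_environment():
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    if sys.version_info[:2] < MIN_PYTHON:
        problems.append("Python >= 3.6 is required")
    try:
        import torch
        import torch.distributed as dist
    except ImportError:
        problems.append("PyTorch is not installed")
        return problems
    if parse_version(torch.__version__) < MIN_TORCH:
        problems.append("PyTorch >= 1.5.0 is required")
    # The GPU and NCCL are only needed for training new models.
    if not torch.cuda.is_available():
        problems.append("no CUDA-capable GPU is visible (needed for training)")
    elif not (dist.is_available() and dist.is_nccl_available()):
        problems.append("NCCL backend is not available (needed for training)")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print("WARNING:", problem)
```

An empty list means the sketch found nothing wrong; it does not guarantee that every recipe will run.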
To install Espresso from source and develop locally:

git clone https://github.com/freewym/espresso
cd espresso
pip install --editable .

# on macOS:
# CFLAGS="-stdlib=libc++" pip install --editable ./
pip install kaldi_io sentencepiece soundfile
cd espresso/tools; make KALDI=<path/to/a/compiled/kaldi/directory>

Add your Python path to the PATH variable in examples/asr_<dataset>/path.sh. The current default is ~/anaconda3/bin.

kaldi_io is required for reading Kaldi scp files. sentencepiece is required for training and encoding subword pieces. soundfile is required for reading raw waveform files. Kaldi is required for data preparation, feature extraction, scoring for some datasets (e.g., Switchboard), and decoding for all hybrid systems.
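
To see which of these Python packages are still missing, a small sketch like the one below may help. It is an illustration only: the function name `missing_packages` and the `OPTIONAL_PACKAGES` table are hypothetical, and the purposes are copied from the paragraph above. It uses the standard-library `importlib.util.find_spec`, so nothing is actually imported.

```python
import importlib.util

# Package name -> what the paragraph above says it is needed for.
OPTIONAL_PACKAGES = {
    "kaldi_io": "reading Kaldi scp files",
    "sentencepiece": "training and encoding subword pieces",
    "soundfile": "reading raw waveform files",
}


def missing_packages(packages=OPTIONAL_PACKAGES):
    """Return {name: purpose} for every package that cannot be located."""
    return {
        name: purpose
        for name, purpose in packages.items()
        if importlib.util.find_spec(name) is None
    }


if __name__ == "__main__":
    for name, purpose in missing_packages().items():
        print(f"missing {name} (needed for {purpose}); try: pip install {name}")
```

Kaldi itself is not a Python package, so it is outside the scope of this check.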

If you want to use PyChain for LF-MMI training, you also need to install PyChain (and OpenFst):

Edit the PYTHON_DIR variable in espresso/tools/Makefile (default: ~/anaconda3/bin), and then run:

cd espresso/tools; make openfst pychain

For faster training, install NVIDIA's apex library:

git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" \
  --global-option="--deprecated_fused_adam" --global-option="--xentropy" \
  --global-option="--fast_multihead_attn" ./

License

Espresso is MIT-licensed.

Citation

Please cite Espresso as:

@inproceedings{wang2019espresso,
  title = {Espresso: A Fast End-to-end Neural Speech Recognition Toolkit},
  author = {Yiming Wang and Tongfei Chen and Hainan Xu 
            and Shuoyang Ding and Hang Lv and Yiwen Shao 
            and Nanyun Peng and Lei Xie and Shinji Watanabe 
            and Sanjeev Khudanpur},
  booktitle = {2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)},
  year = {2019},
}


Owner

Yiming Wang
