Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition

SEW (Squeezed and Efficient Wav2vec)

License: MIT

The repo contains the code of the paper "Performance-Efficiency Trade-offs in Unsupervised Pre-training for Speech Recognition" by Felix Wu, Kwangyoun Kim, Jing Pan, Kyu Han, Kilian Q Weinberger, and Yoav Artzi.

Model Checkpoints

Pre-trained without supervision on LibriSpeech 960h

Model              Pre-training updates   Dataset            Checkpoint
W2V2-tiny          100K                   LibriSpeech 960h   download
W2V2-small         100K                   LibriSpeech 960h   download
W2V2-mid           100K                   LibriSpeech 960h   download
W2V2-base          100K                   LibriSpeech 960h   download
SEW-tiny           100K                   LibriSpeech 960h   download
SEW-small          100K                   LibriSpeech 960h   download
SEW-mid            100K                   LibriSpeech 960h   download
SEW-D-tiny         100K                   LibriSpeech 960h   download
SEW-D-small        100K                   LibriSpeech 960h   download
SEW-D-mid          100K                   LibriSpeech 960h   download
SEW-D-mid (k127)   100K                   LibriSpeech 960h   download
SEW-D-base         100K                   LibriSpeech 960h   download
SEW-D-base+        100K                   LibriSpeech 960h   download
SEW-D-mid          400K                   LibriSpeech 960h   download
SEW-D-mid (k127)   400K                   LibriSpeech 960h   download
SEW-D-base+        400K                   LibriSpeech 960h   download

ASR models fine-tuned on LibriSpeech train-clean 100h

Model              Pre-training updates   Fine-tuning split   Checkpoint
SEW-tiny           100K                   100h                download
SEW-D-tiny         100K                   100h                download
SEW-D-mid          400K                   100h                download
SEW-D-mid (k127)   400K                   100h                download
SEW-D-base+        400K                   100h                download

Usage

Dependencies

The code is tested with fairseq commit 05255f9, deberta commit bf17ca4 and the following packages.

torch==1.8.0
torchaudio==0.8.0
tqdm==4.49.0
Hydra==2.5
hydra-core==1.0.4
fvcore==0.1.5.post20210330
omegaconf==2.0.5
einops==0.3.0
fire==0.2.1
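
To check whether the current environment matches the pins above, something like the following sketch could be used. It relies only on the standard-library importlib.metadata module (Python 3.8 or newer). The helper names parse_pins and check_pins are illustrative and not part of this repo, and ignoring local version suffixes such as +cu111 is an assumption about how strict the comparison should be.

```python
from importlib import metadata

# Pins copied from the package list above.
PINNED = """
torch==1.8.0
torchaudio==0.8.0
tqdm==4.49.0
Hydra==2.5
hydra-core==1.0.4
fvcore==0.1.5.post20210330
omegaconf==2.0.5
einops==0.3.0
fire==0.2.1
"""


def parse_pins(text):
    """Turn 'name==version' lines into a {name: version} dict."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        if line:
            name, _, version = line.partition("==")
            pins[name] = version
    return pins


def check_pins(pins):
    """Return {name: (wanted, installed_or_None, matches)} for each pin."""
    report = {}
    for name, wanted in pins.items():
        try:
            # Assumption: drop a local suffix such as '+cu111' before comparing.
            installed = metadata.version(name).split("+")[0]
        except metadata.PackageNotFoundError:
            installed = None
        report[name] = (wanted, installed, installed == wanted)
    return report


if __name__ == "__main__":
    for name, (wanted, installed, ok) in check_pins(parse_pins(PINNED)).items():
        status = "OK" if ok else "MISMATCH"
        print(f"{name}: wanted {wanted}, installed {installed} [{status}]")
```

The fairseq and deberta commits are not covered by this check, since they are pinned by commit rather than by package version.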

Apex

Please install NVIDIA's apex with

git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" \
  --global-option="--deprecated_fused_adam" --global-option="--xentropy" \
  --global-option="--fast_multihead_attn" ./

wav2letter decoder

Currently, we decode with the wav2letter v0.2 Python binding at commit 96f5f9d. Please install the Python binding from https://github.com/flashlight/wav2letter/tree/96f5f9d3b41e01af0a031ee0d2604acd9ef3b1b0/bindings/python (the newest commit, d5a93f0, in the v0.2 branch leads to worse WER for the wav2vec 2.0 baselines).

Installation

git clone https://github.com/asappresearch/sew.git
cd sew 
pip install -e .

Pre-training

Pre-training SEW models

Run the following command, where $model_size can be tiny, small, or mid, and $ngpu is the number of GPUs you want to use.

bash scripts/pt-sew.sh $model_size $ngpu

Pre-training SEW-D models

bash scripts/pt-sew-d.sh $model_size $ngpu

where $model_size can be tiny, small, mid, mid-k127, base, or base+.
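
The two pre-training scripts accept different sets of model sizes, so a small wrapper can catch an invalid combination before a job is launched. The sketch below only builds the argument list for the commands shown above. The function name build_pretrain_command and the family labels "sew" and "sew-d" are hypothetical and not part of this repo.

```python
# Script paths and allowed sizes as listed in the pre-training instructions.
SCRIPTS = {"sew": "scripts/pt-sew.sh", "sew-d": "scripts/pt-sew-d.sh"}
SIZES = {
    "sew": ("tiny", "small", "mid"),
    "sew-d": ("tiny", "small", "mid", "mid-k127", "base", "base+"),
}


def build_pretrain_command(family, model_size, ngpu):
    """Return the argv list for a pre-training run, or raise ValueError."""
    if family not in SCRIPTS:
        raise ValueError(f"unknown family {family!r}; expected one of {sorted(SCRIPTS)}")
    if model_size not in SIZES[family]:
        raise ValueError(f"{family} does not list model size {model_size!r}")
    if ngpu < 1:
        raise ValueError("ngpu must be at least 1")
    return ["bash", SCRIPTS[family], model_size, str(ngpu)]


if __name__ == "__main__":
    print(" ".join(build_pretrain_command("sew-d", "mid-k127", 8)))
```

The returned list can be passed to subprocess.run from the repo root, which avoids shell quoting issues with sizes such as base+.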

Fine-tuning

Run the following script to fine-tune a model with the hyperparameters from wav2vec 2.0.

bash scripts/ft-model.sh $pre_trained_model $split $ngpu

where $pre_trained_model can be a W2V2, SEW, or SEW-D model checkpoint and $split can be 10m, 1h, 10h, or 100h.

We also provide a set of hyperparameters that sets all dropouts to the same values as in the pre-training stage; we found it to be more stable.

bash scripts/ft-model-stable.sh $pre_trained_model $split $ngpu

If you see an out-of-GPU-memory error, please scale down dataset.max_tokens and scale up optimization.update_freq in scripts/ft-model.sh. For example, change these lines

  dataset.max_tokens=3200000 \
  optimization.update_freq="[$((8 / $ngpu))]" \

to

  dataset.max_tokens=1600000 \
  optimization.update_freq="[$((16 / $ngpu))]" \

which reduces the batch size and increases the gradient accumulation steps in order to use less GPU memory.
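
The arithmetic behind this change can be sketched as follows. The helper tokens_per_update is hypothetical; it assumes that dataset.max_tokens is a per-GPU cap on each batch and that update_freq is computed with shell integer division, as in $((8 / $ngpu)).

```python
def tokens_per_update(max_tokens, total_accumulation, ngpu):
    """Upper bound on tokens consumed per optimizer update.

    Assumes max_tokens is a per-GPU cap and that update_freq uses
    integer division, mirroring $((total_accumulation / $ngpu)).
    """
    update_freq = total_accumulation // ngpu
    return max_tokens * update_freq * ngpu


if __name__ == "__main__":
    for ngpu in (1, 2, 4, 8):
        before = tokens_per_update(3_200_000, 8, ngpu)
        after = tokens_per_update(1_600_000, 16, ngpu)
        print(ngpu, before, after, before == after)
```

Under these assumptions, halving max_tokens while doubling the accumulation steps leaves the bound per update unchanged whenever the number of GPUs divides 8. For other GPU counts the integer division truncates, so the two settings may differ.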

Evaluation

  1. Please run this script to prepare the official LibriSpeech 4-gram language model.
bash scripts/prepare_librispeech_lm.sh $kenlm_build_bin

where $kenlm_build_bin is the folder that contains the KenLM build_binary executable file (e.g. /home/user/kenlm/build/bin).

  2. Then run this script to evaluate a pre-trained ASR model
python tools/eval_w2v.py tunelm --subsets '["dev-clean", "dev-other", "test-clean", "test-other"]' --model $asr_checkpoint
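
When the evaluation is driven from another Python program, quoting the --subsets list by hand is easy to get wrong. A minimal sketch, assuming the script accepts the list as a JSON-style literal exactly as in the command above: build_eval_command is a hypothetical helper, and checkpoint.pt is a placeholder path.

```python
import json
import shlex


def build_eval_command(asr_checkpoint, subsets):
    """Return the argv list for the evaluation command shown above."""
    return [
        "python", "tools/eval_w2v.py", "tunelm",
        "--subsets", json.dumps(subsets),  # list rendered as a JSON literal
        "--model", asr_checkpoint,
    ]


if __name__ == "__main__":
    # 'checkpoint.pt' is a placeholder, not a file shipped with the repo.
    cmd = build_eval_command("checkpoint.pt", ["dev-clean", "dev-other"])
    print(shlex.join(cmd))  # shell-safe rendering (Python 3.8+)
```

Passing the list form to subprocess.run keeps the JSON literal as a single argument, so no extra shell quoting is needed.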

Comments
  • 8000 sample rate audio

    Hello there,

    I'm trying to train on an audio dataset with an 8000 Hz sample rate. Is it enough to simply add task.sample_rate=8000 to the fairseq command, or are there additional config changes that I should make?

    I would much appreciate any advice

    Thank you

    opened by Mega4alik 0
  • How to train using not English Languages

    Hi! Thank you for the awesome model!

    We are very interested in your project and are trying to use SEW for Japanese. When we train the model, should we use these scripts? Thanks! https://github.com/asappresearch/sew/tree/master/scripts

    opened by jigenji 1
  • :bug: Fix padding mask calculation

    This PR updates the padding mask calculation to be the same as the one in the reference Wav2Vec2 implementation (same commit as listed in SEW's README): https://github.com/pytorch/fairseq/blob/05255f96410e5b1eaf3bf59b767d5b4b7e2c3a35/fairseq/models/wav2vec/wav2vec2.py#L477

    For more details on how and why it was fixed in fairseq, check out this PR by @patrickvonplaten https://github.com/pytorch/fairseq/pull/3228

    opened by anton-l 0
Releases: v0.0.1
Owner: ASAPP Research (AI for Enterprise)