Multispeaker & Emotional TTS based on Tacotron 2 and Waveglow

Overview

Multispeaker & Emotional TTS based on Tacotron 2 and Waveglow

Table of Contents

General description

This Repository contains a sample code for Tacotron 2, WaveGlow with multi-speaker, emotion embeddings together with a script for data preprocessing.
Checkpoints and code originate from following sources:

Done:

  • took all the best code parts from all of the 5 sources above
  • clean the code and fixed some of the mistakes
  • change code structure
  • add multi-speaker and emotion embendings
  • add preprocessing
  • move all the configs from command line args into experiment config file under configs/experiments folder
  • add restoring / checkpointing mechanism
  • add tensorboard
  • make decoder work with n > 1 frames per step
  • make training work at FP16

TODO:

  • make it work with pytorch-1.4.0
  • add multi-spot instance training for AWS

Getting Started

The following section lists the requirements in order to start training the Tacotron 2 and WaveGlow models.

Clone the repository:

git clone https://github.com/ide8/tacotron2  
cd tacotron2
PROJDIR=$(pwd)
export PYTHONPATH=$PROJDIR:$PYTHONPATH

Requirements

This repository contains Dockerfile which extends the PyTorch NGC container and encapsulates some dependencies. Aside from these dependencies, ensure you have the following components:

Setup

Build an image from Docker file:

docker build --tag taco .

Run docker container:

docker run --shm-size=8G --runtime=nvidia -v /absolute/path/to/your/code:/app -v /absolute/path/to/your/training_data:/mnt/train -v /absolute/path/to/your/logs:/mnt/logs -v /absolute/path/to/your/raw-data:/mnt/raw-data -v /absolute/path/to/your/pretrained-checkpoint:/mnt/pretrained -detach taco sleep inf

Check container id:

docker ps

Select container id of image with tag taco and log into container with:

docker exec -it container_id bash

Code structure description

Folders tacotron2 and waveglow have scripts for Tacotron 2, WaveGlow models and consist of:

  • /model.py - model architecture
  • /data_function.py - data loading functions
  • /loss_function.py - loss function

Folder common contains common layers for both models (common/layers.py), utils (common/utils.py) and audio processing (common/audio_processing.py and common/stft.py).

Folder router is used by training script to select an appropriate model

In the root directory:

  • train.py - script for model training
  • preprocess.py - performs audio processing and creates training and validation datasets
  • inference.ipynb - notebook for running inference

Folder configs contains __init__.py with all parameters needed for training and data processing. Folder configs/experiments consists of all the experiments. waveglow.py and tacotron2.py are provided as examples for WaveGlow and Tacotron 2. On training or data processing start, parameters are copied from your experiment (in our case - from waveglow.py or from tacotron2.py) to __init__.py, from which they are used by the system.

Data preprocessing

Preparing for data preprocessing

  1. For each speaker you have to have a folder named with speaker name, containing wavs folder and metadata.csv file with the next line format: file_name.wav|text.
  2. All necessary parameters for preprocessing should be set in configs/experiments/waveglow.py or in configs/experiments/tacotron2.py, in the class PreprocessingConfig.
  3. If you're running preprocessing first time, set start_from_preprocessed flag to False. preprocess.py performs trimming of audio files up to PreprocessingConfig.top_db (cuts the silence in the beginning and the end), applies ffmpeg command in order to mono, make same sampling rate and bit rate for all the wavs in dataset.
  4. It saves a folder wavs with processed audio files and data.csv file in PreprocessingConfig.output_directory with the following format: path|text|speaker_name|speaker_id|emotion|text_len|duration.
  5. Trimming and ffmpeg command are applied only to speakers, for which flag process_audio is True. Speakers with flag emotion_present is False, are treated as with emotion neutral-normal.
  6. You won't need start_from_preprocessed = False once you finish running preprocessing script. Only exception in case of new raw data comes in.
  7. Once start_from_preprocessed is set to True, script loads file data.csv (created by the start_from_preprocessed = False run), and forms train.txt and val.txt out from data.csv.
  8. Main PreprocessingConfig parameters:
    1. cpus - defines number of cores for batch generator
    2. sr - defines sample ratio for reading and writing audio
    3. emo_id_map - dictionary for emotion name to emotion_id mapping
    4. data[{'path'}] - is path to folder named with speaker name and containing wavs folder and metadata.csv with the following line format: file_name.wav|text|emotion (optional)
  9. Preprocessing script forms training and validation datasets in the following way:
    1. selects rows with audio duration and text length less or equal those for speaker PreprocessingConfig.limit_by (this step is needed for proper batch size)
    2. if such speaker is not present, than it selects rows within PreprocessingConfig.text_limit and PreprocessingConfig.dur_limit. Lower limit for audio is defined by PreprocessingConfig.minimum_viable_dur
    3. in order to be able to use the same batch size as NVIDIA guys, set PreprocessingConfig.text_limit to linda_jonson
    4. splits dataset randomly by ratio train : val = 0.95 : 0.05
    5. if speaker train set is bigger than PreprocessingConfig.n - samples n rows
    6. saves train.txt and val.txt to PreprocessingConfig.output_directory
    7. saves emotion_coefficients.json and speaker_coefficients.json with coefficients for loss balancing (used by train.py).

Run preprocessing

Since both scripts waveglow.py and tacotron2.py contain the class PreprocessingConfig, training and validation dataset can be produced by running any of them:

python preprocess.py --exp tacotron2

or

python preprocess.py --exp waveglow

Training

Preparing for training

Tacotron 2

In configs/experiment/tacotron2.py, in the class Config set:

  1. training_files and validation_files - paths to train.txt, val.txt;
  2. tacotron_checkpoint - path to pretrained Tacotron 2 if it exist (we were able to restore Waveglow from Nvidia, but Tacotron 2 code was edited to add speakers and emotions, so Tacotron 2 needs to be trained from scratch);
  3. speaker_coefficients - path to speaker_coefficients.json;
  4. emotion_coefficients - path to emotion_coefficients.json;
  5. output_directory - path for writing logs and checkpoints;
  6. use_emotions - flag indicating emotions usage;
  7. use_loss_coefficients - flag indicating loss scaling due to possible data disbalance in terms of both speakers and emotions; for balancing loss, set paths to jsons with coefficients in emotion_coefficients and speaker_coefficients;
  8. model_name - "Tacotron2".
  • Launch training
    • Single gpu:
      python train.py --exp tacotron2
      
    • Multigpu training:
      python -m multiproc train.py --exp tacotron2
      

WaveGlow:

In configs/experiment/waveglow.py, in the class Config set:

  1. training_files and validation_files - paths to train.txt, val.txt;
  2. waveglow_checkpoint - path to pretrained Waveglow, restored from Nvidia. Download checkopoint.
  3. output_directory - path for writing logs and checkpoints;
  4. use_emotions - False;
  5. use_loss_coefficients - False;
  6. model_name - "WaveGlow".
  • Launch training
    • Single gpu:
      python train.py --exp waveglow
      
    • Multigpu training:
      python -m multiproc train.py --exp waveglow
      

Running Tensorboard

Once you made your model start training, you might want to see some progress of training:

docker ps

Select container id of image with tag taco and run:

docker exec -it container_id bash

Start Tensorboard:

 tensorboard --logdir=path_to_folder_with_logs --host=0.0.0.0

Loss is being written into tensorboard:

Tensorboard Scalars

Audio samples together with attention alignments are saved into tensorbaord each Config.epochs_per_checkpoint. Transcripts for audios are listed in Config.phrases

Tensorboard Audio

Inference

Running inference with the inference.ipynb notebook.

Run Jupyter Notebook:

jupyter notebook --ip 0.0.0.0 --port 6006 --no-browser --allow-root

output:

[email protected]:/app# jupyter notebook --ip 0.0.0.0 --port 6006 --no-browser --allow-root
[I 09:31:25.393 NotebookApp] JupyterLab extension loaded from /opt/conda/lib/python3.6/site-packages/jupyterlab
[I 09:31:25.393 NotebookApp] JupyterLab application directory is /opt/conda/share/jupyter/lab
[I 09:31:25.395 NotebookApp] Serving notebooks from local directory: /app
[I 09:31:25.395 NotebookApp] The Jupyter Notebook is running at:
[I 09:31:25.395 NotebookApp] http://(04096a19c266 or 127.0.0.1):6006/?token=bbd413aef225c1394be3b9de144242075e651bea937eecce
[I 09:31:25.395 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 09:31:25.398 NotebookApp] 
    
    To access the notebook, open this file in a browser:
        file:///root/.local/share/jupyter/runtime/nbserver-15398-open.html
    Or copy and paste one of these URLs:
        http://(04096a19c266 or 127.0.0.1):6006/?token=bbd413aef225c1394be3b9de144242075e651bea937eecce

Select adress with 127.0.0.1 and put it in the browser. In this case: http://127.0.0.1:6006/?token=bbd413aef225c1394be3b9de144242075e651bea937eecce

This script takes text as input and runs Tacotron 2 and then WaveGlow inference to produce an audio file. It requires pre-trained checkpoints from Tacotron 2 and WaveGlow models, input text, speaker_id and emotion_id.

Change paths to checkpoints of pretrained Tacotron 2 and WaveGlow in the cell [2] of the inference.ipynb.
Write a text to be displayed in the cell [7] of the inference.ipynb.

Parameters

In this section, we list the most important hyperparameters, together with their default values that are used to train Tacotron 2 and WaveGlow models.

Shared parameters

  • epochs - number of epochs (Tacotron 2: 1501, WaveGlow: 1001)
  • learning-rate - learning rate (Tacotron 2: 1e-3, WaveGlow: 1e-4)
  • batch-size - batch size (Tacotron 2: 64, WaveGlow: 11)
  • grad_clip_thresh - gradient clipping treshold (0.1)

Shared audio/STFT parameters

  • sampling-rate - sampling rate in Hz of input and output audio (22050)
  • filter-length - (1024)
  • hop-length - hop length for FFT, i.e., sample stride between consecutive FFTs (256)
  • win-length - window size for FFT (1024)
  • mel-fmin - lowest frequency in Hz (0.0)
  • mel-fmax - highest frequency in Hz (8.000)

Tacotron parameters

  • anneal-steps - epochs at which to anneal the learning rate (500/ 1000/ 1500)
  • anneal-factor - factor by which to anneal the learning rate (0.1) These two parameters are used to change learning rate at the points defined in anneal-steps according to:
    learning_rate = learning_rate * ( anneal_factor ** p),
    where p = 0 at the first step and increments by 1 each step.

WaveGlow parameters

  • segment-length - segment length of input audio processed by the neural network (8000). Before passing to input, audio is padded or croped to segment-length.
  • wn_config - dictionary with parameters of affine coupling layers. Contains n_layers, n_chanels, kernel_size.

Contributing

If you've ever wanted to contribute to open source, and a great cause, now is your chance!

See the contributing docs for more information

Owner
Ivan Didur
CTO at data root labs
Ivan Didur
An implementation of model parallel GPT-2 and GPT-3-style models using the mesh-tensorflow library.

GPT Neo 🎉 1T or bust my dudes 🎉 An implementation of model & data parallel GPT3-like models using the mesh-tensorflow library. If you're just here t

EleutherAI 6.7k Dec 28, 2022
Neural-Machine-Translation - Implementation of revolutionary machine translation models

Neural Machine Translation Framework: PyTorch Repository contaning my implementa

Utkarsh Jain 1 Feb 17, 2022
Code associated with the Don't Stop Pretraining ACL 2020 paper

dont-stop-pretraining Code associated with the Don't Stop Pretraining ACL 2020 paper Citation @inproceedings{dontstoppretraining2020, author = {Suchi

AI2 449 Jan 04, 2023
An end to end ASR Transformer model training repo

END TO END ASR TRANSFORMER 本项目基于transformer 6*encoder+6*decoder的基本结构构造的端到端的语音识别系统 Model Instructions 1.数据准备: 自行下载数据,遵循文件结构如下: ├── data │ ├── train │

旷视天元 MegEngine 10 Jul 19, 2022
ACL22 paper: Imputing Out-of-Vocabulary Embeddings with LOVE Makes Language Models Robust with Little Cost

Imputing Out-of-Vocabulary Embeddings with LOVE Makes Language Models Robust with Little Cost LOVE is accpeted by ACL22 main conference as a long pape

Lihu Chen 32 Jan 03, 2023
Code for CodeT5: a new code-aware pre-trained encoder-decoder model.

CodeT5: Identifier-aware Unified Pre-trained Encoder-Decoder Models for Code Understanding and Generation This is the official PyTorch implementation

Salesforce 564 Jan 08, 2023
A calibre plugin that generates Word Wise and X-Ray files then sends them to Kindle. Supports KFX, AZW3 and MOBI eBooks. X-Ray supports 18 languages.

WordDumb A calibre plugin that generates Word Wise and X-Ray files then sends them to Kindle. Supports KFX, AZW3 and MOBI eBooks. Languages X-Ray supp

172 Dec 29, 2022
Extracting Summary Knowledge Graphs from Long Documents

GraphSum This repo contains the data and code for the G2G model in the paper: Extracting Summary Knowledge Graphs from Long Documents. The other basel

Zeqiu (Ellen) Wu 10 Oct 21, 2022
Toward a Visual Concept Vocabulary for GAN Latent Space, ICCV 2021

Toward a Visual Concept Vocabulary for GAN Latent Space Code and data from the ICCV 2021 paper Sarah Schwettmann, Evan Hernandez, David Bau, Samuel Kl

Sarah Schwettmann 13 Dec 23, 2022
The PyTorch based implementation of continuous integrate-and-fire (CIF) module.

CIF-PyTorch This is a PyTorch based implementation of continuous integrate-and-fire (CIF) module for end-to-end (E2E) automatic speech recognition (AS

Minglun Han 24 Dec 29, 2022
Translators - is a library which aims to bring free, multiple, enjoyable translation to individuals and students in Python

Translators - is a library which aims to bring free, multiple, enjoyable translation to individuals and students in Python

UlionTse 907 Dec 27, 2022
Diaformer: Automatic Diagnosis via Symptoms Sequence Generation

Diaformer Diaformer: Automatic Diagnosis via Symptoms Sequence Generation (AAAI 2022) Diaformer is an efficient model for automatic diagnosis via symp

Junying Chen 20 Dec 13, 2022
FewCLUE: 为中文NLP定制的小样本学习测评基准

FewCLUE: 为中文NLP定制的小样本学习测评基准

CLUE benchmark 387 Jan 04, 2023
Official code of our work, Unified Pre-training for Program Understanding and Generation [NAACL 2021].

PLBART Code pre-release of our work, Unified Pre-training for Program Understanding and Generation accepted at NAACL 2021. Note. A detailed documentat

Wasi Ahmad 138 Dec 30, 2022
내부 작업용 django + vue(vuetify) boilerplate. 짠 하면 돌아감.

Pocket Galaxy 아주 간단한 개인용, 혹은 내부용 툴을 만들어야하는데 이왕이면 웹이 편하죠? 그럴때를 위해 만들어둔 django와 vue(vuetify)로 이뤄진 boilerplate 입니다. 각 폴더에 있는 설명서대로 실행을 시키면 일단 당장 뭔가가 돌아갑니

Jamie J. Seol 16 Dec 03, 2021
Simple text to phones converter for multiple languages

Phonemizer -- foʊnmaɪzɚ The phonemizer allows simple phonemization of words and texts in many languages. Provides both the phonemize command-line tool

CoML 762 Dec 29, 2022
Beyond Accuracy: Behavioral Testing of NLP models with CheckList

CheckList This repository contains code for testing NLP Models as described in the following paper: Beyond Accuracy: Behavioral Testing of NLP models

Marco Tulio Correia Ribeiro 1.8k Dec 28, 2022
Predicting the usefulness of reviews given the review text and metadata surrounding the reviews.

Predicting Yelp Review Quality Table of Contents Introduction Motivation Goal and Central Questions The Data Data Storage and ETL EDA Data Pipeline Da

Jeff Johannsen 3 Nov 27, 2022
Script to generate VAD dataset used in Asteroid recipe

About the dataset LibriVAD is an open source dataset for voice activity detection in noisy environments. It is derived from LibriSpeech signals (clean

11 Sep 15, 2022
This project uses unsupervised machine learning to identify correlations between daily inoculation rates in the USA and twitter sentiment in regards to COVID-19.

Twitter COVID-19 Sentiment Analysis Members: Christopher Bach | Khalid Hamid Fallous | Jay Hirpara | Jing Tang | Graham Thomas | David Wetherhold Pro

4 Oct 15, 2022