PyTorch Implementation of Google Brain's WaveGrad 2: Iterative Refinement for Text-to-Speech Synthesis

Last update: Dec 06, 2022

Overview

WaveGrad2 - PyTorch Implementation

Status (2021.06.22)

Working on

Quickstart

Dependencies

You can install the Python dependencies with

pip3 install -r requirements.txt

Inference

You have to download the pretrained models and put them in output/ckpt/LJSpeech/.

For English single-speaker TTS, run

python3 synthesize.py --text "YOUR_DESIRED_TEXT" --restore_step 900000 --mode single -p config/LJSpeech/preprocess.yaml -m config/LJSpeech/model.yaml -t config/LJSpeech/train.yaml

The generated utterances will be put in output/result/.
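To synthesize several sentences in single mode, the command above can be assembled from Python. The sketch below only builds the argument list from the flags shown in this README and prints it; the helper name `build_synthesize_command` is hypothetical and not part of the repository.

```python
import shlex


def build_synthesize_command(text, restore_step=900000, dataset="LJSpeech"):
    """Return the argv list for one single-mode synthesize.py call.

    Hypothetical helper: it only mirrors the flags shown in this README.
    """
    config_dir = f"config/{dataset}"
    return [
        "python3", "synthesize.py",
        "--text", text,
        "--restore_step", str(restore_step),
        "--mode", "single",
        "-p", f"{config_dir}/preprocess.yaml",
        "-m", f"{config_dir}/model.yaml",
        "-t", f"{config_dir}/train.yaml",
    ]


if __name__ == "__main__":
    for sentence in ["Hello world.", "It's a test."]:
        # shlex.join quotes each argument so the line can be pasted into a shell.
        print(shlex.join(build_synthesize_command(sentence)))
```

Passing the returned list to `subprocess.run` from the repository root would execute the call; printing it first makes it easy to check the paths before running anything.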

Batch Inference

Batch inference is also supported. Try

python3 synthesize.py --source preprocessed_data/LJSpeech/val.txt --restore_step 900000 --mode batch -p config/LJSpeech/preprocess.yaml -m config/LJSpeech/model.yaml -t config/LJSpeech/train.yaml

to synthesize all utterances in preprocessed_data/LJSpeech/val.txt.

Controllability

The speaking rate of the synthesized utterances can be controlled by specifying the desired duration ratio. For example, one can speed up speech by scaling all phoneme durations to 80% of their predicted length with

python3 synthesize.py --text "YOUR_DESIRED_TEXT" --restore_step 900000 --mode single -p config/LJSpeech/preprocess.yaml -m config/LJSpeech/model.yaml -t config/LJSpeech/train.yaml --duration_control 0.8
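As a rough illustration of what a duration ratio does, the sketch below multiplies per-phoneme frame counts by the ratio and rounds to whole frames. This is an assumption about how such a control is typically applied in duration-based TTS models, not a copy of the repository's code; `scale_durations` is a hypothetical name.

```python
def scale_durations(durations, duration_control=1.0):
    """Scale per-phoneme frame counts by a ratio and round to whole frames.

    Hypothetical helper: a ratio below 1.0 shortens every phoneme, which
    makes the utterance faster; a ratio above 1.0 slows it down.
    """
    if duration_control <= 0:
        raise ValueError("duration_control must be positive")
    return [max(0, round(d * duration_control)) for d in durations]


if __name__ == "__main__":
    predicted = [10, 5, 20]  # made-up frame counts for three phonemes
    faster = scale_durations(predicted, 0.8)
    print(sum(predicted), "->", sum(faster), "frames")
```

Because the total number of frames shrinks to about 80% of the original, the same text is spoken in less time.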

Training

Datasets

The supported datasets are

LJSpeech: a single-speaker English dataset consisting of 13,100 short audio clips of a female speaker reading passages from 7 non-fiction books, approximately 24 hours in total.
(more will be added)

Preprocessing

First, run

python3 prepare_align.py config/LJSpeech/preprocess.yaml

for some preparations.

As described in the paper, Montreal Forced Aligner (MFA) is used to obtain the alignments between the utterances and the phoneme sequences. Alignments for the LJSpeech dataset are provided here (thanks to ming024's FastSpeech2). You have to unzip the files into preprocessed_data/LJSpeech/TextGrid/.

After that, run the preprocessing script by

python3 preprocess.py config/LJSpeech/preprocess.yaml
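The alignments give start and end times for each phoneme, which have to be turned into per-phoneme frame counts before training. A minimal sketch of that conversion follows, assuming a 22,050 Hz sampling rate and a hop length of 256 samples; the hop length and the function name are assumptions for illustration, so check config/LJSpeech/preprocess.yaml for the values actually used.

```python
def intervals_to_frame_durations(intervals, sampling_rate=22050, hop_length=256):
    """Convert (phoneme, start_sec, end_sec) intervals to frame counts.

    Hypothetical helper. Rounding the start and end boundaries separately
    (instead of rounding each length) keeps the durations summing to the
    frame index of the final boundary.
    """
    phonemes, durations = [], []
    for phoneme, start, end in intervals:
        start_frame = round(start * sampling_rate / hop_length)
        end_frame = round(end * sampling_rate / hop_length)
        phonemes.append(phoneme)
        durations.append(end_frame - start_frame)
    return phonemes, durations


if __name__ == "__main__":
    # Made-up alignment for illustration only.
    example = [("HH", 0.0, 0.1), ("AH", 0.1, 0.25)]
    print(intervals_to_frame_durations(example))
```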

Alternatively, you can align the corpus yourself. Download the official MFA package and run

./montreal-forced-aligner/bin/mfa_align raw_data/LJSpeech/ lexicon/librispeech-lexicon.txt english preprocessed_data/LJSpeech

or

./montreal-forced-aligner/bin/mfa_train_and_align raw_data/LJSpeech/ lexicon/librispeech-lexicon.txt preprocessed_data/LJSpeech

to align the corpus, and then run the preprocessing script.

python3 preprocess.py config/LJSpeech/preprocess.yaml

Training

Train your model with

python3 train.py -p config/LJSpeech/preprocess.yaml -m config/LJSpeech/model.yaml -t config/LJSpeech/train.yaml

TensorBoard

Use

tensorboard --logdir output/log/LJSpeech

to serve TensorBoard on your localhost.

Implementation Issues

Use a 22,050 Hz sampling rate instead of 24 kHz and follow the general LJSpeech configurations.
Add an nn.ReLU() activation at the end of the duration predictor to keep the predicted durations non-negative.
Follow the Aligner of EATS: End-to-End Adversarial Text-to-Speech for the Gaussian upsampling, rather than that of Non-Attentive Tacotron.
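The last two points can be sketched in plain Python. The code below clamps durations at zero (mirroring the ReLU), places each token centre at the middle of its span, and mixes token features per output frame with softmax weights over the squared distance to each centre, in the style of the EATS aligner. It is an illustration only, not the repository's PyTorch module; the function name and the default `sigma2=10.0` are assumptions.

```python
import math


def gaussian_upsample(token_features, durations, sigma2=10.0):
    """Upsample token features to frames with EATS-style Gaussian weights.

    token_features: list of N feature vectors (lists of floats).
    durations: list of N token lengths in frames (may be fractional).
    sigma2: assumed temperature of the Gaussian kernel.
    """
    # Mirror the ReLU on the duration predictor: clamp negatives to zero.
    durations = [max(0.0, d) for d in durations]
    total_frames = int(round(sum(durations)))
    # Token centre = cumulative end position minus half the token length.
    centers, end = [], 0.0
    for d in durations:
        end += d
        centers.append(end - d / 2.0)
    dim = len(token_features[0])
    frames = []
    for t in range(total_frames):
        # Softmax over tokens of -(t - c_n)^2 / sigma2, computed stably.
        logits = [-((t - c) ** 2) / sigma2 for c in centers]
        peak = max(logits)
        weights = [math.exp(value - peak) for value in logits]
        norm = sum(weights)
        frames.append([
            sum(w / norm * feat[k] for w, feat in zip(weights, token_features))
            for k in range(dim)
        ])
    return frames


if __name__ == "__main__":
    # Two tokens with one-dimensional features, three frames each.
    for frame in gaussian_upsample([[0.0], [1.0]], [3, 3]):
        print(round(frame[0], 3))
```

Each output frame is a smooth blend of neighbouring tokens rather than a hard repeat, so the upsampling stays differentiable with respect to the durations.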

Citation

@misc{lee2021wavegrad2,
  author = {Lee, Keon},
  title = {WaveGrad2},
  year = {2021},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/keonlee9420/WaveGrad2}}
}

Comments

About gaussian up-sampling

Hi @keonlee9420, you mentioned that you prefer EATS upsampling over Non-Attentive Tacotron's upsampling. Is there a specific reason for that? Also, what is the inference speed of WaveGrad 2 on CPU?

opened by rishikksh20

samplingwindow correct, range param masked_fill with small value

Filling the masked range with a non-zero value solved the range problem. (That was in our case; it should be tested for this one.) In the sampling window, pad with the sliced one.

opened by Seungwoo0326

Releases(v1.0.0)

v1.0.0(Aug 3, 2021)

v0.1.0(Jul 17, 2021)

Owner

Keon Lee
