Comprehensive-E2E-TTS - PyTorch Implementation

Last update: Nov 13, 2022

Overview

A non-autoregressive end-to-end text-to-speech model (generating a waveform directly from text) that supports a family of state-of-the-art unsupervised duration modeling methods. This project grows with the research community, aiming to achieve the ultimate E2E-TTS. Any suggestions toward the best end-to-end TTS are welcome :)

Quickstart

In the following sections, DATASET refers to the name of a dataset, such as LJSpeech or VCTK.

Dependencies

You can install the Python dependencies with

pip3 install -r requirements.txt

A Dockerfile is also provided for Docker users.

Inference

Download the pretrained models (to be shared soon) and put them in output/ckpt/DATASET/.

For a single-speaker TTS, run

python3 synthesize.py --text "YOUR_DESIRED_TEXT" --restore_step RESTORE_STEP --mode single --dataset DATASET

For a multi-speaker TTS, run

python3 synthesize.py --text "YOUR_DESIRED_TEXT" --speaker_id SPEAKER_ID --restore_step RESTORE_STEP --mode single --dataset DATASET

The dictionary of learned speakers can be found at preprocessed_data/DATASET/speakers.json, and the generated utterances will be put in output/result/.
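To find a value to pass as --speaker_id, the speaker dictionary can be inspected with a few lines of Python. This is a minimal sketch that assumes speakers.json is a flat JSON object mapping speaker names to integer IDs; the helper names load_speaker_ids and format_speaker_table are hypothetical and not part of this repository, and the actual file layout may differ.

```python
import json
from pathlib import Path


def load_speaker_ids(path):
    """Load a speaker dictionary, assumed to map speaker name -> integer ID."""
    with Path(path).open(encoding="utf-8") as f:
        speakers = json.load(f)
    if not isinstance(speakers, dict):
        raise ValueError("expected a JSON object mapping speaker names to IDs")
    return speakers


def format_speaker_table(speakers):
    """Return one 'ID<TAB>name' line per speaker, sorted by ID."""
    rows = sorted(speakers.items(), key=lambda item: item[1])
    return [f"{speaker_id}\t{name}" for name, speaker_id in rows]
```

For example, print("\n".join(format_speaker_table(load_speaker_ids("preprocessed_data/DATASET/speakers.json")))) would list the available IDs once preprocessing has produced that file.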

Batch Inference

Batch inference is also supported. Run

python3 synthesize.py --source preprocessed_data/DATASET/val.txt --restore_step RESTORE_STEP --mode batch --dataset DATASET

to synthesize all utterances in preprocessed_data/DATASET/val.txt.

Controllability

The pitch, volume, and speaking rate of the synthesized utterances can be controlled by specifying the desired pitch, energy, and duration ratios. For example, one can speed up speech by scaling durations to 80 % and decrease the volume by 20 % with

python3 synthesize.py --text "YOUR_DESIRED_TEXT" --restore_step RESTORE_STEP --mode single --dataset DATASET --duration_control 0.8 --energy_control 0.8

Add --speaker_id SPEAKER_ID for a multi-speaker TTS.
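The arithmetic behind a duration ratio can be sketched as follows. This is an illustration only, assuming the ratio multiplies per-token durations (in frames) before rounding; the function names scale_durations and speaking_rate_change and the example values are hypothetical and are not taken from this repository's model code.

```python
def scale_durations(durations, ratio):
    """Scale per-token durations (in frames) by a control ratio.

    Assumption: the ratio multiplies each duration, and the result is
    rounded to a whole number of frames and never drops below zero.
    """
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return [max(0, round(d * ratio)) for d in durations]


def speaking_rate_change(ratio):
    """Relative speaking-rate change implied by a duration ratio."""
    return 1.0 / ratio - 1.0


durations = [5, 10, 20]  # hypothetical per-token frame counts
scaled = scale_durations(durations, 0.8)
print(scaled)  # [4, 8, 16]
```

Under this assumption, a ratio of 0.8 makes the utterance 20 % shorter, which corresponds to a speaking rate about 25 % higher, since speaking_rate_change(0.8) is approximately 0.25.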

Training

Datasets

The supported datasets are

LJSpeech: a single-speaker English dataset consisting of 13,100 short audio clips of a female speaker reading passages from 7 non-fiction books, approximately 24 hours in total.
VCTK: The CSTR VCTK Corpus includes speech data uttered by 110 English speakers (multi-speaker TTS) with various accents. Each speaker reads out about 400 sentences, which were selected from a newspaper, the rainbow passage and an elicitation paragraph used for the speech accent archive.

Any other single-speaker TTS dataset (e.g., Blizzard Challenge 2013) or multi-speaker TTS dataset (e.g., LibriTTS) can be added by following the LJSpeech or VCTK setup, respectively. Moreover, your own language and dataset can be adapted following here.

Preprocessing

For a multi-speaker TTS with an external speaker embedder, download the ResCNN Softmax+Triplet pretrained model of philipperemy's DeepSpeaker for the speaker embedding and place it in ./deepspeaker/pretrained_models/.

Run the preprocessing script by

python3 preprocess.py --dataset DATASET

Training

Train your model with

python3 train.py --dataset DATASET

Useful options:

The trainer assumes single-node multi-GPU training. To use specific GPUs, specify CUDA_VISIBLE_DEVICES= at the beginning of the above command.
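The same GPU selection can be done from a Python launcher. The sketch below assumes train.py is in the current directory and accepts --dataset as shown above; the helper names build_gpu_env and launch_training, and any GPU IDs passed to them, are hypothetical and chosen only for illustration.

```python
import os
import subprocess
import sys


def build_gpu_env(gpu_ids, base_env=None):
    """Return a copy of the environment exposing only the given GPU IDs."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    return env


def launch_training(dataset, gpu_ids):
    """Run train.py for a dataset with a restricted set of visible GPUs."""
    cmd = [sys.executable, "train.py", "--dataset", dataset]
    return subprocess.run(cmd, env=build_gpu_env(gpu_ids), check=True)
```

Calling launch_training("LJSpeech", [0, 1]) would then start training with only the first two GPUs visible to the process, without modifying the environment of the calling shell.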

TensorBoard

Use

tensorboard --logdir output/log

to serve TensorBoard on your localhost.

Notes

There are two options for speaker embedding in the multi-speaker TTS setting: training a speaker embedder from scratch or using philipperemy's pretrained DeepSpeaker model (as STYLER did). You can toggle between them by setting the config (between 'none' and 'DeepSpeaker').
DeepSpeaker on the VCTK dataset shows clear separation among speakers. The following figure shows the t-SNE plot of the extracted speaker embeddings.

Citation

Please cite this repository using the "Cite this repository" option in the About section (top right of the main page).

Owner

Keon Lee
