Unofficial PyTorch Implementation of UnivNet: A Neural Vocoder with Multi-Resolution Spectrogram Discriminators for High-Fidelity Waveform Generation

Overview

UnivNet

This is an unofficial PyTorch implementation of UnivNet by Jang et al. (Kakao).

To-Do List

  • Release checkpoint of pre-trained model
  • Extract wav samples for audio sample page
  • Add results including validation loss graph

Key Features

  • According to the paper's authors, UnivNet obtained the best objective results among recent GAN-based neural vocoders (including HiFi-GAN) and outperformed HiFi-GAN in a subjective evaluation. Its inference is also reported to be 1.5 times faster than HiFi-GAN's.

  • This repository uses the same mel-spectrogram function as the official HiFi-GAN, which is compatible with NVIDIA/tacotron2.

  • Our default mel calculation hyperparameters are listed below and follow the original paper.

    audio:
      n_mel_channels: 100
      filter_length: 1024
      hop_length: 256 # WARNING: this can't be changed.
      win_length: 1024
      sampling_rate: 24000
      mel_fmin: 0.0
      mel_fmax: 12000.0

    You can modify the other hyperparameters to match your acoustic model; hop_length must stay at 256, as noted in the warning above.
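If you do edit the audio hyperparameters, a quick sanity check can catch inconsistent values before training starts. The sketch below is not part of this repository: `check_audio_config` and `frames_per_second` are hypothetical helpers, and the dictionary simply mirrors the default values listed above.

```python
# Default values copied from the audio section above.
audio = {
    "n_mel_channels": 100,
    "filter_length": 1024,
    "hop_length": 256,
    "win_length": 1024,
    "sampling_rate": 24000,
    "mel_fmin": 0.0,
    "mel_fmax": 12000.0,
}


def check_audio_config(cfg):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    nyquist = cfg["sampling_rate"] / 2
    # Frequencies above the Nyquist limit cannot be represented.
    if cfg["mel_fmax"] > nyquist:
        problems.append("mel_fmax is above sampling_rate / 2")
    if not 0 <= cfg["mel_fmin"] < cfg["mel_fmax"]:
        problems.append("mel_fmin must satisfy 0 <= mel_fmin < mel_fmax")
    # The analysis window cannot be longer than the FFT size.
    if cfg["win_length"] > cfg["filter_length"]:
        problems.append("win_length is larger than filter_length")
    return problems


def frames_per_second(cfg):
    """Number of mel frames produced per second of audio."""
    return cfg["sampling_rate"] / cfg["hop_length"]


print(check_audio_config(audio))  # []
print(frames_per_second(audio))   # 93.75
```

With the defaults, mel_fmax sits exactly at the Nyquist frequency (12,000 Hz for 24,000 Hz audio), so lowering sampling_rate without also lowering mel_fmax would fail the first check.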

Prerequisites

This implementation requires the following dependencies.

  1. Python 3.6
  2. PyTorch 1.6.0
  3. NumPy 1.17.4 and SciPy 1.5.4
  4. Install other dependencies in requirements.txt.
    pip install -r requirements.txt

Datasets

Preparing Data

  • Download the training dataset. Any collection of wav files with a sampling rate of 24,000 Hz will work. The original paper used LibriTTS.
    • LibriTTS train-clean-360 split tar.gz link
    • Unzip and place its contents under datasets/LibriTTS/train-clean-360.
  • If you want to use wav files with a different sampling rate, please edit the configuration file (see below).

Note: The mel-spectrogram of each audio file is computed once, saved to disk as **.mel, and loaded from disk on later accesses.
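The compute-once, load-afterwards behavior described in the note can be pictured with a minimal sketch. This is not the repository's code: `load_or_compute_mel` and its `compute_mel` argument are hypothetical, `pickle` stands in for whatever serialization the repository actually uses, and placing the `.mel` file next to the wav file is an assumption.

```python
import pickle
from pathlib import Path


def load_or_compute_mel(wav_path, compute_mel):
    """Return the mel for wav_path, computing it only on the first call.

    compute_mel is a hypothetical callable supplied by the caller; it takes
    a wav path and returns the mel-spectrogram in any picklable form.
    """
    mel_path = Path(wav_path).with_suffix(".mel")  # assumed naming scheme
    if mel_path.exists():
        # Cache hit: skip the (comparatively slow) spectrogram computation.
        with mel_path.open("rb") as f:
            return pickle.load(f)

    # Cache miss: compute once and store the result for later epochs.
    mel = compute_mel(wav_path)
    with mel_path.open("wb") as f:
        pickle.dump(mel, f)
    return mel
```

Under a caching scheme like this, files written with one set of mel hyperparameters are not recomputed automatically, so it is worth deleting stale cached files after changing the audio configuration.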

Preparing Metadata

Following the format from NVIDIA/tacotron2, the metadata should be formatted as:

path_to_wav|transcript|speaker_id
path_to_wav|transcript|speaker_id
...

Train/validation metadata for the LibriTTS train-clean-360 split is already prepared in datasets/metadata. 5% of the train-clean-360 utterances were randomly sampled for validation.

Since this model is a vocoder, the transcripts are NOT used during training.
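If you need metadata for your own dataset, the pipe-separated format and the 5% validation split can be reproduced roughly as follows. This is a sketch under stated assumptions, not the script used for the provided files: `Utterance`, `parse_metadata`, and `split_train_val` are hypothetical names, and the random seed is arbitrary.

```python
import random
from typing import List, NamedTuple


class Utterance(NamedTuple):
    wav_path: str
    transcript: str
    speaker_id: str


def parse_metadata(lines) -> List[Utterance]:
    """Parse 'path_to_wav|transcript|speaker_id' lines, skipping blank ones."""
    entries = []
    for line_no, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line:
            continue
        fields = line.split("|")
        if len(fields) != 3:
            raise ValueError(
                f"line {line_no}: expected 3 fields, got {len(fields)}"
            )
        entries.append(Utterance(*fields))
    return entries


def split_train_val(entries, val_fraction=0.05, seed=0):
    """Randomly hold out a fraction of the entries for validation."""
    shuffled = list(entries)
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, round(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]


# Usage with made-up paths and speaker IDs:
sample = [
    "speaker_a/utt_0001.wav|Hello there.|a",
    "speaker_b/utt_0002.wav|Good morning.|b",
]
for utt in parse_metadata(sample):
    print(utt.wav_path, utt.speaker_id)
```

The parser assumes transcripts contain no `|` character; a line with a stray pipe raises an error rather than being silently misread.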

Train

Preparing Configuration Files

  • Run cp config/default.yaml config/config.yaml and then edit config.yaml

  • Set the root paths of the train and validation data in the data section. The data loader recursively parses the list of files under each path.

    data:
      train_dir: 'datasets/'	# root path of train data (a relative or absolute path is OK)
      train_meta: 'metadata/libritts_train_clean_360_train.txt'	# relative path of metadata file from train_dir
      val_dir: 'datasets/'		# root path of validation data
      val_meta: 'metadata/libritts_train_clean_360_val.txt'		# relative path of metadata file from val_dir

    We provide the default metadata for LibriTTS train-clean-360 split.

  • Modify channel_size in gen to switch between UnivNet-c16 and c32.

    gen:
      noise_dim: 64
      channel_size: 32 # 32 or 16
      dilations: [1, 3, 9, 27]
      strides: [8, 8, 4]
      lReLU_slope: 0.2
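The strides in the gen section also suggest why hop_length is fixed at 256 in the audio section. Assuming each stride is the upsampling factor of one generator stage (an interpretation, not something stated in this README), the generator turns one mel frame into 8 × 8 × 4 = 256 waveform samples, which must equal the hop length used to compute the mel-spectrogram. The helper name below is hypothetical.

```python
from functools import reduce
from operator import mul

# Values copied from the default configuration above.
strides = [8, 8, 4]
hop_length = 256


def upsampling_factor(strides):
    """Total upsampling factor, assuming the strides multiply across stages."""
    return reduce(mul, strides, 1)


factor = upsampling_factor(strides)
print(factor)  # 256

# One mel frame should expand to exactly hop_length waveform samples.
if factor != hop_length:
    raise ValueError(
        f"product of strides ({factor}) does not match hop_length ({hop_length})"
    )
```

Under that assumption, changing hop_length would also require changing strides so that their product matches.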

Training

python trainer.py -c CONFIG_YAML_FILE -n NAME_OF_THE_RUN

Tensorboard

tensorboard --logdir logs/

If you are running TensorBoard on a remote machine, add the --bind_all option so that the TensorBoard page can be opened from other machines.

Inference

python inference.py -p CHECKPOINT_PATH -i INPUT_MEL_PATH

Pre-trained Model

A pre-trained model will be released soon. The model was trained on LibriTTS train-clean-360 split.

Results

See audio samples at https://mindslab-ai.github.io/univnet/

Comparison with the results in the paper

Model                            MOS         PESQ(↑)   RMSE(↓)
Recordings                       4.16±0.09   4.50      0.000
Results in Paper (UnivNet-c32)   3.93±0.09   3.70      0.316
Ours (UnivNet-c32)               -           TBD       TBD

Note

This code is an unofficial implementation, so there may be some differences from the original paper.

  • Our UnivNet generator has fewer parameters (c32: 5.11M, c16: 1.42M) than the one in the paper (c32: 14.89M, c16: 4.00M). So far, we have not encountered any issues from using the smaller model. If you run into any problem, please report it as an issue.
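To check parameter counts such as those quoted above, the usual PyTorch idiom is to sum `numel()` over `model.parameters()`. The sketch below wraps that idiom in a hypothetical `count_parameters` helper; the small stand-in classes exist only so the example runs without PyTorch installed, and in practice you would pass the generator module instead.

```python
def count_parameters(model, trainable_only=False):
    """Sum the element counts of a model's parameters.

    Works with any object exposing parameters(), such as a torch.nn.Module.
    """
    return sum(
        p.numel()
        for p in model.parameters()
        if not trainable_only or p.requires_grad
    )


# Stand-ins that mimic the small part of the PyTorch interface used above.
class FakeParam:
    def __init__(self, n, requires_grad=True):
        self._n = n
        self.requires_grad = requires_grad

    def numel(self):
        return self._n


class FakeModel:
    def __init__(self, params):
        self._params = params

    def parameters(self):
        return iter(self._params)


model = FakeModel([FakeParam(1000), FakeParam(500, requires_grad=False)])
print(count_parameters(model))                       # 1500
print(count_parameters(model, trainable_only=True))  # 1000
```

Dividing the result by 1e6 gives a figure in millions that can be compared with the numbers in the note.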

Implementation Authors

Implementation authors are:

Special thanks to

License

This code is licensed under the BSD 3-Clause License.

We referred to the following code and repositories.

References

Papers

Datasets

Owner
MINDs Lab
MINDsLab provides AI platform and various AI engines based on deep machine learning.