The simple project to separate mixed voice (2 clean voices) to 2 separate voices.

Last update: Oct 30, 2022

Related tags

Text Data & NLP speech_separation_PIT

Overview

Speech Separation

The simple project to separate mixed voice (2 clean voices) to 2 separate voices.

Result Example (Clisk to hear the voices): mix || prediction voice1 || prediction voice2

Mix Spectrogram

Predict Voice1's Spectrogram

Predict Voice2's Spectrogram

1. Quick train

Step 1:

Download LibriMixSmall, extract it and move it to the root of the project.

Step 2:

./train.sh

It will take about ONLY 2-3 HOURS to train with normal GPU. After each epoch, the prediction is generated to ./viz_outout folder.

2. Quick inference

./inference.sh The result will be generated to ./viz_outout folder.

3. More detail

Input: The Complex spectrogram. Get from the raw mixed audio signal
Output: The complex ratio mask (cRM) ---> complex spectrogram ---> separated voices.
Model: Use the simple version of this implementation , which is defined in paper Looking to Listen at the Cocktail Party: A Speaker-Independent Audio-Visual Model for Speech Separation
Loss function: Permutation Invariant Training Loss and PairWise Neg SisDr Loss (more SOTA)
Dataset: A small version of LibriMix dataset. I get from LibriMixSmall

4. Current problem

Due to small dataset size for fast training, the model is a bit overfitting to the training set. Use the bigger dataset will potentially help to overcome that. Some suggestions:

Use the original LibriMix Dataset which is way much bigger (around 60 times bigger that what I have trained).
Use this work to download much more in-the-wild dataset and use datasets/VoiceMixtureDataset.py instead of the Libri one that I am using. p/s I have trained and it work too.

The simple project to separate mixed voice (2 clean voices) to 2 separate voices.

Related tags

Overview

Speech Separation

1. Quick train

Step 1:

Step 2:

2. Quick inference

3. More detail

4. Current problem

Owner

vuthede

Biterm Topic Model (BTM): modeling topics in short texts

Exploring dimension-reduced embeddings

Natural Language Processing with transformers

LightSeq: A High-Performance Inference Library for Sequence Processing and Generation

Code Generation using a large neural network called GPT-J

MILES is a multilingual text simplifier inspired by LSBert - A BERT-based lexical simplification approach proposed in 2018. Unlike LSBert, MILES uses the bert-base-multilingual-uncased model, as well as simple language-agnostic approaches to complex word identification (CWI) and candidate ranking.

text to speech toolkit. 好用的中文语音合成工具箱，包含语音编码器、语音合成器、声码器和可视化模块。

easySpeech is an open-source Python wrapper for google speech to text API that doesn't require PyAudio(So you especially windows user don't have to deal with the errors while installing PyAudio) and also works with hugging face transformers

Machine Psychology: Python Generated Art

KakaoBrain KoGPT (Korean Generative Pre-trained Transformer)

KLUE-baseline contains the baseline code for the Korean Language Understanding Evaluation (KLUE) benchmark.

Dense Passage Retriever - is a set of tools and models for open domain Q&A task.

Chinese version of GPT2 training code, using BERT tokenizer.

Stack based programming language that compiles to x86_64 assembly or can alternatively be interpreted in Python

Big Bird: Transformers for Longer Sequences

A raytrace framework using taichi language

Source code for AAAI20 "Generating Persona Consistent Dialogues by Exploiting Natural Language Inference".

Wrapper to display a script output or a text file content on the desktop in sway or other wlroots-based compositors

This is the main repository of open-sourced speech technology by Huawei Noah's Ark Lab.

🦅 Pretrained BigBird Model for Korean (up to 4096 tokens)