PyTorch implementation of our paper LIMUSE: LIGHTWEIGHT MULTI-MODAL SPEAKER EXTRACTION.

Last update: Oct 26, 2022

LiMuSE

Overview

LiMuSE applies group communication (GC) to a multi-modal speaker extraction model and further compresses the model with a quantization strategy.

Model

Our proposed model is a multi-stream architecture that takes a multichannel mixture, the target speaker's enrolled utterance, and visual sequences of detected faces as inputs, and outputs the target speaker's mask in the time domain. The encoded audio representation of the mixture is then multiplied by the generated mask to obtain the target speech. Please see the figure below for the detailed model structure.
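The masking step described above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the repository's model: N = 128 and L = 16 are the values listed under Hyperparameters of LiMuSE, while the stride, the two-channel encoder input, the sigmoid mask, and all variable names are assumptions made for the example. The mask is random here; in the real model it would come from the extraction network.

```python
import torch
import torch.nn as nn

N, L = 128, 16      # number of filters and filter length (from the hyperparameter list)
STRIDE = L // 2     # assumption: 50% frame overlap, not stated in this README

# Assumed learnable encoder/decoder pair operating directly on the waveform.
encoder = nn.Conv1d(in_channels=2, out_channels=N, kernel_size=L,
                    stride=STRIDE, bias=False)
decoder = nn.ConvTranspose1d(in_channels=N, out_channels=1, kernel_size=L,
                             stride=STRIDE, bias=False)

mixture = torch.randn(1, 2, 16000)   # (batch, microphones, samples)
encoded = encoder(mixture)           # (batch, N, frames)

# Placeholder for the mask that the extraction network would produce from the
# mixture, the enrolled utterance, and the face sequence. Random for illustration.
mask = torch.sigmoid(torch.randn_like(encoded))

# Element-wise masking of the encoded mixture, then decoding back to a waveform.
target_estimate = decoder(encoded * mask)   # (batch, 1, samples)
print(encoded.shape, target_estimate.shape)
```

With these assumed settings the decoder returns a waveform with the same number of samples as the input mixture.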

Datasets

We evaluate our system on two-speaker speech separation and speaker extraction problems using the GRID dataset. The pretrained face embedding extraction network is trained on the LRW and MS-Celeb-1M datasets. We use the SMS-WSJ toolkit to obtain simulated anechoic dual-channel audio mixtures. Two microphones are placed at the center of the room, 7 cm apart.

Getting Started

Preparation

To adjust the framework configuration or the dataset path, modify the option/train/train.yml file.

Training

Specify the path to the train.yml file and run the training command:

python train.py -opt ./option/train/train.yml

This project supports both full-precision and quantization training. To switch between the full-precision and quantization stages, you need to modify both QA_flag values in the train.yml file: the QA_flag in the training settings controls weight quantization, while the one in net_conf controls activation quantization.
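Because the two flags have to change together, a small helper can reduce the chance of updating only one of them. The sketch below works on a plain dictionary standing in for the parsed train.yml. Only QA_flag and net_conf are names taken from the text above; the "train" key, the nesting, the boolean values, and the helper set_quantization are hypothetical, so check them against the actual file before reusing this.

```python
import copy

# Hypothetical stand-in for the parsed contents of option/train/train.yml.
# The "train" key and the boolean values are assumptions for illustration.
config = {
    "train": {"QA_flag": False},     # weight quantization switch
    "net_conf": {"QA_flag": False},  # activation quantization switch
}

def set_quantization(cfg, enabled):
    """Return a copy of cfg with both QA_flag values set to `enabled`."""
    new_cfg = copy.deepcopy(cfg)
    new_cfg["train"]["QA_flag"] = enabled
    new_cfg["net_conf"]["QA_flag"] = enabled
    return new_cfg

quantized = set_quantization(config, True)
print(quantized["train"]["QA_flag"], quantized["net_conf"]["QA_flag"])
```

The helper returns a copy, so the full-precision configuration stays available for the first training stage.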

View tensorboardX

tensorboard --logdir ./tensorboard

Result

Hyperparameters of LiMuSE

Symbol	Description	Value
N	Number of filters in auto-encoder	128
L	Length of the filters (in audio samples)	16
T	Temperature	5
X	Number of GC-equipped TCN blocks in each repeat	6
Ra	Number of repeats in audio block	2
Rb	Number of repeats in fusion block	1
K	Number of groups	-

Performance of LiMuSE and TasNet under various configurations. Q stands for quantization, GC for group communication, VIS for the visual cue, and VP for the voiceprint cue. Model size and compression ratio are also reported; the compression ratio is relative to the 35.8MB LiMuSE (w/o Q and GC) model.

Method	K	SI-SDR (dB)	#Params	Model Size	Compression Ratio
LiMuSE	32	16.72	0.36M	0.16MB	223.75
LiMuSE	16	18.08	0.96M	0.40MB	89.50
LiMuSE (w/o Q)	32	23.77	0.36M	1.44MB	24.86
LiMuSE (w/o Q)	16	24.90	0.96M	3.84MB	9.32
LiMuSE (w/o Q and VP)	32	18.60	0.19M	0.76MB	47.11
LiMuSE (w/o Q and VP)	16	24.20	0.52M	2.08MB	17.21
LiMuSE (w/o Q and VIS)	32	15.68	0.22M	0.88MB	40.68
LiMuSE (w/o Q and VIS)	16	21.91	0.55M	2.20MB	16.27
LiMuSE (w/o Q and GC)	-	23.67	8.95M	35.8MB	1
TasNet (dual-channel)	-	19.94	2.48M	9.92MB	-
TasNet (single-channel)	-	13.15	2.48M	9.92MB	-
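The size and ratio columns can be reproduced with simple arithmetic. The sketch below assumes that full-precision weights occupy 4 bytes each and that 1 MB means 10^6 bytes; the README does not state this, but it matches the rows without quantization (for example 0.36M parameters and 1.44MB). The function names are illustrative only.

```python
BYTES_PER_FLOAT32 = 4    # assumption: full-precision weights are 32-bit floats
BASELINE_SIZE_MB = 35.8  # LiMuSE (w/o Q and GC), the row with compression ratio 1

def full_precision_size_mb(params_millions):
    """Model size in MB for a full-precision model with the given parameter count."""
    return params_millions * BYTES_PER_FLOAT32

def compression_ratio(model_size_mb):
    """Size of the baseline model divided by the size of the given model."""
    return BASELINE_SIZE_MB / model_size_mb

print(round(full_precision_size_mb(0.36), 2))  # LiMuSE (w/o Q), K=32 -> 1.44
print(round(compression_ratio(1.44), 2))       # LiMuSE (w/o Q), K=32 -> 24.86
print(round(compression_ratio(0.16), 2))       # LiMuSE, K=32        -> 223.75
```

The quantized model sizes (0.16MB and 0.40MB) are taken from the table as given, since the bit widths used for quantization are not stated here.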

Citations

If you find this repo helpful, please consider citing:

@inproceedings{liu2021limuse,
  title={LIMUSE: LIGHTWEIGHT MULTI-MODAL SPEAKER EXTRACTION},
  author={Liu, Qinghua and Huang, Yating and Hao, Yunzhe and Xu, Jiaming and Xu, Bo},
  booktitle={arXiv:2111.04063},
  year={2021},
}

Owner

Auditory Model and Cognitive Computing Lab