Unofficial reimplementation of ECAPA-TDNN for speaker recognition (EER=0.86 for Vox1_O when trained only on Vox2)

Last update: Dec 31, 2022

Overview

Introduction

This repository contains my unofficial reimplementation of the standard ECAPA-TDNN, a speaker recognition system trained on the VoxCeleb2 dataset.

This repository is adapted from voxceleb_trainer.

Best Performance in this project (with AS-norm)

Dataset	Vox1_O	Vox1_E	Vox1_H
EER (%)	0.86	1.18	2.17
minDCF	0.0686	0.0765	0.1295
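For readers unfamiliar with the EER metric reported above, here is a minimal sketch of how an equal error rate can be computed from trial scores. It is not this repository's evaluation code: the function name `compute_eer` and the toy scores are illustrative assumptions, and ties between scores are handled only approximately.

```python
def compute_eer(scores, labels):
    """Return (eer, threshold) for trial scores.

    labels: 1 for target (same-speaker) trials, 0 for non-target trials.
    A trial is accepted when its score is strictly above the threshold.
    """
    n_target = sum(labels)
    n_nontarget = len(labels) - n_target
    if n_target == 0 or n_nontarget == 0:
        raise ValueError("need at least one target and one non-target trial")

    # Sweep the threshold upward through the sorted scores.
    pairs = sorted(zip(scores, labels))
    false_rejects = 0            # targets at or below the threshold
    false_accepts = n_nontarget  # non-targets above the threshold
    best_gap, eer, threshold = float("inf"), 1.0, pairs[0][0]
    for score, label in pairs:
        if label == 1:
            false_rejects += 1
        else:
            false_accepts -= 1
        frr = false_rejects / n_target
        far = false_accepts / n_nontarget
        # The EER is the operating point where FRR and FAR are closest.
        if abs(frr - far) < best_gap:
            best_gap, eer, threshold = abs(frr - far), (frr + far) / 2, score
    return eer, threshold


# Toy example (made-up scores, not results from this project).
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
toy_labels = [1, 1, 1, 0, 1, 0, 0, 0]
eer, thr = compute_eer(toy_scores, toy_labels)
print(f"EER = {100 * eer:.2f}% at threshold {thr}")  # EER = 25.00% at threshold 0.4
```

The table reports EER as a percentage, so a value of 0.86 corresponds to 0.0086 on the 0 to 1 scale returned by this sketch.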

System Description

I will write a technical report describing this system in detail later. Please wait.

Dependencies

Note: These settings match my device. You can change the torch and torchaudio versions to suit yours.

Starting from a new environment

conda create -n ECAPA python=3.7.9 anaconda
conda activate ECAPA
pip install -r requirements.txt

Starting from an existing environment

pip install -r requirements.txt

Data preparation

Please prepare your VoxCeleb2 dataset by following the 'Data preparation' part of the official code repository.

Datasets used for training:

VoxCeleb2 training set;
MUSAN dataset;
RIR dataset.
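In ECAPA-TDNN recipes, MUSAN is commonly used for additive-noise augmentation and the RIR dataset for reverberation. This README does not describe the augmentation in detail, so the following is only a generic sketch of mixing a noise clip into speech at a chosen signal-to-noise ratio. The function name `mix_at_snr` and the toy signals are assumptions for illustration, not code from this repository.

```python
import math


def mix_at_snr(speech, noise, snr_db):
    """Add `noise` to `speech` so the speech-to-noise power ratio is `snr_db` (dB).

    Both inputs are lists of float samples. The noise is repeated or cut
    so that it matches the speech length.
    """
    if not speech or not noise:
        raise ValueError("speech and noise must be non-empty")
    if len(noise) < len(speech):
        noise = noise * (len(speech) // len(noise) + 1)
    noise = noise[:len(speech)]

    speech_power = sum(x * x for x in speech) / len(speech)
    noise_power = sum(x * x for x in noise) / len(noise)
    if noise_power == 0.0:
        return list(speech)  # silent noise clip: nothing to add

    # Choose gain g so that speech_power / (g^2 * noise_power) = 10^(snr_db / 10).
    gain = math.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]


# Toy signals standing in for a speech utterance and a noise clip.
speech = [math.sin(0.01 * i) for i in range(1000)]
noise = [((i * 37) % 11 - 5) / 5 for i in range(300)]
noisy = mix_at_snr(speech, noise, snr_db=10.0)
```

In a real pipeline the samples would come from audio files, and the SNR would typically be drawn at random per utterance.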

Datasets used for evaluation:

VoxCeleb1 test set for Vox1_O
VoxCeleb1 train set for Vox1_E and Vox1_H (Optional)

Training

First set the data paths in trainECAPAModel.py. Then train the ECAPA-TDNN model end-to-end with:

python trainECAPAModel.py --save_path exps/exp1

Every test_step epochs, the system is evaluated on the Vox1_O set and the EER is printed.

The results are saved in exps/exp1/score.txt. The model is saved in exps/exp1/model.

In my case, I trained for 80 epochs on a single 3090 GPU. Each epoch takes about 37 minutes, and the total training time is about 48 hours.

Pretrained model

Our pretrained model achieves an EER of 0.96 on the Vox1_O set without AS-norm. You can verify this with:

python trainECAPAModel.py --eval --initial_model exps/pretrain.model

With AS-norm, this system achieves an EER of 0.86. We will release the AS-norm code later.

We also provide the score file in exps/pretrain_score.txt. It contains the training loss, training accuracy, and Vox1_O EER for each epoch for your reference.
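Since the AS-norm code is not part of this repository yet, the following is a minimal sketch of one common form of adaptive s-norm, included only to illustrate the idea. It is not the author's implementation: the function name `as_norm`, the `top_n` default, and the toy cohort scores are all assumptions. Each side of a trial is scored against a cohort of impostor embeddings, the `top_n` highest cohort scores are kept, and the raw score is normalised by their mean and standard deviation.

```python
import statistics


def as_norm(raw_score, enroll_cohort_scores, test_cohort_scores, top_n=300):
    """Adaptive s-norm of one trial score.

    enroll_cohort_scores: scores of the enrollment utterance against the cohort.
    test_cohort_scores:   scores of the test utterance against the cohort.
    top_n:                number of highest cohort scores kept (hypothetical default).
    """
    def top_stats(cohort_scores):
        top = sorted(cohort_scores, reverse=True)[:top_n]
        mean = statistics.mean(top)
        # Guard against a zero standard deviation on degenerate cohorts.
        std = max(statistics.pstdev(top), 1e-8)
        return mean, std

    mu_e, sigma_e = top_stats(enroll_cohort_scores)
    mu_t, sigma_t = top_stats(test_cohort_scores)
    # Average of the two z-normalised versions of the raw score.
    return 0.5 * ((raw_score - mu_e) / sigma_e + (raw_score - mu_t) / sigma_t)


# Toy example with made-up cohort scores.
normalised = as_norm(0.6, [0.1, 0.2, 0.3, 0.4], [0.0, 0.2, 0.4, 0.6], top_n=2)
print(normalised)
```

The normalised scores would then replace the raw scores when computing EER and minDCF.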

Reference

@inproceedings{desplanques2020ecapa,
  title={{ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification}},
  author={Desplanques, Brecht and Thienpondt, Jenthe and Demuynck, Kris},
  booktitle={Interspeech 2020},
  pages={3830--3834},
  year={2020}
}
@inproceedings{chung2020in,
  title={In defence of metric learning for speaker recognition},
  author={Chung, Joon Son and Huh, Jaesung and Mun, Seongkyu and Lee, Minjae and Heo, Hee Soo and Choe, Soyeon and Ham, Chiheon and Jung, Sunghwan and Lee, Bong-Jin and Han, Icksang},
  booktitle={Interspeech},
  year={2020}
}

Acknowledgements

We studied many useful projects while writing this code, including:

clovaai/voxceleb_trainer
lawlict/ECAPA-TDNN
speechbrain/speechbrain
ranchlai/speaker-verification

Thanks to these authors for open-sourcing their code!

Notes

If you run into problems with this repository, please open an issue on GitHub (in English) instead of messaging me on bilibili, so that others can also benefit. Thanks for your understanding!

If you improve on these results using this repository, please let me know. Thanks!


Owner

Tao Ruijie
