Official codes for the paper "Learning Hierarchical Discrete Linguistic Units from Visually-Grounded Speech"

Overview

ResDAVEnet-VQ

Official PyTorch implementation of Learning Hierarchical Discrete Linguistic Units from Visually-Grounded Speech

What is in this repo?

  • Multi-GPU training of ResDAVEnet-VQ
  • Quantitative evaluation
    • Image-to-speech and speech-to-image retrieval
    • ZeroSpeech 2019 ABX phone-discriminability test
    • Word detection
  • Qualitative evaluation
    • Visualize time-aligned word/phone/code transcripts
    • F1/recall/precision scatter plots for model/layer comparison

alt text

If you find the code useful, please cite

@inproceedings{Harwath2020Learning,
  title={Learning Hierarchical Discrete Linguistic Units from Visually-Grounded Speech},
  author={David Harwath and Wei-Ning Hsu and James Glass},
  booktitle={International Conference on Learning Representations},
  year={2020},
  url={https://openreview.net/forum?id=B1elCp4KwH}
}

Pre-trained models

Model [email protected] Link MD5 sum
{} 0.735 gDrive e3f94990c72ce9742c252b2e04f134e4
{}->{2} 0.760 gDrive d8ebaabaf882632f49f6aea0a69516eb
{}->{3} 0.794 gDrive 2c3a269c70005cbbaaa15fc545da93fa
{}->{2,3} 0.787 gDrive d0764d8e97187c8201f205e32b5f7fee
{2} 0.753 gDrive d68c942069fcdfc3944e556f6af79c60
{2}->{2,3} 0.764 gDrive 09e704f8fcd9f85be8c4d5bdf779bd3b
{2}->{2,3}->{2,3,4} 0.793 gDrive 6e403e7f771aad0c95f087318bf8447e
{3} 0.734 gDrive a0a3d5adbbd069a2739219346c8a8f70
{3}->{2,3} 0.760 gDrive 6c92bcc4445895876a7840bc6e88892b
{2,3} 0.667 gDrive 7a98a661302939817a1450d033bc2fcc

Data preparation

Download the MIT Places Image/Audio Data

We use MIT Places scene recognition database (Places Image) and a paired MIT Places Audio Caption Corpus (Places Audio) as visually-grounded speech, which contains roughly 400K image/spoken caption pairs, to train ResDAVEnet-VQ.

  • Places Image can be downloaded here
  • Places Audio can be downloaded here

Optional data preprocessing

Data specifcation files can be found at metadata/{train,val}.json inside the Places Audio directory; however, they do not include the time-aligned word transcripts for analysis. Those with alignments can be downloaded here:

Open the *.json files and update the values of image_base_path and audio_base_path to reflect the path where the image and the audio datasets are stored.

To speed up data loading, we save images and audio data into the HDF5 binary files, and use the h5py Python interface to access the data. The corresponding PyTorch Dataset class is ImageCaptionDatasetHDF5 in ./dataloaders/image_caption_dataset_hdf5.py. To prepare HDF5 datasets, run

./scripts/preprocess.sh

(We do support on-the-fly feature processing with the ImageCaptionDataset class in ./dataloaders/image_caption_dataset.py, which takes a data specification file as input (e.g., metadata/train.json). However, this can be very slow)

ImageCaptionDataset and ImageCaptionDatasetHDF5 are interchangeable, but most scripts in this repo assume the preprocessed HDF5 dataset is available. Users would have to modify the code correspondingly to use ImageCaptionDataset.

Interactive Qualtitative Evaluation

See run_evaluations.ipynb

Quantitative Evaluation

ZeroSpeech 2019 ABX Phone Discriminability Test

Users need to download the dataset and the Docker image by following the instructions here.

To extract ResDAVEnet-VQ features, see ./scripts/dump_zs19_abx.sh.

Word detection

See ./run_unit_analysis.py. It needs both HDF5 dataset and the original JSON dataset to get the time-aligned word transcripts.

Example:

python run_unit_analysis.py --hdf5_path=$hdf5_path --json_path=$json_path \
  --exp_dir=$exp_dir --layer=$layer --output_dir=$out_dir

Cross-modal retrieval

See ./run_ResDavenetVQ.py. Set --mode=eval for retrieval evaluation.

Example:

python run_ResDavenetVQ.py --resume=True --mode=eval \
  --data-train=$data_tr --data-val=$data_dt \
  --exp-dir="./exps/pretrained/RDVQ_01000_01100_01110"

Training

See ./scripts/train.sh.

To train a model from scratch with the 2nd and 3rd layers quantized, run

./scripts/train.sh 01100 RDVQ_01100 ""

To train a model with the 2nd and 3rd layers quantized, and initialize weights from a pre-trained model (e.g., ./exps/RDVQ_00000), run

./scripts/train.sh 01100 RDVQ_01100 "--seed-dir ./exps/RDVQ_00000"
Owner
Wei-Ning Hsu
Research Scientist @ Facebook AI Research (FAIR). Former PhD Student @ MIT Spoken Language Systems Group
Wei-Ning Hsu
Library for machine learning stacking generalization.

stacked_generalization Implemented machine learning *stacking technic[1]* as handy library in Python. Feature weighted linear stacking is also availab

114 Jul 19, 2022
Deep Ensembling with No Overhead for either Training or Testing: The All-Round Blessings of Dynamic Sparsity

[ICLR 2022] Deep Ensembling with No Overhead for either Training or Testing: The All-Round Blessings of Dynamic Sparsity by Shiwei Liu, Tianlong Chen, Zahra Atashgahi, Xiaohan Chen, Ghada Sokar, Elen

VITA 18 Dec 31, 2022
Official implementation of Monocular Quasi-Dense 3D Object Tracking

Monocular Quasi-Dense 3D Object Tracking Monocular Quasi-Dense 3D Object Tracking (QD-3DT) is an online framework detects and tracks objects in 3D usi

Visual Intelligence and Systems Group 441 Dec 20, 2022
Optimus: the first large-scale pre-trained VAE language model

Optimus: the first pre-trained Big VAE language model This repository contains source code necessary to reproduce the results presented in the EMNLP 2

314 Dec 19, 2022
DataCLUE: 国内首个以数据为中心的AI测评(含模型分析报告)

DataCLUE: A Benchmark Suite for Data-centric NLP You can get the english version of README. 以数据为中心的AI测评(DataCLUE) 内容导引 章节 描述 简介 介绍以数据为中心的AI测评(DataCLUE

CLUE benchmark 135 Dec 22, 2022
Local trajectory planner based on a multilayer graph framework for autonomous race vehicles.

Graph-Based Local Trajectory Planner The graph-based local trajectory planner is python-based and comes with open interfaces as well as debug, visuali

TUM - Institute of Automotive Technology 160 Jan 04, 2023
Retrieve and analysis data from SDSS (Sloan Digital Sky Survey)

Author: Behrouz Safari License: MIT sdss A python package for retrieving and analysing data from SDSS (Sloan Digital Sky Survey) Installation Install

Behrouz 3 Oct 28, 2022
Artificial Intelligence playing minesweeper 🤖

AI playing Minesweeper ✨ Minesweeper is a single-player puzzle video game. The objective of the game is to clear a rectangular board containing hidden

Vaibhaw 8 Oct 17, 2022
A minimal TPU compatible Jax implementation of NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis

NeRF Minimal Jax implementation of NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis. Result of Tiny-NeRF RGB Depth

Soumik Rakshit 11 Jul 24, 2022
Pytorch implementation of DeepMind's differentiable neural computer paper.

DNC pytorch This is a Pytorch implementation of DeepMind's Differentiable Neural Computer (DNC) architecture introduced in their recent Nature paper:

Yuanpu Xie 91 Nov 21, 2022
Implementation of the method described in the Speech Resynthesis from Discrete Disentangled Self-Supervised Representations.

Speech Resynthesis from Discrete Disentangled Self-Supervised Representations Implementation of the method described in the Speech Resynthesis from Di

4 Mar 11, 2022
CLNTM - Contrastive Learning for Neural Topic Model

Contrastive Learning for Neural Topic Model This repository contains the impleme

Thong Thanh Nguyen 25 Nov 24, 2022
Api for getting bin info and getting encrypted card details for adyen.

Bin Info And Adyen Cse Enc Python api for getting bin info and getting encrypted

Roldex Stark 8 Dec 30, 2022
Recursive Bayesian Networks

Recursive Bayesian Networks This repository contains the code to reproduce the results from the NeurIPS 2021 paper Lieck R, Rohrmeier M (2021) Recursi

Robert Lieck 11 Oct 18, 2022
Code for “ACE-HGNN: Adaptive Curvature ExplorationHyperbolic Graph Neural Network”

ACE-HGNN: Adaptive Curvature Exploration Hyperbolic Graph Neural Network This repository is the implementation of ACE-HGNN in PyTorch. Environment pyt

9 Nov 28, 2022
RoMa: A lightweight library to deal with 3D rotations in PyTorch.

RoMa: A lightweight library to deal with 3D rotations in PyTorch. RoMa (which stands for Rotation Manipulation) provides differentiable mappings betwe

NAVER 90 Dec 27, 2022
Precomputed Real-Time Texture Synthesis with Markovian Generative Adversarial Networks

MGANs Training & Testing code (torch), pre-trained models and supplementary materials for "Precomputed Real-Time Texture Synthesis with Markovian Gene

290 Nov 15, 2022
This GitHub repository contains code used for plots in NeurIPS 2021 paper 'Stochastic Multi-Armed Bandits with Control Variates.'

About Repository This repository contains code used for plots in NeurIPS 2021 paper 'Stochastic Multi-Armed Bandits with Control Variates.' About Code

Arun Verma 1 Nov 09, 2021
Taking A Closer Look at Domain Shift: Category-level Adversaries for Semantics Consistent Domain Adaptation

Taking A Closer Look at Domain Shift: Category-level Adversaries for Semantics Consistent Domain Adaptation (CVPR2019) This is a pytorch implementatio

Yawei Luo 280 Jan 01, 2023
Video-based open-world segmentation

UVO_Challenge Team Alpes_runner Solutions This is an official repo for our UVO Challenge solutions for Image/Video-based open-world segmentation. Our

Yuming Du 84 Dec 22, 2022