YouRefIt: Embodied Reference Understanding with Language and Gesture

Last update: Jul 11, 2022

Related tags

Deep Learning YouRefIt_ERU

Overview

YouRefIt: Embodied Reference Understanding with Language and Gesture

by Yixin Chen, Qing Li, Deqian Kong, Yik Lun Kei, Tao Gao, Yixin Zhu, Song-Chun Zhu and Siyuan Huang

The IEEE International Conference on Computer Vision (ICCV), 2021

Introduction

We study the machine's understanding of embodied reference: One agent uses both language and gesture to refer to an object to another agent in a shared physical environment. To tackle this problem, we introduce YouRefIt, a new crowd-sourced, real-world dataset of embodied reference.

For more details, please refer to our paper.

Checklist

Image ERU
Video ERU

Installation

The code was tested with the following environment: Ubuntu 18.04/20.04, python 3.7/3.8, pytorch 1.9.1. Run

    git clone https://github.com/yixchen/YouRefIt_ERU
    pip install -r requirements.txt

Dataset

Download the YouRefIt dataset from Dataset Request Page and put under ./ln_data

Model weights

Yolov3: download the pretrained model and place the file in ./saved_models by
```
sh saved_models/yolov3_weights.sh
```
More pretrained models are availble Google drive, and should also be placed in ./saved_models.

Make sure to put the files in the following structure:

|-- ROOT
|	|-- ln_data
|		|-- yourefit
|			|-- images
|			|-- paf
|			|-- saliency
|	|-- saved_modeks
|		|-- final_model_full.tar
|		|-- final_resc.tar

Training

Train the model, run the code under main folder.

python train.py --data_root ./ln_data/ --dataset yourefit --gpu gpu_id

Evaluation

Evaluate the model, run the code under main folder. Using flag --test to access test mode.

python train.py --data_root ./ln_data/ --dataset yourefit --gpu gpu_id \
 --resume saved_models/model.pth.tar \
 --test

Evaluate Image ERU on our released model

Evaluate our full model with PAF and saliency feature, run

python train.py --data_root ./ln_data/ --dataset yourefit  --gpu gpu_id \
 --resume saved_models/final_model_full.tar --use_paf --use_sal --large --test

Evaluate baseline model that only takes images as input, run

python train.py --data_root ./ln_data/ --dataset yourefit  --gpu gpu_id \
 --resume saved_models/final_resc.tar --large --test

Evalute the inference results on test set on different IOU levels by changing the path accordingly,

 python evaluate_results.py

Citation

@inProceedings{chen2021yourefit,
 title={YouRefIt: Embodied Reference Understanding with Language and Gesture},
 author = {Chen, Yixin and Li, Qing and Kong, Deqian and Kei, Yik Lun and Zhu, Song-Chun and Gao, Tao and Zhu, Yixin and Huang, Siyuan},
 booktitle={The IEEE International Conference on Computer Vision (ICCV),
 year={2021}
 }

Acknowledgement

Our code is built on ReSC and we thank the authors for their hard work.

YouRefIt: Embodied Reference Understanding with Language and Gesture

Related tags

Overview

YouRefIt: Embodied Reference Understanding with Language and Gesture

Introduction

Checklist

Installation

Dataset

Model weights

Training

Evaluation

Evaluate Image ERU on our released model

Citation

Acknowledgement

Owner

PuppetGAN - Cross-Domain Feature Disentanglement and Manipulation just got way better! 🚀

Si Adek Keras is software VR dangerous object detection.

A Python implementation of the Locality Preserving Matching (LPM) method for pruning outliers in image matching.

Event queue (Equeue) dialect is an MLIR Dialect that models concurrent devices in terms of control and structure.

Just Randoms Cats with python

Object-aware Contrastive Learning for Debiased Scene Representation

A curated list of awesome neural radiance fields papers

Mae segmentation - Reproduction of semantic segmentation using masked autoencoder (mae)

Reproduces the results of the paper "Finite Basis Physics-Informed Neural Networks (FBPINNs): a scalable domain decomposition approach for solving differential equations".

Pytorch Implementation of paper "Noisy Natural Gradient as Variational Inference"

Self-Supervised Monocular 3D Face Reconstruction by Occlusion-Aware Multi-view Geometry Consistency[ECCV 2020]

A Broader Picture of Random-walk Based Graph Embedding

Deploy a ML inference service on a budget in less than 10 lines of code.

MakeItTalk: Speaker-Aware Talking-Head Animation

Code for Piggyback: Adapting a Single Network to Multiple Tasks by Learning to Mask Weights

This repository contains a CBIR system that uses swin transformer to extract image's feature.

Code for the paper: On Pathologies in KL-Regularized Reinforcement Learning from Expert Demonstrations

BirdCLEF 2021 - Birdcall Identification 4th place solution

PCACE: A Statistical Approach to Ranking Neurons for CNN Interpretability

Good Classification Measures and How to Find Them