Hierarchical Cross-modal Talking Face Generation with Dynamic Pixel-wise Loss (ATVGnet)

By Lele Chen, Ross K. Maddox, Zhiyao Duan, and Chenliang Xu.

University of Rochester.

Table of Contents

  1. Introduction
  2. Citation
  3. Running
  4. Model
  5. Results
  6. Disclaimer and known issues

Introduction

This repository contains the original models (AT-net, VG-net) described in the paper Hierarchical Cross-modal Talking Face Generation with Dynamic Pixel-wise Loss. The demo video is available at https://youtu.be/eH7h_bDRX2Q. This code can be applied directly to the LRW and GRID datasets. The outputs of the model are visualized below: the first is the synthesized landmarks from AT-net; the rest are the attention map, the motion map, and the final results from VG-net.

[figure: example outputs from AT-net and VG-net]

Citation

If you use any code, models, or ideas from this repo in your research, please cite:

@inproceedings{chen2019hierarchical,
  title={Hierarchical cross-modal talking face generation with dynamic pixel-wise loss},
  author={Chen, Lele and Maddox, Ross K and Duan, Zhiyao and Xu, Chenliang},
  booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
  pages={7832--7841},
  year={2019}
}

Running

  1. This code is tested under Python 2.7. The model we provide is trained on LRW; however, it also works on GRID, VoxCeleb, and other datasets, so you can directly compare it against your own model on other datasets. We treat this as a fair comparison.

  2. PyTorch environment: PyTorch 0.4.1 (conda install pytorch=0.4.1 torchvision cuda90 -c pytorch).

  3. Install the required packages (pip install -r requirement.txt).

  4. Download the pretrained ATnet and VGnet weights from Google Drive and put them under the model folder.

  5. Run the demo code: python demo.py (an example command is given after the option list)

    • -device_ids: gpu id
    • -cuda: using cuda or not
    • -vg_model: pretrained VGnet weight
    • -at_model: pretrained ATnet weight
    • -lstm: use lstm or not
    • -p: input example image
    • -i: input audio file
    • -sample_dir: folder to save the outputs
    • ...
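
    For example, a typical invocation might look like the line below. All file paths and the GPU id are placeholders, and the exact value format of flags such as -cuda depends on how demo.py defines its arguments, so check its argparse block if a flag is rejected.

    python demo.py -cuda -device_ids 0 -at_model model/atnet.pth -vg_model model/vgnet.pth -p image/example.jpg -i audio/example.wav -sample_dir results/
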
  6. Download and unzip the training data from LRW

  7. Preprocess the data (extract the facial landmarks and crop the images with dlib); a rough sketch of this step is given below.
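
    The official preprocessing script is not released yet (see the known issues below), so the snippet below is only a minimal sketch of the idea: detect the face with dlib, extract the standard 68 facial landmarks, and crop and resize the face region. The predictor file name, the 128x128 output size, and the cropping strategy are assumptions rather than values taken from this repo.

    # Hypothetical preprocessing sketch -- not the official script.
    import cv2
    import dlib
    import numpy as np

    detector = dlib.get_frontal_face_detector()
    # Assumes the standard dlib 68-point model, downloaded separately.
    predictor = dlib.shape_predictor('shape_predictor_68_face_landmarks.dat')

    def extract_landmarks_and_crop(image_path, out_size=128):
        img = cv2.imread(image_path)
        gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        faces = detector(gray, 1)
        if len(faces) == 0:
            return None, None
        rect = faces[0]
        shape = predictor(gray, rect)
        landmarks = np.array([[p.x, p.y] for p in shape.parts()], dtype=np.float32)  # (68, 2)
        # Crop the detected face box and resize it to a square image.
        x, y, w, h = rect.left(), rect.top(), rect.width(), rect.height()
        crop = cv2.resize(img[max(y, 0):y + h, max(x, 0):x + w], (out_size, out_size))
        # Rescale the landmarks into the cropped, resized coordinate frame.
        landmarks = (landmarks - [x, y]) * (float(out_size) / np.array([w, h], dtype=np.float32))
        return crop, landmarks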

  8. Train the ATnet model: python atnet.py (an example command is given after the option list)

    • -device_ids: gpu id
    • -batch_size: batch size
    • -model_dir: folder to save weights
    • -lstm: use lstm or not
    • -sample_dir: folder to save visualized images during training
    • ...
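
    A possible training invocation is sketched below; the GPU id, batch size, and folder names are placeholders, and further flags such as -lstm are defined in atnet.py.

    python atnet.py -device_ids 0 -batch_size 16 -model_dir checkpoints/atnet/ -sample_dir samples/atnet/
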
  9. Test the ATnet model: python atnet_test.py (an example command is given after the option list)

    • -device_ids: gpu id
    • -batch_size: batch size
    • -model_name: pretrained weights
    • -sample_dir: folder to save the outputs
    • -lstm: use lstm or not
    • ...
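
    For example, assuming the pretrained ATnet weight from step 4 (the file name and output folder are placeholders):

    python atnet_test.py -device_ids 0 -batch_size 16 -model_name model/atnet.pth -sample_dir results/atnet_test/
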
  10. Train the VGnet: python vgnet.py (an example command is given after the option list)

    • -device_ids: gpu id
    • -batch_size: batch size
    • -model_dir: folder to save weights
    • -sample_dir: folder to save visualized images during training
    • ...
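
    A possible invocation, again with placeholder values for the GPU id, batch size, and folders:

    python vgnet.py -device_ids 0 -batch_size 16 -model_dir checkpoints/vgnet/ -sample_dir samples/vgnet/
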
  11. Test the VGnet: python vgnet_test.py (an example command is given after the option list)

    • -device_ids: gpu id
    • -batch_size: batch size
    • -model_name: pretrained weights
    • -sample_dir: folder to save the outputs
    • ...
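
    For example, assuming the pretrained VGnet weight from step 4 (the file name and output folder are placeholders):

    python vgnet_test.py -device_ids 0 -batch_size 16 -model_name model/vgnet.pth -sample_dir results/vgnet_test/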

Model

  1. Overall ATVGnet model

  2. Regression-based discriminator network

    [figure: network architecture diagram]

Results

  1. Result visualization on different datasets:

    [figure]

  2. Results compared with other SOTA methods:

    [figure]

  3. Studies of image robustness with respect to landmark accuracy:

    [figure]

  4. Quantitative results:

    [figure]

Disclaimer and known issues

  1. This code is implemented in PyTorch.
  2. In this paper, we train on LRW and GRID separately.
  3. The models are sensitive to the input images, so please use the correct preprocessing code.
  4. The data preprocessing code is not finished yet and will be released soon, but you can already try the model with your own images.
  5. If you want to train these models with this version of PyTorch without modifications, please note that:
    • You need at least 12 GB of GPU memory.
    • There might be some other untested issues.
  6. There is other interesting and useful research on audio-to-landmark generation; please check it out at https://github.com/eeskimez/Talking-Face-Landmarks-from-Speech.

Todos

  • Release training data

License

MIT
