Video Autoencoder: self-supervised disentanglement of 3D structure and motion

Overview

Video Autoencoder: self-supervised disentanglement of 3D structure and motion

This repository contains the code (in PyTorch) for the model introduced in the following paper:

Video Autoencoder: self-supervised disentanglement of 3D structure and motion
Zihang Lai, Sifei Liu, Alexi A. Efros, Xiaolong Wang
ICCV, 2021
[Paper] [Project Page] [12-min oral pres. video] [3-min supplemental video]

Figure

Citation

@inproceedings{Lai21a,
        title={Video Autoencoder: self-supervised disentanglement of 3D structure and motion},
        author={Lai, Zihang and Liu, Sifei and Efros, Alexei A and Wang, Xiaolong},
        booktitle={ICCV},
        year={2021}
}

Contents

  1. Introduction
  2. Data preparation
  3. Training
  4. Evaluation
  5. Pretrained model

Introduction

Figure We present Video Autoencoder for learning disentangled representations of 3D structure and camera pose from videos in a self-supervised manner. Relying on temporal continuity in videos, our work assumes that the 3D scene structure in nearby video frames remains static. Given a sequence of video frames as input, the Video Autoencoder extracts a disentangled representation of the scene including: (i) a temporally-consistent deep voxel feature to represent the 3D structure and (ii) a 3D trajectory of camera poses for each frame. These two representations will then be re-entangled for rendering the input video frames. Video Autoencoder can be trained directly using a pixel reconstruction loss, without any ground truth 3D or camera pose annotations. The disentangled representation can be applied to a range of tasks, including novel view synthesis, camera pose estimation, and video generation by motion following. We evaluate our method on several large-scale natural video datasets, and show generalization results on out-of-domain images.

Dependencies

The following dependencies are not strict - they are the versions that we use.

Data preparation

RealEstate10K:

  1. Download the dataset from RealEstate10K.
  2. Download videos from RealEstate10K dataset, decode videos into frames. You might find the RealEstate10K_Downloader written by cashiwamochi helpful. Organize the data files into the following structure:
RealEstate10K/
    train/
        0000cc6d8b108390.txt
        00028da87cc5a4c4.txt
        ...
    test/
        000c3ab189999a83.txt
        000db54a47bd43fe.txt
        ...
dataset/
    train/
        0000cc6d8b108390/
            52553000.jpg
            52586000.jpg
            ...
        00028da87cc5a4c4/
            ...
    test/
        000c3ab189999a83/
        ...
  1. Subsample the training set at one-third of the original frame-rate (so that the motion is sufficiently large). You can use scripts/subsample_dataset.py.
  2. A list of videos ids that we used (10K for training and 5K for testing) is provided here:
    1. Training video ids and testing video ids.
    2. Note: as time changes, the availability of videos could change.

Matterport 3D (this could be tricky):

  1. Install habitat-api and habitat-sim. You need to use the following repo version (see this SynSin issue for details):

    1. habitat-sim: d383c2011bf1baab2ce7b3cd40aea573ad2ddf71
    2. habitat-api: e94e6f3953fcfba4c29ee30f65baa52d6cea716e
  2. Download the models from the Matterport3D dataset and the point nav datasets. You should have a dataset folder with the following data structure:

    root_folder/
         mp3d/
             17DRP5sb8fy/
                 17DRP5sb8fy.glb  
                 17DRP5sb8fy.house  
                 17DRP5sb8fy.navmesh  
                 17DRP5sb8fy_semantic.ply
             1LXtFkjw3qL/
                 ...
             1pXnuDYAj8r/
                 ...
             ...
         pointnav/
             mp3d/
                 ...
    
  3. Walk-through videos for pretraining: We use a ShortestPathFollower function provided by the Habitat navigation package to generate episodes of tours of the rooms. See scripts/generate_matterport3d_videos.py for details.

  4. Training and testing view synthesis pairs: we generally follow the same steps as the SynSin data instruction. The main difference is that we precompute all the image pairs. See scripts/generate_matterport3d_train_image_pairs.py and scripts/generate_matterport3d_test_image_pairs.py for details.

###Replica:

  1. Testing view synthesis pairs: This procedure is similar to step 4 in Matterport3D - with only the specific dataset changed. See scripts/generate_replica_test_image_pairs.py for details.

Configurations

Finally, change the data paths in configs/dataset.yaml to your data location.

Pre-trained models

  • Pre-trained model (RealEstate10K): Link
  • Pre-trained model (Matterport3D): Link

Training:

Use this script:

CUDA_VISIBLE_DEVICES=0,1 python train.py --savepath log/train --dataset RealEstate10K

Some optional commands (w/ default value in square bracket):

  • Select dataset: --dataset [RealEstate10K]
  • Interval between clip frames: --interval [1]
  • Change clip length: --clip_length [6]
  • Increase/decrease lr step: --lr_adj [1.0]
  • For Matterport3D finetuning, you need to set --clip_length 2, because the data are pairs of images.

Evaluation:

1. Generate test results:

Use this script (for testing RealEstate10K):

CUDA_VISIBLE_DEVICES=0 python test_re10k.py --savepath log/model --resume log/model/checkpoint.tar --dataset RealEstate10K

or this script (for testing Matterport3D/Replica):

CUDA_VISIBLE_DEVICES=0 python test_mp3d.py --savepath log/model --resume log/model/checkpoint.tar --dataset Matterport3D

Some optional commands:

  • Select dataset: --dataset [RealEstate10K]
  • Max number of frames: --frame_limit [30]
  • Max number of sequences: --video_limit [100]
  • Use training set to evaluate: --train_set

Running this will generate a output folder where the results (videos and poses) save. If you want to visualize the pose, use packages for evaluation of odometry, such as evo. If you want to quantitatively evaluate the results, see 2.1, 2.2.

2.1 Quantitative Evaluation of synthesis results:

Use this script:

python eval_syn_re10k.py [OUTPUT_DIR] (for RealEstate10K)
python eval_syn_mp3d.py [OUTPUT_DIR] (for Matterport3D)

Optional commands:

  • Evaluate LPIPS: --lpips

2.2 Quantitative Evaluation of pose prediction results:

Use this script:

python eval_pose.py [POSE_DIR]

Contact

For any questions about the code or the paper, you can contact zihang.lai at gmail.com.

Owner
Working from home
This computer program provides a reference implementation of Lagrangian Monte Carlo in metric induced by the Monge patch

This computer program provides a reference implementation of Lagrangian Monte Carlo in metric induced by the Monge patch. The code was prepared to the final version of the accepted manuscript in AIST

Marcelo Hartmann 2 May 06, 2022
Qlib is an AI-oriented quantitative investment platform

Qlib is an AI-oriented quantitative investment platform, which aims to realize the potential, empower the research, and create the value of AI technologies in quantitative investment.

Microsoft 10.1k Dec 30, 2022
Config files for my GitHub profile.

Canalyst Candas Data Science Library Name Canalyst Candas Description Built by a former PM / analyst to give anyone with a little bit of Python knowle

Canalyst Candas 13 Jun 24, 2022
Dataset and Source code of paper 'Enhancing Keyphrase Extraction from Academic Articles with their Reference Information'.

Enhancing Keyphrase Extraction from Academic Articles with their Reference Information Overview Dataset and code for paper "Enhancing Keyphrase Extrac

15 Nov 24, 2022
Unsupervised Real-World Super-Resolution: A Domain Adaptation Perspective

Unofficial pytorch implementation of the paper "Unsupervised Real-World Super-Resolution: A Domain Adaptation Perspective"

16 Nov 21, 2022
MIRACLE (Missing data Imputation Refinement And Causal LEarning)

MIRACLE (Missing data Imputation Refinement And Causal LEarning) Code Author: Trent Kyono This repository contains the code used for the "MIRACLE: Cau

van_der_Schaar \LAB 15 Dec 29, 2022
Simulator for FRC 2022 challenge: Rapid React

rrsim Simulator for FRC 2022 challenge: Rapid React out-1.mp4 Usage In order to run the simulator use the following: python3 rrsim.py [config_path] wh

1 Jan 18, 2022
Continuous Time LiDAR odometry

CT-ICP: Elastic SLAM for LiDAR sensors This repository implements the SLAM CT-ICP (see our article), a lightweight, precise and versatile pure LiDAR o

385 Dec 29, 2022
Detecting Blurred Ground-based Sky/Cloud Images

Detecting Blurred Ground-based Sky/Cloud Images With the spirit of reproducible research, this repository contains all the codes required to produce t

1 Oct 20, 2021
PyTorch implementation of NeurIPS 2021 paper: "CoFiNet: Reliable Coarse-to-fine Correspondences for Robust Point Cloud Registration"

CoFiNet: Reliable Coarse-to-fine Correspondences for Robust Point Cloud Registration (NeurIPS 2021) PyTorch implementation of the paper: CoFiNet: Reli

76 Jan 03, 2023
[CVPR 2022 Oral] EPro-PnP: Generalized End-to-End Probabilistic Perspective-n-Points for Monocular Object Pose Estimation

EPro-PnP EPro-PnP: Generalized End-to-End Probabilistic Perspective-n-Points for Monocular Object Pose Estimation In CVPR 2022 (Oral). [paper] Hanshen

同济大学智能汽车研究所综合感知研究组 ( Comprehensive Perception Research Group under Institute of Intelligent Vehicles, School of Automotive Studies, Tongji University) 842 Jan 04, 2023
Transformers4Rec is a flexible and efficient library for sequential and session-based recommendation, available for both PyTorch and Tensorflow.

Transformers4Rec is a flexible and efficient library for sequential and session-based recommendation, available for both PyTorch and Tensorflow.

730 Jan 09, 2023
An excellent hash algorithm combining classical sponge structure and RNN.

SHA-RNN Recurrent Neural Network with Chaotic System for Hash Functions Anonymous Authors [摘要] 在这次作业中我们提出了一种新的 Hash Function —— SHA-RNN。其以海绵结构为基础,融合了混

Houde Qian 5 May 15, 2022
GAN Image Generator and Characterwise Image Recognizer with python

MODEL SUMMARY 모델의 구조는 크게 6단계로 나뉩니다. STEP 0: Input Image Predict 할 이미지를 모델에 입력합니다. STEP 1: Make Black and White Image STEP 1 은 입력받은 이미지의 글자를 흑색으로, 배경을

Juwan HAN 1 Feb 09, 2022
Intrusion Detection System using ensemble learning (machine learning)

IDS-ML implementation of an intrusion detection system using ensemble machine learning methods Data set This project is carried out using the UNSW-15

4 Nov 25, 2022
PyTorch Implementation of DSB for Score Based Generative Modeling. Experiments managed using Hydra.

Diffusion Schrödinger Bridge with Applications to Score-Based Generative Modeling This repository contains the implementation for the paper Diffusion

James Thornton 50 Jan 03, 2023
MatryODShka: Real-time 6DoF Video View Synthesis using Multi-Sphere Images

Main repo for ECCV 2020 paper MatryODShka: Real-time 6DoF Video View Synthesis using Multi-Sphere Images. visual.cs.brown.edu/matryodshka

Brown University Visual Computing Group 75 Dec 13, 2022
DI-smartcross - Decision Intelligence Platform for Traffic Crossing Signal Control

DI-smartcross DI-smartcross - Decision Intelligence Platform for Traffic Crossin

OpenDILab 213 Jan 02, 2023
A pytorch implementation of Detectron. Both training from scratch and inferring directly from pretrained Detectron weights are available.

Use this instead: https://github.com/facebookresearch/maskrcnn-benchmark A Pytorch Implementation of Detectron Example output of e2e_mask_rcnn-R-101-F

Roy 2.8k Dec 29, 2022
Deep Learning GPU Training System

DIGITS DIGITS (the Deep Learning GPU Training System) is a webapp for training deep learning models. The currently supported frameworks are: Caffe, To

NVIDIA Corporation 4.1k Jan 03, 2023