Making a music video with Wav2CLIP and VQGAN-CLIP

Last update: Dec 26, 2022

Related tags

Deep Learning music2video

Overview

music2video Overview

A repo for making a music video with Wav2CLIP and VQGAN-CLIP.

The base code was derived from VQGAN-CLIP The CLIP embedding for audio was derived from Wav2CLIP

Environment:

Tested on Ubuntu 20.04
GPU: Nvidia RTX 3090
Typical VRAM requirements:
- 24 GB for a 900x900 image
- 10 GB for a 512x512 image
- 8 GB for a 380x380 image

Set up

This example uses Anaconda to manage virtual Python environments.

Create a new virtual Python environment for VQGAN-CLIP:

conda create --name vqgan python=3.9
conda activate vqgan

Install Pytorch in the new enviroment:

Note: This installs the CUDA version of Pytorch, if you want to use an AMD graphics card, read the AMD section below.

pip install torch==1.9.0+cu111 torchvision==0.10.0+cu111 torchaudio==0.9.0 -f https://download.pytorch.org/whl/torch_stable.html

Install other required Python packages:

pip install ftfy regex tqdm omegaconf pytorch-lightning IPython kornia imageio imageio-ffmpeg einops torch_optimizer wav2clip

Or use the requirements.txt file, which includes version numbers.

Clone required repositories:

git clone 'https://github.com/nerdyrodent/VQGAN-CLIP'
cd VQGAN-CLIP
git clone 'https://github.com/openai/CLIP'
git clone 'https://github.com/CompVis/taming-transformers'

Note: In my development environment both CLIP and taming-transformers are present in the local directory, and so aren't present in the requirements.txt or vqgan.yml files.

As an alternative, you can also pip install taming-transformers and CLIP.

You will also need at least 1 VQGAN pretrained model. E.g.

mkdir checkpoints

curl -L -o checkpoints/vqgan_imagenet_f16_16384.yaml -C - 'https://heibox.uni-heidelberg.de/d/a7530b09fed84f80a887/files/?p=%2Fconfigs%2Fmodel.yaml&dl=1' #ImageNet 16384
curl -L -o checkpoints/vqgan_imagenet_f16_16384.ckpt -C - 'https://heibox.uni-heidelberg.de/d/a7530b09fed84f80a887/files/?p=%2Fckpts%2Flast.ckpt&dl=1' #ImageNet 16384

Note that users of curl on Microsoft Windows should use double quotes.

The download_models.sh script is an optional way to download a number of models. By default, it will download just 1 model.

See https://github.com/CompVis/taming-transformers#overview-of-pretrained-models for more information about VQGAN pre-trained models, including download links.

By default, the model .yaml and .ckpt files are expected in the checkpoints directory. See https://github.com/CompVis/taming-transformers for more information on datasets and models.

Run

To generate video from music, specify your music as shown in the example below:

python generate.py -vid -i 200 -vl 5 -o outputs/output.png -ap "music_sample/meeting_easy.wav" -gid 0

python generate.py -vid -i 200 -vl 5 -o outputs2/output.png -ap "music_sample/merry_go_round.wav" -gid 0

Citations

@misc{unpublished2021clip,
    title  = {CLIP: Connecting Text and Images},
    author = {Alec Radford, Ilya Sutskever, Jong Wook Kim, Gretchen Krueger, Sandhini Agarwal},
    year   = {2021}
}

@misc{esser2020taming,
      title={Taming Transformers for High-Resolution Image Synthesis}, 
      author={Patrick Esser and Robin Rombach and Björn Ommer},
      year={2020},
      eprint={2012.09841},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

@article{wu2021wav2clip,
  title={Wav2CLIP: Learning Robust Audio Representations From CLIP},
  author={Wu, Ho-Hsiang and Seetharaman, Prem and Kumar, Kundan and Bello, Juan Pablo},
  journal={arXiv preprint arXiv:2110.11499},
  year={2021}
}

Making a music video with Wav2CLIP and VQGAN-CLIP

Related tags

Overview

music2video Overview

Set up

Run

Citations

Owner

Joel Jang | 장요엘

[TIP 2020] Multi-Temporal Scene Classification and Scene Change Detection with Correlation based Fusion

Companion code for "Bayesian logistic regression for online recalibration and revision of risk prediction models with performance guarantees"

Explaining Hyperparameter Optimization via PDPs

Generating retro pixel game characters with Generative Adversarial Networks. Dataset "TinyHero" included.

On Nonlinear Latent Transformations for GAN-based Image Editing - PyTorch implementation

[NeurIPS 2021] ORL: Unsupervised Object-Level Representation Learning from Scene Images

Deep Ensemble Learning with Jet-Like architecture

VQGAN+CLIP Colab Notebook with user-friendly interface.

Mining-the-Social-Web-3rd-Edition - The official online compendium for Mining the Social Web, 3rd Edition (O'Reilly, 2018)

Facial detection, landmark tracking and expression transfer library for Windows, Linux and Mac

Use tensorflow to implement a Deep Neural Network for real time lane detection

Inference pipeline for our participation in the FeTA challenge 2021.

基于PaddleClas实现垃圾分类，并转换为inference格式用PaddleHub服务端部署

A GUI for Face Recognition, based upon Docker, Tkinter, GPU and a camera device.

Deep Learning and Reinforcement Learning Library for Scientists and Engineers 🔥

Constraint-based geometry sketcher for blender

Use unsupervised and supervised learning to predict stocks

These are the materials for the paper "Few-Shot Out-of-Domain Transfer Learning of Natural Language Explanations"

Contenido del curso Bases de datos del DCC PUC versión 2021-2

kullanışlı ve işinizi kolaylaştıracak bir araç