Video Background Music Generation with Controllable Music Transformer (ACM MM 2021 Oral)

Last update: Dec 27, 2022

Overview

CMT

Code for the paper Video Background Music Generation with Controllable Music Transformer (ACM MM 2021 Best Paper Award)

[Paper] [Site]

Directory Structure

src/: code for the whole pipeline
- train.py: training script; takes an npz file of music data as input to train the model
- model.py: code for the model
- gen_midi_conditional.py: inference script; takes an npz file (representing a video) as input and generates several songs
- src/video2npz/: converts a video into npz by extracting motion saliency and motion speed
dataset/: processed dataset for training, in npz format
logs/: logs generated automatically during training; can be used to track training progress
exp/: checkpoints, named after validation loss (e.g. loss_13_params.pt)
inference/: processed videos for inference (.npz) and generated music (.mid)
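
Because checkpoints are named after their validation loss, the best one can be picked from the file names alone. The following is a minimal sketch, assuming every checkpoint follows the `loss_<integer>_params.pt` pattern of the example above and sits directly in the scanned directory; `best_checkpoint` and `best_checkpoint_in` are hypothetical helper names, not part of the repository.

```python
import re
from pathlib import Path

# Assumed naming pattern, taken from the example "loss_13_params.pt".
CKPT_PATTERN = re.compile(r"^loss_(\d+)_params\.pt$")


def best_checkpoint(names):
    """Return the file name with the lowest loss value, or None if none match."""
    scored = []
    for name in names:
        match = CKPT_PATTERN.match(name)
        if match:
            scored.append((int(match.group(1)), name))
    return min(scored)[1] if scored else None


def best_checkpoint_in(directory):
    """Scan a directory (e.g. exp/) for checkpoint files, non-recursively."""
    return best_checkpoint(p.name for p in Path(directory).glob("*.pt"))
```

For example, `best_checkpoint_in("exp")` could be used to choose the file passed to the inference script.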

Preparation

clone this repo
download lpd_5_prcem_mix_v8_10000.npz from HERE and put it under dataset/
download pretrained model loss_8_params.pt from HERE and put it under exp/
install ffmpeg=3.2.4
prepare a Python3 conda environment
```
pip install -r py3_requirements.txt
```
prepare a Python2 conda environment (for extracting visbeat)
```
pip install -r py2_requirements.txt
```
- open the visbeat package directory (e.g. anaconda3/envs/XXXX/lib/python2.7/site-packages/visbeat) and replace the original Video_CV.py with src/video2npz/Video_CV.py
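
Before training or inference, it can help to confirm that the downloads landed in the right place. The sketch below checks only the two files and the ffmpeg binary named in the steps above; `missing_prerequisites` is a hypothetical helper, and it does not verify the ffmpeg version or the conda environments.

```python
import shutil
from pathlib import Path

# Paths taken from the preparation steps, relative to the repository root.
REQUIRED_FILES = [
    "dataset/lpd_5_prcem_mix_v8_10000.npz",
    "exp/loss_8_params.pt",
]


def missing_prerequisites(repo_root=".", which=shutil.which):
    """Return a list describing every prerequisite that could not be found."""
    root = Path(repo_root)
    missing = [rel for rel in REQUIRED_FILES if not (root / rel).is_file()]
    # `which` is injectable so the check can be exercised without ffmpeg installed.
    if which("ffmpeg") is None:
        missing.append("ffmpeg (not found on PATH)")
    return missing
```

Calling `missing_prerequisites()` from the repository root returns an empty list when everything is in place.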

Training

If you want to use another training set, convert the training data from midi into npz under dataset/
```
python midi2numpy_mix.py --midi_dir /PATH/TO/MIDIS/ --out_name data.npz
```

train the model

```
python train.py -n XXX -g 0 1 2 3

# -n XXX: the name of the experiment; used as the name of the log file and of the checkpoints directory. If XXX is 'debug', checkpoints will not be saved
# -l (--lr): initial learning rate
# -b (--batch_size): batch size
# -p (--path): if given, load a model checkpoint from this path
# -e (--epochs): number of training epochs
# -t (--train_data): path to the training data (.npz file)
# -g (--gpus): GPU ids
# other model hyperparameters: modify the source .py files
```
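
When launching several experiments, the command above can be assembled programmatically. This is a rough sketch that uses only the short flags documented above; `build_train_command` and its parameter names are hypothetical, and it only builds the argument list rather than reproducing anything inside train.py.

```python
import shlex


def build_train_command(name, gpus, lr=None, batch_size=None,
                        epochs=None, train_data=None, checkpoint=None):
    """Build the argument list for train.py from the documented flags."""
    cmd = ["python", "train.py", "-n", name]
    optional = [("-l", lr), ("-b", batch_size), ("-e", epochs),
                ("-t", train_data), ("-p", checkpoint)]
    for flag, value in optional:
        if value is not None:  # omitted flags fall back to the script's defaults
            cmd += [flag, str(value)]
    cmd += ["-g"] + [str(g) for g in gpus]
    return cmd


print(shlex.join(build_train_command("XXX", [0, 1, 2, 3])))
# python train.py -n XXX -g 0 1 2 3
```

The returned list can be passed to `subprocess.run(..., check=True)` from within src/.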

Inference

convert input video (MP4 format) into npz (use the Python2 environment)
```
cd src/video2npz
sh video2npz.sh ../../videos/xxx.mp4
```
- try resizing the video if this takes a long time
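
One way to resize is with ffmpeg's `scale` filter. The sketch below only builds the command; the 360-pixel target height is an arbitrary assumption rather than a value recommended by the authors, and `build_resize_command` is a hypothetical helper.

```python
def build_resize_command(src, dst, height=360):
    """Build an ffmpeg command that scales a video to the given height.

    A width of -2 asks ffmpeg to keep the aspect ratio and round the
    width to an even number, which most encoders require.
    """
    if height <= 0 or height % 2:
        raise ValueError("height must be a positive even number")
    return ["ffmpeg", "-i", src, "-vf", f"scale=-2:{height}", dst]
```

The list can be run with `subprocess.run(build_resize_command("xxx.mp4", "xxx_small.mp4"), check=True)` before calling video2npz.sh on the smaller file.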

run the model to generate a .mid file:

```
python gen_midi_conditional.py -f "../inference/xxx.npz" -c "../exp/loss_8_params.pt"

# -c: checkpoint to be loaded
# -f: input npz file
# -g: GPU id (only one GPU is needed for inference)
```

if using another training set, change decoder_n_class in gen_midi_conditional.py to match the decoder_n_class in train.py

convert midi into audio: use GarageBand (recommended) or midi2audio
- set tempo to the value of tempo in video2npz/metadata.json
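
The tempo can be read from the metadata file instead of being copied by hand. This is a minimal sketch, assuming metadata.json is a JSON object with a top-level `tempo` key, as the step above suggests; `read_tempo` is a hypothetical helper and the default path is relative to src/.

```python
import json


def read_tempo(metadata_path="video2npz/metadata.json"):
    """Return the tempo stored in the metadata file written by video2npz."""
    with open(metadata_path, encoding="utf-8") as f:
        metadata = json.load(f)
    # Assumption: the value lives under a top-level "tempo" key.
    return metadata["tempo"]
```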

combine original video and audio into video with BGM

```
ffmpeg -i 'xxx.mp4' -i 'yyy.mp3' -c:v copy -c:a aac -strict experimental -map 0:v:0 -map 1:a:0 'zzz.mp4'

# xxx.mp4: input video
# yyy.mp3: audio file generated in the previous step
# zzz.mp4: output video
```
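
For batch processing, the same ffmpeg call can be wrapped in a small Python function. The sketch below reuses the exact options from the command above; `build_mux_command` and `add_bgm` are hypothetical names, and ffmpeg is assumed to be on the PATH.

```python
import subprocess


def build_mux_command(video, audio, output):
    """Build the ffmpeg command that copies the video stream and encodes the audio as AAC."""
    return ["ffmpeg", "-i", video, "-i", audio,
            "-c:v", "copy", "-c:a", "aac", "-strict", "experimental",
            "-map", "0:v:0", "-map", "1:a:0", output]


def add_bgm(video, audio, output):
    """Run ffmpeg and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_mux_command(video, audio, output), check=True)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.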

Owner

Zhaokai Wang
