VMM: Video-Music Mamba for Generating Background Music from Videos

1School of Computer Science and Technology, Xidian University, South Taibai Road No.2, Xi'an, 710071, Shaanxi, China
2Guangzhou Institute of Technology, Xidian University, Zhiming Road No.83, Guangzhou, 510555, Guangdong, China
3School of Statistics, Xi’an University of Finance and Economics, Changning Street, Xi’an, 710100, Shaanxi, China

Corresponding author

Abstract

This paper addresses video background music generation, focusing on symbolic music. We propose a hybrid modeling approach that jointly considers long-term and short-term musical structures. Existing research focuses predominantly on long-term patterns in music and has largely overlooked the role of short-term structures in shaping musical expressiveness. Short-term structures provide intrinsic logic and emotional tension through dynamic variation and immediacy, and they work together with long-term structures to form cohesive musical narratives. To this end, we propose the VMM model, which integrates Mamba and Transformer: the Mamba module captures long-term dependencies via state space modeling, while the Transformer models short-term interactions among local notes through self-attention. To further enhance multimodal understanding, we introduce the Video-Music Generation Framework (VMGF), which incorporates a Switch Schedule mechanism that dynamically selects fusion strategies between video features and chord features during training and mitigates multimodal representation conflicts through gradient control. Experiments demonstrate that VMM achieves state-of-the-art performance in symbolic music generation, excelling in metrics such as structural coherence and emotional consistency.
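
The division of labor described in the abstract can be illustrated with a toy NumPy sketch. It is not the VMM implementation: the state-space part is a plain diagonal linear recurrence without Mamba's input-dependent (selective) parameters, the attention is single-head with no learned projections, and stacking the two sequentially with residual connections is an assumption. The names `ssm_scan`, `local_causal_attention`, and `hybrid_block` are hypothetical.

```python
import numpy as np

def ssm_scan(x, a, b, c):
    """Diagonal linear state-space recurrence (a simplified stand-in for Mamba).

    x: (T, D) token features; a, b, c: (D, N) per-channel state parameters.
    h_t = a * h_{t-1} + b * x_t,  y_t = sum_n c * h_t
    """
    T, D = x.shape
    h = np.zeros_like(a, dtype=float)
    y = np.empty((T, D))
    for t in range(T):
        h = a * h + b * x[t][:, None]   # state carries information across all past steps
        y[t] = (c * h).sum(axis=1)
    return y

def local_causal_attention(x, window):
    """Single-head self-attention where token t attends only to tokens t-window+1..t."""
    T, D = x.shape
    scores = x @ x.T / np.sqrt(D)
    idx = np.arange(T)
    mask = (idx[None, :] <= idx[:, None]) & (idx[None, :] > idx[:, None] - window)
    scores = np.where(mask, scores, -np.inf)
    scores -= scores.max(axis=1, keepdims=True)   # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ x

def hybrid_block(x, a, b, c, window=4):
    """Assumed composition: long-term SSM path, then short-term local attention."""
    long_term = x + ssm_scan(x, a, b, c)
    return long_term + local_causal_attention(long_term, window)
```

In this sketch the recurrence lets every output depend on the entire preceding sequence, while the attention window restricts direct interactions to a few neighboring tokens, which mirrors the long-term and short-term roles the abstract assigns to the two modules.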

Training pipeline of the VMGF

Inference pipeline of the VMGF

Generation Results on the MuVi-Sync Dataset

The choice of soundfont significantly affects how the generated BGM sounds. To ensure a fair comparison, all samples were rendered with the same soundfont file (default_sound_font.sf2).
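
One way to keep the soundfont fixed across all samples is to render every MIDI file through the same command. The sketch below assumes the FluidSynth command-line tool is installed and on the PATH. The page does not say which synthesizer was used, so FluidSynth, the 44100 Hz sample rate, and the helper names `fluidsynth_command` and `render_all` are all illustrative assumptions.

```python
import subprocess
from pathlib import Path

def fluidsynth_command(midi_path, wav_path,
                       soundfont="default_sound_font.sf2", sample_rate=44100):
    """Build a FluidSynth command that renders one MIDI file to WAV.

    -n: no MIDI input driver, -i: no interactive shell,
    -F: render to this audio file, -r: sample rate in Hz.
    """
    return ["fluidsynth", "-ni", "-F", str(wav_path), "-r", str(sample_rate),
            str(soundfont), str(midi_path)]

def render_all(midi_paths, out_dir, soundfont="default_sound_font.sf2"):
    """Render every MIDI file with the same soundfont (assumed workflow)."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for midi in map(Path, midi_paths):
        wav = out_dir / (midi.stem + ".wav")
        subprocess.run(fluidsynth_command(midi, wav, soundfont), check=True)
```

Calling `render_all(["ground_truth.mid", "generated.mid"], "rendered")` (hypothetical file names) would produce both WAV files with an identical instrument mapping, so audible differences come from the notes rather than from the synthesizer setup.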

Ground Truth

Generated Results

Ground Truth

Generated Results

Ground Truth

Generated Results