Authors
- WeiHsiang Liao
- Yuhta Takida
- Yukara Ikemiya
- Zhi Zhong*
- Chieh-Hsin Lai
- Giorgio Fabbro*
- Kazuki Shimada
- Keisuke Toyama*
- Kinwai Cheuk
- Marco A. Martínez-Ramírez
- Shusuke Takahashi*
- Stefan Uhlich*
- Taketo Akama*
- Woosung Choi
- Yuichiro Koyama*
- Yuki Mitsufuji
* External authors
Venue
- TMLR
Date
- 2025
Music Foundation Model as Generic Booster for Music Downstream Tasks
Abstract
We demonstrate the efficacy of using intermediate representations from a single foundation model to enhance various music downstream tasks. We introduce SoniDo, a music foundation model (MFM) designed to extract hierarchical features from target music samples. By leveraging hierarchical intermediate features, SoniDo constrains the information granularity, leading to improved performance across various downstream tasks, including both understanding and generative tasks. We evaluated this approach on representative tasks such as music tagging, music transcription, music source separation, and music mixing. Our results show that features extracted from a foundation model provide valuable enhancements when training downstream task models, highlighting their potential as a generic booster for music downstream tasks. Our approach not only benefits existing task-specific models but also supports music downstream tasks constrained by data scarcity. This paves the way for more effective and accessible music processing solutions.
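The abstract describes the overall recipe: extract intermediate features at several hierarchy levels from a pretrained foundation model, then feed them to a downstream task model in place of (or alongside) raw audio. The sketch below illustrates that data flow only; the function names, feature shapes, and the random linear probe are hypothetical stand-ins, since SoniDo's actual interface is not described in this abstract.

```python
import numpy as np

rng = np.random.default_rng(0)

def extract_hierarchical_features(audio, num_levels=3, dim=16):
    """Hypothetical stand-in for a music foundation model.

    Each "level" yields a (time, dim) feature map at a different
    granularity of the hierarchy. Real MFM features would come from
    the model's intermediate layers, not random numbers.
    """
    t = len(audio) // 4
    return [rng.standard_normal((t, dim)) for _ in range(num_levels)]

def pool_and_concat(features):
    # Time-average each hierarchy level, then concatenate, so the
    # downstream model sees every level of information granularity.
    return np.concatenate([f.mean(axis=0) for f in features])

# A downstream task head (here: an untrained linear probe standing in
# for a music-tagging model) consumes the pooled foundation-model
# features instead of raw audio.
audio = rng.standard_normal(1024)
feats = pool_and_concat(extract_hierarchical_features(audio))

num_tags = 10
probe_weights = rng.standard_normal((feats.shape[0], num_tags))
tag_logits = feats @ probe_weights
print(feats.shape, tag_logits.shape)  # (48,) (10,)
```

In practice the probe would be trained on labeled data for the target task; the point of the sketch is that the same pooled features can feed many different task heads.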
Submission Length: Regular submission (no more than 12 pages of main content)