Disentangling Mixtures of Musical Instruments for Source-level Pitch and Timbre Manipulation
Authors
- Yin-Jyun Luo
- Kin Wai Cheuk
- Woosung Choi
- Toshimitsu Uesaka
- Keisuke Toyama*
- Koichi Saito
- Chieh-Hsin Lai
- Yuhta Takida
- Wei-Hsiang Liao
- Simon Dixon
- Yuki Mitsufuji
* External authors
Venue
- NeurIPS-24
Date
- 2024
Abstract
Existing work on pitch and timbre disentanglement has mostly focused on single-instrument music audio, excluding cases where multiple instruments are present. To fill the gap, we propose DisMix, a generative framework in which pitch and timbre representations act as modular building blocks for constructing the melody and instrument of a source, and whose collection forms a set of per-instrument latent representations underlying the observed mixture. By manipulating these representations, our model samples mixtures with novel combinations of pitch and timbre for the constituent instruments. We jointly learn the disentangled pitch-timbre representations and a latent diffusion transformer that reconstructs the mixture conditioned on the set of source-level representations. We evaluate the model using both a simple dataset of isolated chords and realistic four-part chorales in the style of J.S. Bach, identify the key components for successful disentanglement, and demonstrate the application of mixture transformation based on source-level attribute manipulation.
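The abstract describes per-source pitch and timbre latents that are fused into source-level representations, whose set then conditions a latent diffusion transformer that reconstructs the mixture. The sketch below is a minimal PyTorch illustration of that conditioning structure only, not the authors' implementation; the module names, dimensions, and use of cross-attention over source tokens are assumptions made for clarity.

```python
# Minimal sketch (assumptions, not the DisMix implementation): per-source pitch
# and timbre latents are fused into one conditioning token per source, and a
# transformer denoiser over the mixture latent cross-attends to that token set.
import torch
import torch.nn as nn


class SourceLevelConditioner(nn.Module):
    """Fuses a pitch latent and a timbre latent into one token per source."""

    def __init__(self, pitch_dim=64, timbre_dim=64, model_dim=256):
        super().__init__()
        self.proj = nn.Linear(pitch_dim + timbre_dim, model_dim)

    def forward(self, pitch, timbre):
        # pitch:  (batch, n_sources, pitch_dim)
        # timbre: (batch, n_sources, timbre_dim)
        return self.proj(torch.cat([pitch, timbre], dim=-1))  # (batch, n_sources, model_dim)


class ConditionalDenoiser(nn.Module):
    """Transformer over mixture latents, cross-attending to the source-token set."""

    def __init__(self, model_dim=256, n_heads=4, n_layers=4):
        super().__init__()
        layer = nn.TransformerDecoderLayer(model_dim, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)

    def forward(self, noisy_mixture_latent, source_tokens):
        # noisy_mixture_latent: (batch, seq_len, model_dim), the diffused mixture latent
        # source_tokens:        (batch, n_sources, model_dim), the conditioning set
        return self.decoder(noisy_mixture_latent, source_tokens)


# Toy usage with four sources (e.g. four chorale voices). Swapping the timbre
# latent of one source before building the tokens is the kind of source-level
# manipulation the abstract refers to.
conditioner = SourceLevelConditioner()
denoiser = ConditionalDenoiser()
pitch = torch.randn(1, 4, 64)
timbre = torch.randn(1, 4, 64)
tokens = conditioner(pitch, timbre)        # (1, 4, 256)
noisy = torch.randn(1, 32, 256)            # toy diffused mixture latent
denoised = denoiser(noisy, tokens)         # (1, 32, 256)
```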