SoundCTM: Unifying Score-based and Consistency Models for Full-band Text-to-Sound Generation

Koichi Saito

Dongjun Kim*

Takashi Shibuya

Chieh-Hsin Lai

Zhi Zhong*

Yuhta Takida

Yuki Mitsufuji

* External authors

ICLR, 2025

Abstract

Sound content creation, essential for multimedia works such as video games and films, often involves extensive trial and error, enabling creators to semantically reflect in the sound their artistic ideas and inspirations, which evolve throughout the creation process. Recent high-quality diffusion-based Text-to-Sound (T2S) generative models provide valuable tools for creators. However, these models often suffer from slow inference speeds, imposing an undesirable burden that hinders the trial-and-error process. While existing T2S distillation models address this limitation through 1-step generation, the sample quality of 1-step generation remains insufficient for production use. Additionally, while multi-step sampling in those distillation models improves sample quality itself, the semantic content changes due to their lack of deterministic sampling capabilities. Thus, developing a T2S generative model that allows creators to efficiently conduct trial and error while producing high-quality sound remains a key challenge. To address these issues, we introduce Sound Consistency Trajectory Models (SoundCTM), which allow flexible transitions between high-quality 1-step sound generation and superior sound quality through multi-step deterministic sampling. This allows creators to efficiently conduct trial and error with 1-step generation to semantically align samples with their intention, and subsequently refine sample quality while preserving semantic content through deterministic multi-step sampling. To develop SoundCTM, we reframe the CTM training framework, originally proposed in computer vision, and introduce a novel feature distance using the teacher network for the distillation loss. Additionally, while distilling classifier-free guided trajectories, we introduce ν-sampling, a new algorithm that offers another source of quality improvement. For ν-sampling, we simultaneously train both conditional and unconditional student models. For production-level generation, we scale our model up to 1B trainable parameters, making SoundCTM-DiT-1B the first large-scale distillation model in the sound community to achieve both promising high-quality 1-step and multi-step full-band (44.1 kHz) generation.
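To make the workflow concrete, here is a minimal, hypothetical Python sketch, not the authors' implementation. It assumes ν-sampling takes a classifier-free-guidance-style mixture of the jointly trained conditional and unconditional student networks, and that each student performs a deterministic jump from time t to time s along the trajectory; `g_cond`, `g_uncond`, `nu_step`, `sample`, and the placeholder dynamics are all illustrative names and stand-ins.

```python
import torch

def g_cond(x_t, t, s, text_emb):
    # Placeholder dynamics standing in for the trained text-conditional
    # student network G(x_t, t, s, text): one trajectory jump from t to s.
    return x_t * (s / t)

def g_uncond(x_t, t, s):
    # Placeholder dynamics standing in for the trained unconditional student.
    return x_t * (s / t)

def nu_step(x_t, t, s, text_emb, nu):
    # Assumed form of nu-sampling: a CFG-style mix of the conditional and
    # unconditional students (nu > 1 extrapolates toward the condition).
    return nu * g_cond(x_t, t, s, text_emb) + (1.0 - nu) * g_uncond(x_t, t, s)

def sample(text_emb, schedule, nu=1.5, seed=0, shape=(1, 8, 256)):
    # Deterministic sampler: the same seed with a finer time schedule refines
    # quality while keeping the semantic content of the 1-step draft.
    x = torch.randn(shape, generator=torch.Generator().manual_seed(seed))
    x = x * schedule[0]  # scale the initial noise to the starting time
    for t, s in zip(schedule[:-1], schedule[1:]):
        x = nu_step(x, t, s, text_emb, nu)
    return x

text_emb = None  # placeholder for a text-encoder output
draft = sample(text_emb, schedule=[80.0, 1e-3])             # fast 1-step draft
final = sample(text_emb, schedule=[80.0, 20.0, 5.0, 1e-3])  # multi-step refinement
```

Because both calls start from the same seed and every jump is deterministic, the multi-step pass refines the draft's quality without altering its semantic content, which is the trial-and-error loop the abstract describes.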

Related Publications

TITAN-Guide: Taming Inference-Time Alignment for Guided Text-to-Video Diffusion Models

ICCV, 2025
Christian Simon, Masato Ishii, Akio Hayakawa, Zhi Zhong*, Shusuke Takahashi*, Takashi Shibuya, Yuki Mitsufuji

Recent conditional diffusion models still require heavy supervised fine-tuning to perform control on a category of tasks. Training-free conditioning via guidance with off-the-shelf models is a favorable alternative to avoid further fine-tuning on th…

Transformed Low-rank Adaptation via Tensor Decomposition and Its Applications to Text-to-image Models

ICCV, 2025
Zerui Tao, Yuhta Takida, Naoki Murata, Qibin Zhao*, Yuki Mitsufuji

Parameter-Efficient Fine-Tuning (PEFT) of text-to-image models has become an increasingly popular technique with many applications. Among the various PEFT methods, Low-Rank Adaptation (LoRA) and its variants have gained significant attention due to their effectiveness, enabl…

A Comprehensive Real-World Assessment of Audio Watermarking Algorithms: Will They Survive Neural Codecs?

Interspeech, 2025
Yigitcan Özer, Woosung Choi, Joan Serrà, Mayank Kumar Singh*, Wei-Hsiang Liao, Yuki Mitsufuji

We introduce the Robust Audio Watermarking Benchmark (RAW-Bench), a benchmark for evaluating deep learning-based audio watermarking methods with standardized and systematic comparisons. To simulate real-world usage, we introduce a comprehensive audio attack pipeline with var…
