TITAN-Guide: Taming Inference-Time Alignment for Guided Text-to-Video Diffusion Models

Christian Simon

Masato Ishii

Akio Hayakawa

Zhi Zhong*

Shusuke Takahashi*

Takashi Shibuya

Yuki Mitsufuji

* External authors

ICCV-25

2025

Abstract

Despite recent progress, conditional diffusion models still require heavy supervised fine-tuning to perform control over a category of tasks. Training-free conditioning via guidance with off-the-shelf models is a favorable alternative that avoids further fine-tuning of the base model. However, existing training-free guidance frameworks either have heavy memory requirements or provide sub-optimal control due to rough estimation. These shortcomings limit their applicability to diffusion models that require intense computation, such as Text-to-Video (T2V) diffusion models. In this work, we propose Taming Inference-Time Alignment for Guided Text-to-Video Diffusion Models, termed TITAN-Guide, which overcomes memory space issues and provides more optimal control in the guidance process than its counterparts. Specifically, we develop an efficient method for optimizing diffusion latents without backpropagation from a discriminative guiding model. In particular, we study forward gradient descent for guided diffusion tasks with various options on directional directives. In our experiments, we demonstrate the effectiveness of our approach in efficiently managing memory during latent optimization, where previous methods fall short. Our proposed approach not only minimizes memory requirements but also significantly enhances T2V performance across a range of diffusion guidance benchmarks.
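
To make the idea of optimizing latents without backpropagation more concrete, the following is a minimal sketch of forward gradient descent in plain Python. It is not the TITAN-Guide implementation. The names `Dual`, `guidance_loss`, `TARGET`, `directional_derivative`, and `forward_gradient_descent` are hypothetical, the quadratic loss is a toy stand-in for a discriminative guiding model, and the random sign (Rademacher) directions are only one possible choice of direction. The sketch obtains a directional derivative in a single forward pass using dual numbers, so no computation graph is stored, and then steps along the sampled direction scaled by that derivative.

```python
import random


class Dual:
    """A value paired with its derivative along one chosen direction."""

    def __init__(self, val, tan=0.0):
        self.val, self.tan = val, tan

    @staticmethod
    def lift(x):
        # Promote plain numbers to Dual with zero tangent.
        return x if isinstance(x, Dual) else Dual(float(x))

    def __add__(self, other):
        o = Dual.lift(other)
        return Dual(self.val + o.val, self.tan + o.tan)

    __radd__ = __add__

    def __sub__(self, other):
        o = Dual.lift(other)
        return Dual(self.val - o.val, self.tan - o.tan)

    def __mul__(self, other):
        o = Dual.lift(other)
        # Product rule carries the tangent forward.
        return Dual(self.val * o.val, self.tan * o.val + self.val * o.tan)

    __rmul__ = __mul__


# Hypothetical target; a real guiding model would score a decoded sample.
TARGET = [1.0, -2.0, 0.5, 3.0]


def guidance_loss(latent):
    """Toy squared-error loss standing in for a guiding model."""
    return sum((z - t) * (z - t) for z, t in zip(latent, TARGET))


def directional_derivative(f, x, v):
    """Return f(x) and the derivative of f at x along v in one forward pass."""
    out = f([Dual(xi, vi) for xi, vi in zip(x, v)])
    return out.val, out.tan


def forward_gradient_descent(f, x, steps=200, lr=0.1, seed=0):
    """Update x using (directional derivative) * direction as the gradient estimate."""
    rng = random.Random(seed)
    x = list(x)
    for _ in range(steps):
        # Random sign direction: an assumption of this sketch.
        v = [rng.choice((-1.0, 1.0)) for _ in x]
        _, d = directional_derivative(f, x, v)
        x = [xi - lr * d * vi for xi, vi in zip(x, v)]
    return x


start = [0.0, 0.0, 0.0, 0.0]
final = forward_gradient_descent(guidance_loss, start)
print("loss before:", guidance_loss(start))
print("loss after: ", guidance_loss(final))
```

In this toy setting each update needs only one forward evaluation of the loss, which is the property that makes forward-mode estimates attractive when storing activations for backpropagation through a large guiding model would be too costly.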

Related Publications

Transformed Low-rank Adaptation via Tensor Decomposition and Its Applications to Text-to-image Models

ICCV, 2025
Zerui Tao, Yuhta Takida, Naoki Murata, Qibin Zhao*, Yuki Mitsufuji

Parameter-Efficient Fine-Tuning (PEFT) of text-to-image models has become an increasingly popular technique with many applications. Among the various PEFT methods, Low-Rank Adaptation (LoRA) and its variants have gained significant attention due to their effectiveness, enabl…

A Comprehensive Real-World Assessment of Audio Watermarking Algorithms: Will They Survive Neural Codecs?

Interspeech, 2025
Yigitcan Özer, Woosung Choi, Joan Serrà, Mayank Kumar Singh*, Wei-Hsiang Liao, Yuki Mitsufuji

We introduce the Robust Audio Watermarking Benchmark (RAW-Bench), a benchmark for evaluating deep learning-based audio watermarking methods with standardized and systematic comparisons. To simulate real-world usage, we introduce a comprehensive audio attack pipeline with var…

Training Consistency Models with Variational Noise Coupling

ICML, 2025
Gianluigi Silvestri, Luca Ambrogioni, Chieh-Hsin Lai, Yuhta Takida, Yuki Mitsufuji

Consistency Training (CT) has recently emerged as a promising alternative to diffusion models, achieving competitive performance in image generation tasks. However, non-distillation consistency training often suffers from high variance and instability, and analyzing and impr…
