Authors
- Dongjun Kim*
- Chieh-Hsin Lai
- Wei-Hsiang Liao
- Yuhta Takida
- Naoki Murata
- Toshimitsu Uesaka
- Yuki Mitsufuji
- Stefano Ermon*
* External authors
Venue
- NeurIPS 2024
Date
- 2024
PaGoDA: Progressive Growing of a One-Step Generator from a Low-Resolution Diffusion Teacher
Dongjun Kim*
Stefano Ermon*
* External authors
NeurIPS 2024
2024
Abstract
To accelerate sampling, diffusion models (DMs) are often distilled into generators that directly map noise to data in a single step. In this approach, the resolution of the generator is fundamentally limited by that of the teacher DM. To overcome this limitation, we propose Progressive Growing of Diffusion Autoencoder (PaGoDA), a technique to progressively grow the resolution of the generator beyond that of the original teacher DM. Our key insight is that a pre-trained, low-resolution DM can be used to deterministically encode high-resolution data to a structured latent space by solving the PF-ODE forward in time (data-to-noise), starting from an appropriately down-sampled image. Using this frozen encoder in an auto-encoder framework, we train a decoder by progressively growing its resolution. From the nature of progressively growing decoder, PaGoDA avoids re-training teacher/student models when we upsample the student model, making the whole training pipeline much cheaper. In experiments, we used our progressively growing decoder to upsample from the pre-trained model's 64x64 resolution to generate 512x512 samples, achieving 2x faster inference compared to single-step distilled Stable Diffusion like LCM. PaGoDA also achieved state-of-the-art FIDs on ImageNet across all resolutions from 64x64 to 512x512. Additionally, we demonstrated PaGoDA's effectiveness in solving inverse problems and enabling controllable generation.
Related Publications
Although recent speech processing technologies have achieved significant improvements in objective metrics, there still remains a gap in human perceptual quality. This paper proposes Diffiner, a novel solution that utilizes the powerful generative capability of diffusion mod…
Recent advances in Video-to-Audio (V2A) generation have achieved impressive perceptual quality and temporal synchronization, yet most models remain appearance-driven, capturing visual-acoustic correlations without considering the physical factors that shape real-world sounds…
MeanFlow (MF) is a diffusion-motivated generative model that enables efficient few-step generation by learning long jumps directly from noise to data. In practice, it is often used as a latent MF by leveraging the pre-trained Stable Diffusion variational autoencoder (SD-VAE)…
JOIN US
Shape the Future of AI with Sony AI
We want to hear from those of you who have a strong desire
to shape the future of AI.



