Venue

Date

Share

HQ-VAE: Hierarchical Discrete Representation Learning with Variational Bayes

Yuhta Takida

Yukara Ikemiya

Takashi Shibuya

Kazuki Shimada

Woosung Choi

Chieh-Hsin Lai

Naoki Murata

Toshimitsu Uesaka

Kengo Uchida

Yuki Mitsufuji

Wei-Hsiang Liao

TMLR-2024

2024

Abstract

Vector quantization (VQ) is a technique to deterministically learn features with discrete codebook representations. It is commonly performed with a variational autoencoding model, VQ-VAE, which can be further extended to hierarchical structures for making high-fidelity reconstructions. However, such hierarchical extensions of VQ-VAE often suffer from the codebook/layer collapse issue, where the codebook is not efficiently used to express the data, and hence degrades reconstruction accuracy. To mitigate this problem, we propose a novel unified framework to stochastically learn hierarchical discrete representation on the basis of the variational Bayes framework, called hierarchically quantized variational autoencoder (HQ-VAE). HQ-VAE naturally generalizes the hierarchical variants of VQ-VAE, such as VQ-VAE-2 and residual-quantized VAE (RQ-VAE), and provides them with a Bayesian training scheme. Our comprehensive experiments on image datasets show that HQ-VAE enhances codebook usage and improves reconstruction performance. We also validated HQ-VAE in terms of its applicability to a different modality with an audio dataset.



Related Publications

Diffusion-based Signal Refiner for Speech Enhancement and Separation

IEEE, 2026
Ryosuke Sawata, Masato Hirano*, Naoki Murata, Shusuke Takahashi*, Yuki Mitsufuji

Although recent speech processing technologies have achieved significant improvements in objective metrics, there still remains a gap in human perceptual quality. This paper proposes Diffiner, a novel solution that utilizes the powerful generative capability of diffusion mod…

PAVAS: Physics-Aware Video-to-Audio Synthesis

CVPR, 2026
Oh Hyun-Bin, Yuhta Takida, Toshimitsu Uesaka, Tae-Hyun Oh, Yuki Mitsufuji

Recent advances in Video-to-Audio (V2A) generation have achieved impressive perceptual quality and temporal synchronization, yet most models remain appearance-driven, capturing visual-acoustic correlations without considering the physical factors that shape real-world sounds…

Theory-Informed Improvements to Classifier-Free Guidance for Discrete Diffusion Models

ICLR, 2026
Kevin Rojas, Ye He, Chieh-Hsin Lai, Yuhta Takida, Yuki Mitsufuji, Molei Tao

Classifier-Free Guidance (CFG) is a widely used technique for conditional generation and improving sample quality in continuous diffusion models, and recent works have extended it to discrete diffusion. This paper theoretically analyzes CFG in the context of masked discrete …

  • HOME
  • Publications
  • HQ-VAE: Hierarchical Discrete Representation Learning with Variational Bayes

JOIN US

Shape the Future of AI with Sony AI

We want to hear from those of you who have a strong desire
to shape the future of AI.