On Writing The Principles of Diffusion Models, A Q&A With Sony AI Researcher, Jesse Lai

Imaging & Sensing

Sony AI

March 5, 2026

Introduction

Diffusion models have become a go-to approach for high-quality generation; however, the field can be challenging to navigate once the paper titles and acronyms begin to pile up. In this interview Q&A, Chieh-Hsin “Jesse” Lai (Sony AI) walks through why he and his co-authors wrote “The Principles of Diffusion Models,” what they wanted readers to be able to do after finishing it, and how they think about what “lasts” in a research area that keeps reinventing itself.

The goal for the book is to give people a stable map through that jungle—something you can come back to as the techniques evolve—because “AI moves incredibly fast… But the underlying principles and fundamental ideas tend to last.”

Q: Why did you decide to write this book? And why does the industry need it?

A: The initial draft grew out of the lecture notes from an invited summer course I taught at National Chiao Tung University in Taiwan. Afterward, I felt it would be a waste to let the notes just sit there, so I spoke with Yang Song (then at OpenAI) and Professor Stefano Ermon (Stanford), both pioneers in diffusion models, about making the material public. They had been thinking along similar lines and were very supportive of the idea. After that, Yang, Dongjun (who had been my intern and by then was a postdoc in Stefano's group), Stefano, and I worked very hard for about a year, iterating many times to refine the draft and make it as complete and polished as possible.

Q: Who was the audience you were intending the book for?

A: I think the audience could be quite wide. We expect it to serve as a guideline or textbook for senior undergraduates and graduate students, and as a handbook for junior researchers, something they can check back with instead of re-reading multiple papers. So we were targeting a wide range of readers and wanted to maximize the book's usefulness.

Q: Do you envision it being a living book that evolves over time? As research evolves will you continuously update it?

A: Oh yes, definitely. To be honest, I don't want to touch the overall structure too much, because the material in the book is meant to capture principles, and principles are time-invariant: they are lessons that last. The overall structure won't change, but we definitely want to add to it bit by bit over time and make it a long-term book.

Q: It can sometimes feel like a “jungle of acronyms.” What was the moment you realized this field needed a clearer, more unified explanation?

A: I started to feel the “jungle of acronyms” problem quite early—especially around 2021–2022—when diffusion models began to bloom rapidly and people from many different backgrounds (statistics, physics, applied math, signal processing, etc.) jumped in and pushed the field forward. That diversity has been a huge reason for the progress, but it also made the landscape fragment quickly: different communities reinvented similar ideas with different names, notations, and narratives.

The real turning point for me was when I had to explain diffusion models in many different settings—invited talks, conference tutorials, summer courses, and then just mentoring students and chatting with researchers day to day. I kept hearing the same questions, even from really strong students and researchers:

Which papers are actually saying the same thing under different names?

What's the underlying principle behind all these variants?

And why do we end up with so many objectives and samplers that look different but feel closely related?

At the same time, there were already excellent materials, tutorials and blog posts—Meta’s Flow Matching guide and code is a great example—but much of the existing material still felt like a re-organization of the original papers, rather than a unified explanation that extracts the most fundamental principles and rubrics. And the papers themselves, while brilliant, are often written “locally”: each introduces its own notation and motivations tailored to that specific contribution, which makes the field hard to navigate as a whole.

Because I come from a math background, I’m always looking for the simplest “core story.” And honestly, I think diffusion models have a very simple one. Generative modeling is basically: you start from random noise, and you learn how to turn it into something realistic.

Q: If a reader could walk away remembering just one core idea about diffusion models, what do you hope that idea is?

A: If I had to boil diffusion models down to one core idea, it’s this: generation is learned as a gradual transformation from noise to data.

The key trick diffusion introduces is that we first set up a forward corruption process—we intentionally keep adding a little bit of noise to real data until it becomes something simple and universal, basically white noise. That gives us a common, easy-to-handle space.

Then the model learns the reverse process: how to undo that corruption step by step, gradually turning noise back into realistic samples. You can think of it as learning time-varying dynamics, like a flow that moves probability mass from noise toward the data distribution.
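As a toy illustration of that forward corruption (a minimal NumPy sketch, not code from the book): repeatedly blending samples with Gaussian noise in a variance-preserving way drives any starting distribution, here a clearly bimodal one, toward standard white noise.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "data": samples from a bimodal, clearly non-Gaussian distribution.
x = np.concatenate([rng.normal(-3, 0.2, 5000), rng.normal(3, 0.2, 5000)])

# Forward corruption: at each step, keep most of the signal and add a
# matching amount of Gaussian noise (a DDPM-style variance-preserving step).
alpha = 0.95
for _ in range(200):
    x = np.sqrt(alpha) * x + np.sqrt(1 - alpha) * rng.normal(size=x.shape)

# After many steps the samples are statistically indistinguishable from
# white noise: mean near 0, variance near 1, and the two modes are gone.
print(float(x.mean()), float(x.var()))
```

Reversing this corruption step by step is exactly what the learned model does; the forward direction needs no learning at all, which is what makes noise such a convenient anchor.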

That’s quite different from other generative models like VAEs, GANs, or autoregressive models used in LLMs, which often try to generate everything in one shot or token-by-token. Diffusion’s strength is this coarse-to-fine generation: it breaks a very hard problem into many small, manageable steps. And honestly, that’s also how humans create—we start with a rough sketch, then refine it into something detailed.

Q: You show that DDPMs, score-based models, and flow-based models are really different views of the same underlying process. Why do you think the field evolved with so many names before this connection became clear?

A: To me, that’s a pretty natural way for research to evolve. In the beginning, a clever idea shows up, and it works well, but people don’t fully understand the full picture yet. Then, as more researchers from different backgrounds jump in, they bring their own language and mental models. So the same underlying concept gets rediscovered and described from different angles, with different names.

Over time, as results accumulate and people compare perspectives, the connections become clearer, and the unified structure starts to emerge.

The real “click” moment was late 2020, when Yang unified everything with the continuous-time differential equation view—the Score-SDE framework. I think it was a key milestone because it showed that DDPM diffusion, score-based modeling, and continuous-time dynamics are basically the same story, just written in different languages.

More importantly, Score-SDE brought diffusion back to classic math—differential equations. And once you see it that way, you can leverage a huge toolbox from numerical analysis to make generation faster and more efficient. So in hindsight it looks messy, but it’s actually pretty natural: different communities built different “maps” of the same landscape, and only after the field matured did we get a clean unified picture.

Q: Many readers say the change-of-variables perspective helped the math finally “click.” Why do you think this classical calculus idea turns out to be so central to modern generative models?

A: Yeah, I think it clicks for people because the change-of-variables idea is really just about reshaping space. It tells you: if you move points around, how does the overall “shape” or distribution change?

And that’s basically what generative models do. You start with something simple—like a cloud of random points, or white noise—and you want to transform it into something structured, like images, audio, or text patterns. So at a high level, generative modeling is literally a problem of reshaping a simple cloud into a complicated one. That’s why this classical calculus idea ends up being so central—it’s the cleanest language for describing “how randomness turns into structure.”
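In symbols, the classical formula says exactly this: if a smooth invertible map $f$ pushes samples $x \sim p_X$ to $y = f(x)$, the reshaped density is

```latex
p_Y(y) \;=\; p_X\!\left(f^{-1}(y)\right)\,\bigl|\det J_{f^{-1}}(y)\bigr|
```

where the Jacobian determinant measures how much $f$ locally stretches or compresses volume, which is precisely the "reshaping" of the probability cloud.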

Q: You spend a lot of time emphasizing that diffusion models are not mysterious and that the math is actually quite familiar. Was demystifying the field an explicit goal while writing?

A: We wanted everything to connect back to a few simple, familiar ideas. For example, at a high level, diffusion models are really about reshaping a cloud of random points into a cloud of real data—and that connects naturally to classic tools like change of variables.

Our goal is to explain the intuition clearly and also trace where the key ideas originally came from.

Another big part of demystifying diffusion models is showing that these three origin stories—variational, score-based, and flow-based—often end up using the same core training idea, even if the papers describe it differently. In practice, training a diffusion model can look like a normal regression problem. The only twist is that the ideal target you’d want to train on is hard to compute directly, so people use an equivalent target that is easy to compute, and it teaches the model the right behavior.

This idea showed up early in the score-based literature; it’s very explicit in flow matching, and it’s also hidden—more implicitly—in DDPM’s formulation. One thing we do in the book, especially in Chapter 6, is make that shared structure completely clear. So instead of feeling like three disconnected worlds, readers can see they’re tightly connected and built on the same core principles—just expressed in different formulations.
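A minimal sketch of that shared regression structure (hypothetical toy code, not the book's notation): the ideal target is intractable in general, but the sampled noise itself is an equivalent, easy target, so training reduces to plain least squares. Here the "network" is just a linear model fit in closed form, which happens to be optimal for Gaussian toy data.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy 1-D clean data and a DDPM-style noising at one fixed time t.
x0 = rng.normal(2.0, 0.5, size=20000)       # "clean data"
alpha_bar = 0.5                             # signal fraction at time t
eps = rng.normal(size=x0.shape)             # the noise we add
xt = np.sqrt(alpha_bar) * x0 + np.sqrt(1 - alpha_bar) * eps

# The equivalent, easy-to-compute target is just eps itself; training is
# then ordinary regression.  The "model" a*xt + b is fit in closed form.
A = np.stack([xt, np.ones_like(xt)], axis=1)
(a, b), *_ = np.linalg.lstsq(A, eps, rcond=None)
pred = a * xt + b

# The residual loss is well below the variance of eps (which is 1),
# showing the model has learned real denoising information from xt.
loss = float(np.mean((pred - eps) ** 2))
print(a, b, loss)
```

In a real diffusion model the linear fit is replaced by a neural network and the time t is an input, but the structure of the objective, regression against an easy surrogate target, is the same across the variational, score-based, and flow-based formulations.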

Q: In the book you talk about how flow map models aim to replace many small steps with a few large jumps. Conceptually, what excites you most about this shift in how we think about generation?

A: Diffusion models’ sampling speed has been a major research problem since around 2021. And broadly, I think the field has been moving along two big trends.

The first trend is: if you take the differential equation perspective, then a natural way to accelerate sampling is to improve the numerical solvers—better discretization, better integration methods, fewer steps. That’s also why I think the Score-SDE work was so impactful: it really bridges modern generative modeling with classic tools from differential equations and numerical analysis.

The second trend is more ambitious: instead of solving the differential equation step by step, can we learn the solution map directly? In other words, once you have the starting point, can you take a long jump to the final result? There have been great works in this direction, like Progressive Distillation from Google Brain in 2022. And then in 2023, Yang at OpenAI pioneered Consistency Models, which is one of the first flow-map style approaches that learns a big jump—from a noisy input directly to a clean, realistic sample—and it can even be trained from scratch.

And today, the community is pushing even further toward standalone few-step generative models—models that are easy to train, can generate fast with high fidelity, and still give users strong controllability. I think this shift is very natural, and it mirrors how generative AI has evolved over the past decades. Ultimately, these faster and more controllable models will enable more interactive applications, like real-time generative systems and even world simulators that can respect physics and respond to actions.

Q: What underappreciated ideas from the book do you want to point out?

A: I’d say there are two underappreciated ideas in the later chapters.

First is the story behind fast samplers. Around 2021, there was a huge wave of work aiming to speed up diffusion sampling—methods like DDIM, DPM-Solver, and many others. What I find underappreciated is that a lot of these advances can be beautifully interpreted through the lens of classical numerical analysis—things like Euler, Heun, and higher-order solvers—adapted to the specific structure and parameterization of diffusion models. In the book, we put effort into making those connections explicit, so readers can better understand how these methods relate, what assumptions they make, and why they work.

The second underappreciated idea is flow-map models. After Consistency Models appeared, this line of work has grown extremely fast. But because it’s moving so quickly, many papers end up rephrasing or rebranding similar core ideas. So in the flow-map chapters, we try to extract a clean and intuitive template: how flow-map models naturally stem from diffusion models, what problem they’re really solving, and how different variants fit into one unified picture. That’s something I think many readers find surprisingly clarifying.

Q: This book balances intuition, math, and system-level thinking. How did you decide what to explain visually or conceptually versus what to relegate to appendices and proofs?

A: It wasn’t just my decision. My co-authors—Yang, Dongjun, and Stefano—and I had a lot of intense discussions about how to make everything as simple and intuitive as possible. Even though I come from a math background, I actually don’t like overly complicated formulas. That doesn’t mean we want to sacrifice rigor—but I do believe intuition should come first. So we worked hard to extract the core principles, provide clear intuition and insights, visualize the key concepts, and keep the more advanced math as supplementary material for readers who want to go deeper.

Q: Looking ahead, do you see diffusion models as a single dominant paradigm/standard or more as a flexible principle that will keep evolving into new forms we haven’t named yet?

A: I see diffusion models less as a single “final” paradigm, and more as a flexible principle that will keep evolving. Research won’t stop—people’s curiosity won’t stop. And just like diffusion models built on lessons from earlier generations of models like VAEs and GANs, future approaches will build on what diffusion and flow-map models have taught us.

To me, a key takeaway from diffusion models is the forward–reverse perspective: you map data into a simple, well-understood space—like noise—as an anchor, and then learn how to transform it back into realistic data, almost like a coarse-to-fine process in human creation. It’s not that every future model has to follow this exact template, but I think this viewpoint will remain a useful foundation and will likely inspire new designs.

Conclusion

A lot of diffusion discourse focuses on results or sampling tricks, but Jesse's framing is more durable: "Applications change, but the principles are enduring." That's the heartbeat of the book and this interview: making the field navigable by tracing its shared structure, and showing how modern variants converge on the same familiar math.

Just as importantly, the interview makes clear where the field’s momentum is headed: toward fewer-step generation that stays high-fidelity and controllable, because “faster and more controllable models will enable more interactive applications, like real-time generative systems and even world simulators that can respect physics and respond to actions.”

To learn more about this essential book for your AI library, visit: https://the-principles-of-diffusion-models.github.io/
