There's a paper from late 2023 called DeepCache: Accelerating Diffusion Models for Free (Ma, Fang, Wang — NUS, CVPR 2024). On the surface it's an inference-optimization paper: it makes Stable Diffusion 2.3× faster and LDM-4-G up to 7× faster, training-free, with negligible quality loss. That's a useful result, and most coverage of the paper stops there.
I want to argue the speedup is the boring part. Buried in the methodology is an empirical finding that I think is more interesting than the algorithm built on top of it: across virtually every diffusion model the authors tested, the internal feature maps at adjacent denoising steps are more than 95% similar (by cosine similarity). In some configurations, a single timestep's features are >95% similar to 80% of the other timesteps in the trajectory. The model is, in a real sense, computing the same thing over and over.
Once you sit with that fact, a question opens up that the paper doesn't really pursue: what does this tell us about what diffusion models are actually doing? The speedup is a corollary. The finding itself is the thing worth reading the paper for. This post is my attempt to take it seriously.
0. If You're New: A Three-Minute Tour
If you already know what a U-Net, DDIM, and a denoising step are, skip to §1.
A diffusion model generates an image by reversing a noising process. You start with pure random Gaussian noise — a meaningless static image — and run it through a neural network many times. Each pass nudges the image one small step away from noise and toward something coherent. After 20 to 250 passes (depending on the sampler), what comes out the other end looks like a real photograph, painting, or whatever the model was trained on.
The neural network doing this denoising is almost always a U-Net: an encoder-decoder architecture with shortcut connections — called skip connections — that link matching layers on the encoder and decoder sides. The encoder squeezes the image down to an abstract representation; the decoder builds it back up. The skip connections let detailed information bypass the squeeze.
The thing that makes diffusion models slow is that every one of those 20–250 steps requires a full forward pass through the U-Net. There's no parallel shortcut — step t-1 depends on step t's output. If the U-Net is large (Stable Diffusion's is ~860 million parameters), each pass is expensive, and you're doing dozens of them per image.
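If it helps to see that loop written down, here's a minimal schematic in Python — not any particular library's API, just the shape of the computation (`scheduler` and its methods are stand-ins):

```python
import torch

def generate(unet, scheduler, shape=(1, 4, 64, 64)):
    # Start from pure Gaussian noise and denoise it step by step.
    x = torch.randn(shape)
    for t in scheduler.timesteps:             # e.g. 20-250 steps, sampler-dependent
        noise_pred = unet(x, t)               # one full, expensive forward pass per step
        x = scheduler.step(noise_pred, t, x)  # nudge x slightly away from noise
    return x                                  # a coherent image (or latent) at the end
```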
That's the setup. Speeding diffusion models up is a research subfield, and DeepCache is one entry in it. But the reason DeepCache works is what this post is actually about.
This post assumes you've at least heard of "denoising steps" and have a vague sense of what a U-Net looks like. If not, the original DDPM paper and the U-Net Wikipedia page are about thirty minutes of reading and will make everything below land harder.
1. The Finding the Paper Doesn't Quite Frame as a Finding
Most of the DeepCache paper is structured as a methods paper: here's a problem (slow inference), here's an observation (redundancy), here's our algorithm, here are the speedups. The observation is presented as motivation — a stepping stone to the method.
I think it deserves to be read the other way around.
The authors did something simple. They took several pre-trained diffusion models — Stable Diffusion v1.5, LDM-4-G on ImageNet, DDPM on LSUN-Bedroom and LSUN-Churches — and during inference they recorded the activations at one of the upsampling blocks of the U-Net. Then they computed the cosine similarity of those activations between every pair of denoising steps.
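In sketch form, the measurement looks something like this — `up_block`, `run_sampler`, and the hook bookkeeping are my own stand-ins, not the authors' code:

```python
import torch
import torch.nn.functional as F

feats = []  # one recorded feature map per denoising step

def record(module, inputs, output):
    # Forward hook: stash this upsampling block's activations at every step.
    feats.append(output.detach().flatten().cpu())

handle = up_block.register_forward_hook(record)   # up_block: one U-Net upsampling block
run_sampler(unet, steps=50)                       # any ordinary sampling loop
handle.remove()

# Pairwise cosine similarity between every pair of steps -> a Figure 2(b)-style heatmap.
X = F.normalize(torch.stack(feats), dim=1)
similarity = X @ X.T   # similarity[i, j] near 1.0 means steps i and j computed ~the same thing
```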
The result is the heatmap in Figure 2(b) of the paper. It's mostly hot. For Stable Diffusion v1.5, the activations at most pairs of nearby steps are >95% similar. The same holds for LDM-4-G. For DDPM on the LSUN datasets, some timesteps are >95% similar to as much as 80% of all other timesteps.
Read that again, because it's the load-bearing fact of this entire post: the internal representation that the model computes at one denoising step is, at many points in the trajectory, almost indistinguishable from the representation it computes ten or twenty steps earlier or later.
If you came to this paper looking for a speedup trick, this is where you start designing one. If you came looking to understand what diffusion models are doing, this is where you stop and ask a different question.
2. What This Could Mean: Three Readings
Empirical findings of this shape don't have a single interpretation. There are at least three readings of "internal features barely change across denoising steps," and they make different predictions about what we should expect to see if we look harder. I want to walk through all three before committing to one.
Reading A: The model is doing busywork.
The simplest reading is that the model is over-parameterized for the task it's been given. Diffusion models are trained to handle the full noise schedule, which spans everything from pure noise to a nearly clean image — but at any individual step, the model is doing a much smaller job (remove a tiny amount of noise). It has more capacity than it needs at each step, so it spends a lot of that capacity computing things that don't actually drive the output.
Under this reading, the redundancy is essentially a receipt for waste. A leaner model, or a model dynamically sized per step, would show less of it.
This is a reasonable thing to think, and there's research that broadly supports it — pruning and distillation both work on diffusion models, which is evidence that they're carrying capacity they don't strictly need. But it doesn't fit the structure of what DeepCache observed. If the model were uniformly bloated, you'd expect the redundancy to be roughly uniform across the denoising trajectory. It isn't. The paper itself notes that for LDM-4-G, feature similarity drops sharply around the 40% mark — there's a region of the trajectory where the model is doing real, non-redundant work. Bloat doesn't predict that. Reading A captures something, but it isn't the full story.
Reading B: Of course it's redundant — denoising is a smooth process.
The second reading is more sophisticated and almost the opposite. Mathematically, the reverse diffusion process is a discretization of a reverse-time stochastic differential equation. It's supposed to be smooth. Adjacent points on a smooth trajectory are, by construction, close to each other. The fact that the network's internal representation changes slowly between consecutive steps is just a reflection of the underlying continuous-time process being smooth.
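For reference, the continuous-time object being discretized here is the reverse-time SDE of the score-based formulation, in which the network's job is to approximate the score \( \nabla_x \log p_t(x) \):

$$ \mathrm{d}x = \left[ f(x, t) - g(t)^2 \, \nabla_x \log p_t(x) \right] \mathrm{d}t + g(t) \, \mathrm{d}\bar{w} $$

Integrated backward in time, a well-behaved score field gives a smooth trajectory more or less by construction — which is the heart of Reading B.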
Under this reading, DeepCache's observation isn't surprising at all — it's a property the model is expected to have if it's been trained correctly. Any well-trained diffusion model should exhibit it. There's no deeper insight to extract; the finding is just confirmation that the math is doing what the math says.
This reading is partially true, and it's the reading I'd expect a more theoretically minded reader to reach for first. But it's almost too pat. It explains the average behavior of the trajectory but doesn't explain why the redundancy is concentrated in high-level features specifically — the deep, abstract activations near the bottleneck of the U-Net. The skip-connection features (the shallow, detail-carrying activations) change much more from step to step. If smoothness alone explained the finding, you'd expect the redundancy to be evenly distributed across the network's depth. It isn't. Something more specific is going on at the high-level layers.
Reading C: The model decides what to draw early, then spends the rest of the time deciding how cleanly to draw it.
The reading I find most interesting — and the one this post is going to commit to — is that the asymmetry between high-level and low-level features is telling us something specific about how the model has organized its work.
The high-level features are computed near the bottleneck of the U-Net. In a trained network, these are the layers that encode abstract, semantic content: roughly, what's in the image. Subject, composition, layout, basic shapes. The low-level features — carried by the skip connections from shallow layers — encode the local, textural, noise-shaped content: where the brushstrokes go, how clean each pixel is.
If you accept that decomposition, the empirical finding has a natural reading: the high-level decisions get made early in the denoising process and don't change much after that. The rest of the trajectory is the model refining textures and resolving noise around a semantic scaffold that locked in long before the image looked like anything.
Three things about Reading C are worth dwelling on.
First, it explains the dip. The 40% mark in LDM-4-G isn't a random anomaly — it's plausibly the region of the trajectory where the semantic content is still being negotiated. Before that point, the image is too noisy for the model to commit; after it, the commitment is locked and the work shifts to refinement. If you've ever watched a Stable Diffusion image generate from random noise, there's a moment somewhere in the middle where the picture snaps from looking like turbulent fog to looking like a recognizable subject. Reading C predicts this. The other readings don't.
Second, it explains the layer asymmetry. High-level features are stable across most steps because the semantic decision is stable across most steps. Low-level features change continuously because the texture is being refined continuously. It's not that the deep layers are bloated; it's that they have a different job, and that job concludes early.
Third, it gives us a framing for what "iterative refinement" actually means in a diffusion model — and the answer is less symmetric than the math suggests. The trajectory isn't a uniform crawl from noise to image. It's a brief negotiation phase where semantics are decided, followed by a long refinement phase where details are filled in around a fixed plan. The mathematical formulation of diffusion treats every timestep equally; the trained model, in practice, doesn't.
The redundancy isn't evidence of a bloated model. It's evidence of a model whose work is structurally uneven across the trajectory — concentrated in the early phase where semantic decisions are made, and sparse in the long tail where it's just texture cleanup. DeepCache's speedup comes from exploiting that unevenness.
I should be honest: this is an interpretation, not something the paper proves. The authors don't make this claim. The cosine-similarity numbers are consistent with Reading C but they don't establish it. To do that you'd want to vary the prompt or class label and check whether high-level features stabilize at the same point regardless, or you'd want to probe the high-level layers and see whether they encode the kind of semantic content this reading predicts. That work hasn't been done in this paper.
But Reading C does something the other readings don't: it makes new predictions. And it connects DeepCache to a body of empirical observations that diffusion-model practitioners have been collecting informally for years.
3. The Empirical Folklore That Reading C Explains
If you spend time with Stable Diffusion or any other open diffusion model, you start to notice things that aren't in the papers but that everyone who plays with these models seems to agree on.
The structure-emerges-first phenomenon. If you visualize the partial outputs of a diffusion model as it denoises, the rough composition of the final image — where the subject is, what color zones go where, what general shape the main object takes — appears very early in the trajectory, often within the first 20% of steps. Refinement of detail happens over the remaining 80%. This is exactly what Reading C predicts: the semantic scaffold is built early, the rest is texture work.
Early-step prompt sensitivity. Practitioners who use techniques like prompt scheduling — changing the prompt partway through generation — have observed that prompt changes in the early steps dramatically alter the image, while prompt changes in the late steps barely do anything. The semantic decisions are being made early. After they're locked in, the model has stopped listening for "what to draw" and is only working on "how to draw it."
Why CFG matters more in the early steps. Classifier-free guidance — the trick that gives Stable Diffusion its ability to follow prompts well — is empirically more impactful in early steps than late ones. There are even samplers that disable CFG late in the trajectory to save compute with no quality cost. Under Reading C, this is exactly right: CFG is a tool for steering semantic decisions, and the semantic decisions are happening early.
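To make that last item concrete, here's a minimal sketch of CFG with a guidance scale that tapers off late in the trajectory — the function names and the 60% cutoff are mine, not from any particular sampler:

```python
def guided_noise(unet, x, t, cond, guidance_scale):
    # Classifier-free guidance: push the conditional prediction away from the
    # unconditional one by the guidance scale.
    eps_uncond = unet(x, t, cond=None)
    eps_cond = unet(x, t, cond=cond)
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

def scale_for_step(i, total_steps, base_scale=7.5):
    # Under Reading C: once the semantic decisions are locked in (here, after
    # 60% of the steps), guidance has nothing left to steer -- drop it to 1.0.
    return base_scale if i < 0.6 * total_steps else 1.0
```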
None of these observations come from the DeepCache paper. None of them are formal results. But they all line up with Reading C and they don't line up cleanly with Reading A or B. When an interpretation explains both the formal finding and a pile of community folklore that wasn't part of the original observation, that's evidence the interpretation is doing real work.
The most underrated kind of paper is one that turns folk knowledge into a measurable claim. DeepCache does this without quite realizing it: a thing practitioners had been noticing informally for years gets quantified for the first time as a cosine-similarity heatmap. The speedup is what the authors built; the measurement is what they contributed.
4. The U-Net Is the Reason This Works At All
There's one more piece of the picture worth pausing on: the U-Net architecture is unusually well-suited to exploiting this redundancy, and the paper's algorithm depends on that fact in a way that's easy to miss.
A U-Net has two pathways for data:
          ┌───── Skip Branch (shallow, cheap) ─────┐
          │                                        ▼
Input ──▶ D1 ──▶ D2 ──▶ D3 ──▶ M ──▶ U3 ──▶ U2 ──▶ U1 ──▶ Output
                  (Main Branch — deep, expensive)
The main branch goes all the way through the encoder, the bottleneck, and back up. It's where most of the FLOPs live. The skip branches are cheap shortcuts from each encoder block to its matching decoder block. They were originally introduced (in the 2015 U-Net paper for biomedical segmentation) so that fine-grained spatial information could survive the bottleneck without being squeezed out.
The accidental gift to diffusion is that this same architecture cleanly separates two kinds of features: deep, abstract, expensive ones along the main branch, and shallow, detailed, cheap ones along the skip branches. Reading C maps onto this architecture beautifully: the deep features carry the semantic decisions (which are stable), the shallow features carry the textural refinement (which changes step by step).
DeepCache's algorithm exploits exactly this. At step t, it runs the full U-Net and stashes the deep features. At step t-1, it runs only the shallowest encoder and decoder blocks — D1 and U1 — and feeds them the cached deep features instead of recomputing them. The middle of the network, where the expensive-but-stable work lives, is just skipped:
Step t   (FULL):   D1 → D2 → D3 → M → U3 → U2 → U1 ──▶ Output_t
                                            │
                                            └── stash F_cache from U2

Step t-1 (CACHED): D1 ────────────────────► U1 ──▶ Output_{t-1}
                    └─── concat with F_cache ───┘
                   (D2, D3, M, U3, U2 ← all skipped)
The full version of this idea is called 1:N caching: do one full step, then N-1 cached steps, then refresh. With N=2 the speedup is modest and the quality cost is essentially zero. With N=5 the speedup is substantial and the cost is small. With N=10 or N=20 you get the flashy headline numbers but quality starts to visibly degrade — because at those intervals, you're caching deep features past the point where they're still valid.
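In code, the 1:N schedule is a few lines wrapped around the sampling loop. A sketch, under the assumption that the U-Net has been split into a cheap shallow path (`shallow_in`, `shallow_out`) and an expensive deep middle (`deep`) — my decomposition, not the paper's implementation:

```python
def sample_with_caching(shallow_in, deep, shallow_out, scheduler, x, N=5):
    cache = None
    for i, t in enumerate(scheduler.timesteps):
        h = shallow_in(x, t)                    # D1: always computed (cheap)
        if cache is None or i % N == 0:
            cache = deep(h, t)                  # D2..M..U2: full pass, refresh the cache
        noise_pred = shallow_out(h, cache, t)   # U1: fresh skip features + (possibly stale) deep features
        x = scheduler.step(noise_pred, t, x)
    return x
```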
The authors also propose a non-uniform version of the schedule: full-inference steps are spaced more densely around the 40% region (where features are changing fastest) and more sparsely at the smooth ends. Under Reading C, this is exactly what you'd want — spend more compute where the semantic negotiation is happening, less where it's settled.
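One way to build such a schedule — a sketch of the idea, not the paper's exact formula — is to crowd the full-inference steps around a chosen center of the trajectory:

```python
import numpy as np

def nonuniform_full_steps(total_steps=50, num_full=10, center=0.4, power=1.5):
    # Returns the indices of the steps that get a full U-Net pass, spaced more
    # densely around `center` (a fraction of the trajectory) and sparser at the ends.
    offsets = np.linspace(-1.0, 1.0, num_full)
    offsets = np.sign(offsets) * np.abs(offsets) ** power   # compress spacing toward the center
    idx = (center + offsets * max(center, 1.0 - center)) * total_steps
    return sorted({int(i) for i in np.clip(idx, 0, total_steps - 1)})
```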
The non-uniform schedule is a small part of the paper but it's the strongest indirect evidence for Reading C. The fact that quality is preserved when you concentrate compute around the dip — and not when you concentrate it elsewhere — suggests the dip really is where the model is doing its semantic work.
5. The Speedup, Briefly
For completeness — though I've been arguing it's the less interesting half of the paper — here are the headline numbers.
On Stable Diffusion v1.5 with the PLMS sampler, DeepCache at N=5 achieves a 2.11× speedup with the CLIP score essentially unchanged (30.23 vs 30.30 baseline) — better than 25-step PLMS, better than the BK-SDM distilled models that required millions of training images.
On LDM-4-G with DDIM, the tradeoff curve is wide: 1.88× speedup with FID unchanged, 4.12× with FID barely worse, 6.96× with FID notably worse, and 10.54× when the wheels start to come off. The headline "7×" figure corresponds to the second-to-last row of that table — the one where quality is already visibly degrading. The honest summary is 2–4× essentially for free, and a steeper tradeoff beyond that.
Importantly, DeepCache composes with fast samplers. It attacks per-step cost; DDIM and PLMS attack step count. The speedups multiply. This is the most practically useful property of the method and the one most likely to survive the architecture shifts I'm about to discuss.
6. The Pattern Is Bigger Than This Paper
Step back from diffusion models and DeepCache fits into a larger pattern that has been showing up across the deep learning landscape for the last few years: trained networks contain large amounts of recoverable redundancy, and you can exploit it without retraining if you look in the right place.
The pattern shows up in large language model inference as KV caching, the standard technique that makes autoregressive generation tractable. KV caching is a different mechanism — it caches attention keys and values for tokens that won't change as the sequence grows — but the underlying insight is the same: a sequential computation has redundant structure across timesteps, and you can avoid paying for it twice if you store the right intermediate result.
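For readers who haven't seen it spelled out, here's a single-head, batch-free sketch of that mechanism — schematic, not any real library's API:

```python
import torch

def attend_with_cache(q_new, k_new, v_new, cache):
    # Earlier tokens' keys and values never change as the sequence grows,
    # so they're computed once and appended to the cache.
    cache["k"] = torch.cat([cache["k"], k_new], dim=0)
    cache["v"] = torch.cat([cache["v"], v_new], dim=0)
    scores = q_new @ cache["k"].T / cache["k"].shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ cache["v"]

# Usage: cache = {"k": torch.empty(0, d), "v": torch.empty(0, d)}, then call once per new token.
```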
It shows up in Mixture-of-Depths and early-exit models, where the observation is that not every input needs every layer. Easy tokens can leave the network early; hard tokens go all the way through. The redundancy here is across inputs rather than across timesteps, but the underlying structure is the same: the model's full capacity is overkill for the average case, and recognizing which cases need it is most of the win.
It shows up in token merging for vision transformers, where the observation is that not every token needs separate compute. Tokens that look similar can be averaged together with negligible quality loss. Redundancy across the spatial dimension this time — same pattern.
What unifies all of these is a methodological move: find a quantity inside the trained network that's nearly invariant along some axis (time, depth, position), and short-circuit the computation along that axis. The axis varies. The redundancy is real every time someone looks for it.
DeepCache's contribution to this pattern is to find the redundancy along the denoising-step axis in diffusion models specifically. That hadn't been quantified before. Once you have the heatmap, the algorithm is almost forced. The heatmap is the contribution.
I think the next interesting work in this area won't be more speedup techniques. It will be theoretical work that asks why this redundancy keeps showing up in such varied places, and whether it's a fixable training inefficiency or a load-bearing feature of how high-capacity networks learn. The empirical pattern is now too robust to be a coincidence.
7. What About DiT?
There's an obvious objection to DeepCache: it's U-Net specific. The state of the art in image and video generation is moving toward Diffusion Transformers (DiT). Stable Diffusion 3 uses MMDiT. Sora is DiT-based. Flux is DiT-based. The U-Net era in large-scale generative modeling is, if not over, then certainly tapering. Does any of this matter once nobody is shipping U-Nets anymore?
I think the method has a short shelf life and the observation has a long one.
The specific algorithm depends on the U-Net's clean separation between deep and shallow features via skip connections. DiTs don't have that clean separation. Every transformer block does roughly the same kind of computation, and there's no architectural prior for "this layer is the cacheable one." So DeepCache as written doesn't transfer.
But the observation — that adjacent denoising steps produce nearly identical internal representations — is about diffusion, not about U-Nets. There's no reason to think DiTs would be exempt. Early work on caching for DiT-based models is already showing up and reporting similar speedups, just with different boundaries for what gets cached.
The interpretive content of the finding — Reading C, the claim about how diffusion models organize their work — is even more architecture-independent. If the semantic-then-textural decomposition is real, it's a property of training a model to denoise, not of any specific architecture used to do it. The empirical signature might look different on a DiT (cosine similarity by token? by attention head?) but the underlying phenomenon should still be there.
So if you're reading this in a few years and DeepCache itself feels dated: the paper's algorithm probably is. The thing it pointed at probably isn't.
8. What I Actually Think
DeepCache is a useful paper, and most of the people writing about it are reading it the wrong way.
Read as a methods paper, it's solid: a clean algorithm, real speedups, careful ablations, an honest limitations section. It will get cited and adapted, and people who deploy diffusion models on U-Nets will get measurable wins from using it. That's a fine outcome and the authors deserve the credit.
Read as a measurement paper — which I think is the more honest framing — it's better than that. Without quite framing it as their headline contribution, the authors quantified a property of trained diffusion models that the practitioner community had been observing informally for years. The cosine-similarity heatmap in Figure 2(b) is the kind of plot that closes a loop. It turns "everyone notices the structure shows up early" into "here's the number, here are the models, here's how the redundancy is distributed across the trajectory."
What I'd want to see next isn't another caching trick. It's the followup that asks why. Probe the high-level features at different timesteps and see whether they encode the semantic content Reading C predicts. Vary the prompt midway through generation and check whether the high-level features re-stabilize at a new equilibrium. Run the same measurement on a DiT and see whether the redundancy migrates to a different axis or just goes away. That's the work that would turn an empirical curiosity into actual understanding.
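The prompt-swap experiment in particular is cheap to run. A sketch, reusing the feature-hook idea from §1 — `sample_and_record` is a hypothetical helper that returns the deep features recorded at each step:

```python
import torch.nn.functional as F

def prompt_swap_probe(sample_and_record, prompt_a, prompt_b, swap_at, steps=50):
    # Run one generation with prompt_a throughout and one that swaps to prompt_b
    # at step `swap_at`, then compare the deep features step by step.
    feats_base = sample_and_record([prompt_a] * steps)
    feats_swap = sample_and_record([prompt_a] * swap_at + [prompt_b] * (steps - swap_at))
    # Reading C predicts: an early swap drives a lasting divergence in the deep
    # features; a late swap barely moves them.
    return [F.cosine_similarity(a.flatten(), b.flatten(), dim=0).item()
            for a, b in zip(feats_base, feats_swap)]
```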
DeepCache is, in its own quiet way, an interpretability paper that didn't know it was one. The speedup is the part that gets you to read it. The measurement is the part that's worth keeping.
Although this blog is my original work, I used AI assistance to refine structure, improve clarity, and enhance readability.
