Wavelet Score Models Show Local Interactions Drive Diffusion Denoising

Emma FinnMicrosoft ResearchTuesday, May 26, 202612 min read

Emma Finn argues that the memorization puzzle in diffusion models can be probed by replacing a black-box score network with an analytically solvable wavelet parameterization. In her Microsoft Research New England seminar, Finn presents the method as a way to isolate which data moments and dependency structures matter across noise scales. Her reported experiments on MNIST suggest that local same-scale wavelet interactions improve denoising more consistently than independent coefficient models or orientation-only coupling, while the larger question of whether the framework explains generative novelty remains unresolved.

The puzzle is why the optimal denoiser should memorize

Emma Finn frames the problem as a tension between the theoretical behavior of an “ideal score machine” and the empirical puzzle that motivates diffusion research: models generate diverse, visually plausible samples, yet the idealized empirical denoiser seems pushed toward memorization. In the standard diffusion setup, data are gradually noised into a simple prior, usually a centered multivariate Gaussian, and generation reverses that process by using a learned score function, the gradient of the log density at each noise level. The only unknown in the reverse process is that score.

The difficulty is that, if the target is the empirical distribution of the training data, the optimal denoiser has a strong reason to memorize. Finn describes the dataset not as access to a true image distribution, but as “a bunch of Dirac masses” at observed training points, smoothed by Gaussian noise. Under that view, an empirical optimal denoiser behaves like a weighted average over training samples. At low noise, it should snap a corrupted test image back to the nearest training image.

A slide credited to Farghly et al. [2026] illustrated this behavior on CIFAR-10: samples generated using the empirical score function were compared with the closest image in the dataset, and the denoising trajectory eventually returned to something essentially matching a training sample. Finn also used MNIST as an intuition pump: a noisy digit 5 in a low-noise regime should be matched to its nearest neighbor in the dataset.

That is the paradox her work is aimed at probing. If the ideal empirical score memorizes, what makes diffusion models appear capable of producing novel, visually plausible images rather than simply reproducing training data?

Locality is one answer, but Finn treats it as both architectural and statistical

One proposed answer is locality. Finn describes the “local patch” view associated with Kadmon and Ganguli: instead of matching an entire image to a nearest training image, a convolutional model can match local patches. A small receptive field sees only a patch of the image and searches for similar patches in the dataset.

A purely location-specific patch match is not enough to capture the inductive bias of a CNN, because convolutional networks also have translation equivariance. The patch should not only be compared with patches at the same spatial position across images; it should be compared with similar patches from many spatial positions. Finn connects this to the “patch mosaic” notion of creativity: a generated image can be assembled from locally plausible pieces without being a nearest-neighbor copy of a full training image.

This locality story also explains failures. Finn points to the familiar case of generated hands with six fingers: a receptive field that is too local may see that adding a finger-like patch is plausible without representing the global constraint that five fingers are already present.

But Finn does not treat locality as only an architectural property. She reviews work arguing that similar locality emerges from image statistics themselves. In a figure from Matthew Niedoba and coauthors, a gradient sensitivity heatmap for a denoiser on CIFAR-10 shows that at low noise, an output pixel depends mostly on nearby input pixels at the same spatial location. At higher noise, the sensitivity region expands because a broader neighborhood is needed to denoise a more corrupted input. Finn emphasizes that this is presented as an architecture-independent property.

She then turns to work by Artem Lukoianov and coauthors, which compares a diffusion transformer, a UNet, and a high signal-to-noise projection. Their results, as Finn describes them, show large sensitivity regions at high noise that shrink at lower noise. The key mechanism is an optimal linear denoiser built from the data covariance matrix: its eigenbasis keeps or shrinks directions according to signal-to-noise ratio. For natural images such as CIFAR-10, high-variance principal components look low-frequency, producing a local, low-pass sensitivity blob. Finn highlights that the sensitivity fields learned by UNets and diffusion transformers look very similar despite different architectural biases, and match the high-SNR projection.

Her synthesis is deliberately non-exclusive: “Why not both?” Locality may arise partly from CNN-like architectures and partly from the statistics of natural images. The question becomes how to test which interactions matter.

Wavelets make the score decomposable without turning the model into a black box

Finn’s proposed probe is an analytically solvable wavelet parameterization of the score. The score is a vector-valued function, and in principle it can be expanded in any basis. If the true score were known, its coefficient in a given basis direction could be obtained by projection. But the true score is not known; diffusion models normally train a neural network to approximate it, which obscures what has been learned.

Wavelets are chosen because they provide a multiscale, spatially localized basis. Finn contrasts them with the Fourier transform. Fourier methods express a signal in terms of global periodic functions such as sine and cosine. That is compact for periodic or broadly supported signals, but less useful for localized signals: a simple local bump in time can become spread across many frequencies. Wavelets instead use compactly supported functions, preserving locality while representing structure at multiple scales.

Finn notes that her paper uses Daubechies wavelets. They are not the most visually intuitive wavelets, and they generally do not have a simple closed-form expression; they are defined implicitly by scaling and detail equations with known constants. But they are useful in image tasks and have a long history in image analysis and compression, following the multiresolution framework introduced by Stéphane Mallat in 1989.

For images, the 1D wavelets are extended by tensor products. This yields one scaling atom and three detail atoms at each location and scale, corresponding to different orientations: horizontal, vertical, and diagonal details. Finn describes the indexing with scale, translation, and orientation parameters. At each fixed scale and location, the three orientation components form what she calls a detail band.

The point is not that wavelets are a better image generator than trained UNets or transformers. Finn is explicit that her work is not trying to “beat trained UNets” or do better diffusion than transformers. The wavelet model is meant as a probe: a simpler, interpretable family of score models where locality and scale interactions can be controlled explicitly.

Closed-form coefficients are the trade: less flexibility, more interpretability

The parameterization starts from the fact that the score can be written as a sum over wavelet basis functions. Each coefficient would ideally be the inner product of the true score with a wavelet. Since the true score is unavailable, Finn estimates each coefficient using a feature vector extracted from the noisy image.

The design choice is the feature vector. It can include linear or nonlinear features of the image, depending on what property of the model or data distribution one wants to test. The restriction is that the learnable parameters interact linearly with those features. Finn is clear about the cost: “We’re losing a lot in terms of flexibility and modeling ability,” she says, “but we’re gaining a lot in terms of our ability to have a nice closed form solution.”

The closed-form solution comes from the standard diffusion loss. Because wavelets form an orthonormal basis, the loss decouples across coordinates. Setting the derivative of the loss to zero gives an expression for the optimal parameters. A remaining obstacle is that the expression still appears to involve an inner product with the true score. Finn removes that dependence using Stein’s identity, which she describes as a form of “fancy integration by parts” relying on mild regularity conditions and a vanishing boundary term.

With Stein’s identity, the parameter estimate can be written in terms of dataset image features, with a small ridge penalty for regularization and to avoid inverting a non-invertible matrix. In practice, Finn adds one caveat: wavelets are exactly orthonormal in the theory, but only approximately so on a finite pixel grid. Her report is that this approximation was “good enough” for their experiments, but it remains a limitation.

This is the core bargain of the method. Instead of training a black-box score network, Finn picks a restricted class of wavelet-feature interactions and solves for what that class can learn. The resulting performance gaps are then informative: if a richer interaction model denoises much better than an independent one, that gap says something about the value of the structure that was added.

The three model classes isolate independent moments, orientation coupling, and local coupling

Finn presents three explicit feature choices. Each corresponds to a different hypothesis about what structure matters for denoising.

The first is an independent baseline. Each wavelet coefficient is modeled independently, using polynomial features of the projection of the noisy image onto that wavelet. Finn does not expect this model to be strong. Its purpose is to establish what can be achieved without cross-coefficient structure, so that the gap to stronger models measures the predictive value of dependencies among coefficients.

The second is a band-tied model. At a fixed scale and location, it couples the three detail orientations with degree-D interactions. This targets cross-orientation co-activation caused by edges and corners. Finn describes it as roughly mirroring the channel-mixing inductive bias of CNN or UNet score networks at a single spatial site. If vertical, horizontal, and diagonal wavelet components jointly encode corners and edges, this model should capture some of that.

The third is a local coupling model. At a fixed scale and orientation, the coefficient at one location is allowed to depend on coefficients in a small local neighborhood. This directly probes locality: whether nearby wavelet coefficients carry useful denoising information beyond what a coefficient says on its own.

Model	Interaction allowed	Question it probes
Independent baseline	Each coefficient modeled independently	How much can per-wavelet moments explain without cross-coefficient structure?
Band-tied coupling	Three detail orientations coupled at a fixed scale and location	Do edges, corners, and orientation co-activation improve denoising?
Local coupling	Neighboring coefficients coupled within a local radius at fixed scale and orientation	How much denoising value comes from local spatial interactions?

Finn’s three wavelet score models isolate different dependency structures.

Finn also notes that the feature choice need not be limited to these three models. The framework is intended to be flexible: if a researcher wants to test whether a particular architectural property or data statistic contributes to denoising performance, they can encode that hypothesis through the feature vector and examine the resulting closed-form estimator.

The experiments favor same-scale local interactions over independent or orientation-only structure

The experiments use simple datasets, with MNIST resized from 28-by-28 to 32-by-32 as the main example discussed. Finn repeatedly cautions that this is “mostly a theory paper,” and that the experiments are not the main contribution. Still, they provide sanity checks and qualitative signals.

The MSE charts shown in the seminar carry much of the empirical comparison. They compare mean squared error across noise levels for the independent baseline, the band-tied model, and the local coupling model, with bars for different polynomial degrees. The transcript does not provide exact numeric MSE values, so the useful comparison here is the qualitative pattern Finn reports from those charts rather than a table of precise measurements.

For the independent baseline, denoising results are “pretty okay,” even at fairly high noise regimes, though high noise produces visible artifacts. Increasing the polynomial degree from one to two to three generally improves performance, but with diminishing returns at higher noise. Finn’s interpretation is that higher moments, such as the third moment of the data, are not very informative on their own in high-noise settings.

The band-tied model produces a more mixed result. Finn says it may yield visually better edges at high noise, but the measured results are “okay, not great,” especially in high-noise regimes. Her explanation is that trying to resolve corners and edges at high noise can introduce more problems than it solves. At high noise, apparent corners may be artifacts of the random corruption rather than reliable image structure.

The local coupling model performs best among the three. Allowing wavelets in the same local neighborhood, at a fixed scale and orientation, to interact gives more consistent gains across noise levels. Finn says the diminishing-returns phenomenon seen in the independent model does not appear in the same way. The model produces visually nicer denoising results and is “quite good as a denoiser” within the restricted experimental setup.

Model	Reported pattern	High-noise behavior
Independent baseline	Higher polynomial degree generally helps, with diminishing returns	Artifacts appear; higher moments alone become less informative
Band-tied coupling	May improve perceived edges and details	Resolving corners and edges can introduce more problems than it solves
Local coupling	Most consistent improvement across noise levels	Does not show the same diminishing-returns pattern Finn reports for the independent model

The result charts are discussed qualitatively in the source; exact MSE values are not given in the transcript.

32×32

image size used for the MNIST denoising comparisons discussed in the talk

The pattern supports Finn’s narrower empirical claim: local same-scale interactions are particularly informative for denoising in this wavelet score family. That conclusion is coherent with both sides of the earlier literature she reviewed — architecture-based explanations that emphasize convolutional locality, and data-statistical explanations showing that natural image denoisers have local sensitivity fields.

At low noise the wavelet models look competitive, but high noise exposes the missing interactions

Finn compares the analytical wavelet models with trained diffusion models, while stressing that outperformance is not the objective. The wavelet models have many fewer parameters, and their purpose is to make behavior interpretable.

The comparison is nonetheless informative. At low noise, Finn says the wavelet-based models are roughly competitive with trained denoisers. Her interpretation is cautious: it “seems like” the trained models may be doing something “roughly similar” in that low-noise regime, where the wavelet models do a good job of denoising at small noise scales.

At high noise, the gap increases sharply. Finn treats this as expected. Trained denoisers can use deeper, nonlinear interactions and broader multi-scale dependencies that the three simple wavelet models do not encode. This distinction matters: Finn’s best-performing model couples nearby coefficients at a fixed scale and orientation, while trained denoisers can combine information across scales and through deeper nonlinear computations. The analytical models also struggle with generation from pure noise. They are useful probes of score structure, not full replacements for trained generative models.

Her summary of the empirical picture is compact: increasing polynomial degree helps, but independent models have diminishing returns; band-tied orientation coupling can improve perceived details and edges but may hurt at high noise; local interaction terms give the most consistent improvement; and trained denoisers dominate at high noise because they can exploit richer interactions.

The unresolved question is whether this wavelet probe can explain generalization, not just denoising error

In the question period, an audience member presses on the connection between the wavelet results and the original memorization problem. The reported metrics are MSEs, while the motivating paradox is about whether diffusion models memorize or generalize. The question asks whether wavelets help explain that generalization, and how the work relates to literature on harmonic bases, locality, and nearest-neighbor matching.

Finn’s answer is direct: for this paper, they do not yet have that result. She says they are hoping to get there eventually, but “not yet.”

She points instead to Kadmon and Ganguli’s “Analytic Theory of Creativity in Convolutional Neural Networks” as the relevant comparison. As Finn describes it, that work gives a closed-form approximation to a CNN score and solves for the optimal denoiser within a CNN-like function class. They show, empirically and theoretically, that this restricted optimal denoiser does not memorize. The key idea, in Finn’s account, is that restricting the function class forces mistakes; if the function class is well chosen, those mistakes are informative and help generalization.

That distinction matters for Finn’s work. In her description, the current wavelet models show which score interactions improve denoising under controlled assumptions, while the direct memorization-and-novelty question remains future work. The hope is to extend the framework toward “wavelet-based ideal score machines” that could make that comparison more direct.

Future work also includes more robust approximation guarantees and finite-sample error bounds for the wavelet estimators; experiments at higher resolution, where prior work suggests wavelets can speed up sampling and improve image quality; comparison with linear denoisers based on large score matrices and eigendecompositions; models with multi-scale dependence across different wavelet scales; and extension to RGB and multi-channel images, which Finn says should work naturally by decomposing each channel.

AI Research Methods Image and Video Generation