FRIGID Scales Molecular Structure Elucidation With Masked Diffusion

Runzhong WangMicrosoft ResearchThursday, June 4, 202612 min read

MIT postdoc Runzhong Wang argues that de novo molecular structure elucidation from tandem mass spectrometry is constrained less by instruments than by computation: researchers can produce high-quality spectra, but often cannot infer the molecules behind them. His talk presents DiffMS and FRIGID, two diffusion-based inverse models that decompose the task into spectrum-to-fingerprint prediction and scalable fingerprint-to-structure generation. Wang’s central claim is that scaling helps most where chemical structure data are abundant, while forward fragmentation models can guide inference by identifying parts of a generated molecule that do not match the observed spectrum.

The bottleneck is no longer acquiring spectra; it is turning them into structures

Runzhong Wang frames molecular structure elucidation as a computational bottleneck inside an otherwise established experimental workflow. In metabolomics, researchers begin with a sample of interest — for example plasma, urine, or another biofluid — and run it through LC-MS/MS, combining liquid chromatography with tandem mass spectrometry. The experimental process can produce a series of mass spectra, with the working assumption that each separated spectrum corresponds to a molecule. The unresolved step is structural elucidation: given a spectrum, infer the chemical structure that produced it.

Wang emphasizes that this matters because small molecules can be scientifically decisive. He points to primary metabolites, microbiome metabolites and therapeutic targets, cancer biomarkers, plant-derived natural products, and environmental toxicants as examples where identifying a single compound can reshape downstream understanding. One example on his motivation slide is 6PPD-quinone, described as a derivative of a tire additive whose trace presence is associated with high fatality in a salmon species.

The scale of the unresolved problem is large. Wang cites a 2022 study reporting that current computational methods cannot resolve roughly 87% of tandem mass spectra in existing databases. His interpretation is blunt: the instruments can generate reproducible, high-quality spectra routinely, but the field still struggles to recover molecular structures from spectral peaks.

~87%

of tandem mass spectra that current computational methods cannot resolve in existing databases, as cited by Wang from Bittremieux et al. 2022

The computational difficulty comes from the combinatorics of chemistry. A mass spectrum records peaks whose x-axis is an m/z value, which Wang says can be regarded as the mass of a fragment, and whose y-axis is intensity. For a peak, if the mass is known, one can enumerate chemically possible formula assignments and candidate substructures. But a molecule’s spectrum has multiple peaks, each with multiple substructure candidates. The elucidation problem becomes a molecular puzzle: choose plausible fragments and assemble them into a full molecular graph. Wang describes this as an NP-hard combinatorial optimization problem, because the number of possible molecules grows combinatorially with molecular mass.

That framing explains why the standard database-search workflow hits a coverage ceiling. In the non-machine-learning version, researchers query an unknown experimental spectrum against a library of spectral standards: compounds whose structures and spectra are already known. If a close match exists, the annotation can be strong. But Wang contrasts approximately 50,071 unique compounds in NIST’23 MS/MS with 111 million compounds in PubChem. The standard library is not small, but it covers only a small fraction of known chemical space.

Resource	Approximate coverage cited in the talk	Role in Wang's argument
NIST’23 MS/MS	50,071 unique compounds	A comprehensive commercially available spectral-standard library
PubChem	111 million compounds	A much larger chemical universe against which spectral libraries have limited coverage

Wang uses the gap between spectral standards and known structures to motivate machine-learning approaches.

Wang divides machine-learning approaches into two broad families. Forward MS models predict spectra from molecular structures. They preserve the retrieval pipeline by augmenting the standards library: predict spectra for many structures, then compare an experimental spectrum against that enlarged set. Inverse MS models attack the problem directly: take an experimental spectrum and predict a structural representation, a fingerprint, or a molecule.

He is careful not to dismiss the forward path. For real-world case studies, he says forward models are “more reliable,” particularly when a candidate set exists. But the talk focuses on inverse models because de novo generation from spectra aligns most directly with generative modeling.

DiffMS fixes the molecular formula and makes generation a graph problem

DiffMS is Wang’s first main example of an inverse generative model. It is a discrete graph diffusion model conditioned on a mass spectrum. Given a spectrum, DiffMS attempts to generate the molecule directly, rather than retrieve it from a fixed library.

The key design choice is to assume or predict the precursor molecular formula before generation. Once the precursor formula is known, Wang explains, the number and types of heavy atoms are fixed. That turns molecule generation into the problem of generating the adjacency matrix of a molecular graph. DiffMS starts from a randomly initialized adjacency matrix and performs discrete diffusion step by step, conditioned on spectral information, to recover the molecular graph.

The model’s pipeline has two modules and three training stages:

Stage	Input	Target	Training source cited by Wang
Spectrum encoder pretraining	Peak formulae from the spectrum	Molecular fingerprint	MS dataset of about 20K structures
Graph decoder pretraining	Molecular fingerprint	Molecular structure	Larger chemical libraries of about 2.8M structures
End-to-end finetuning	Peak formulae	Molecular structure	MS dataset of about 20K structures

DiffMS separates spectrum-to-fingerprint learning from fingerprint-to-structure decoding, then finetunes end to end.

The spectrum encoder begins by assigning possible subformulae to peaks. In response to a question, Wang clarifies that for a typical molecule there may be around 10 to 20 peaks, while the number of possible subformulae depends on the molecular formula and is finite under that constraint. If the formula is C6H5, for example, one can enumerate bounded ways to split it into subformulae such as C5H5. Wang says this assignment is “not that hard” when using mass as a proxy and that chemical formula inputs carry more information than raw mass values alone.

The spectrum encoder then feeds formula-annotated peaks into a transformer to predict an intermediate representation, initialized as a molecular fingerprint. A molecular fingerprint, as Wang later explains in discussion, is a fixed-length vector of bits indicating the presence or absence of substructures or motifs. It is not a lossless representation of the molecule. It behaves like a hashing function: useful structural information is retained, but collisions and missing detail remain possible.

This intermediate fingerprint is central to DiffMS because paired spectrum–structure data are scarce, while fingerprint–structure pairs are easy to produce once structures are available. Wang describes the graph decoder pretraining as where the model sees a scaling effect: pretraining on more fingerprint–structure pairs monotonically improves downstream de novo generation accuracy. He calls this a “scaling law” in graph decoder pretraining.

The source of the scalability is that fingerprints are derived computationally from molecular structures. Wang says one can compute a fingerprint for every structure in large chemical libraries; in principle, the amount of fingerprint–structure data is limited far less severely than annotated spectrum–structure data. The first half of the pipeline, spectrum to fingerprint, still requires supervised spectral labels. The second half, fingerprint to structure, can scale on unlabeled structures by generating fingerprints from them.

The spectrum encoder also incorporates domain-specific elements adapted from MIST. Wang lists several: chemical formula representations as inputs; pairwise neutral-loss fragments encoded in attention layers; heuristic labels for substructures; progressive unfolding for fingerprint prediction; and knowledge distillation from simulated spectra produced by separate forward models.

DiffMS improves the baselines, but exact de novo recovery remains hard

On the reported de novo generation benchmarks, DiffMS outperforms earlier baselines, though the absolute exact-match numbers remain modest. Wang reports metrics including top-1 and top-10 exact structural accuracy, MCES graph edit distance, and Tanimoto similarity over Morgan fingerprints. On a random split of NPLIB1, DiffMS reaches 8.41% top-1 accuracy and 15.44% top-10 accuracy. On a more difficult structural-clustering split from MassSpecGym, DiffMS reaches 2.30% top-1 and 4.25% top-10 accuracy.

Benchmark setting	Model	Top-1 accuracy	Top-10 accuracy	Tanimoto, top-10
NPLIB1 random split	MIST + MSNovelist	5.40%	11.04%	0.44
NPLIB1 random split	DiffMS	8.41%	15.44%	0.47
MassSpecGym structural-clustering split	MIST + NeuralDecipher	0.00%	0.00%	0.22
MassSpecGym structural-clustering split	DiffMS	2.30%	4.25%	0.39

DiffMS improves de novo generation results, but the harder out-of-distribution setting remains difficult for all models shown.

Wang does not present low exact-match accuracy as merely an engineering shortfall. Some errors are chemically close. In one qualitative example, DiffMS predicts the correct scaffold but places a hydroxyl group at the wrong position on an aromatic ring. Wang calls that “a reasonable mistake” in the sense that many positional isomers can share much of the same structural scaffold while differing in a substituent location.

There is also an instrumental limit. Wang points to ESI-MS/MS, which he describes as a soft ionization and fragmentation method that cannot break certain bonds. Some isomers may be difficult to distinguish by spectra alone. His slide compares leucine and isoleucine: different amino acids whose spectra are nearly similar under the shown conditions. The implication is not that the field has already reached a theoretical ceiling, but that even a perfect algorithm may face ambiguity when the data-generating instrument does not produce discriminative evidence.

This matters for interpreting the reported numbers. DiffMS may recover a scaffold, improve graph distance, or raise fingerprint similarity without achieving exact structural identity. Wang treats all three kinds of measurement as informative, but he also confirms in Q&A that reported accuracy is evaluated against the actual structure, not merely against the fingerprint.

FRIGID scales the part of the pipeline that can be scaled

FRIGID is Wang’s second main system and his answer to what DiffMS revealed: if fingerprint-to-structure pretraining improves with scale, then the architecture should make that part of the pipeline easier to scale. FRIGID replaces graph diffusion with a masked diffusion language model over a linearized molecular representation.

The model uses a sequence representation rather than a 2D graph. Wang discusses SAFE strings, a fragment-based linearized molecular representation used in prior work such as GenMol. In this setup, a molecule is represented as a sequence of tokens, and the model progressively unmasks tokens until it produces a full molecular sequence.

Wang distinguishes this from autoregressive language modeling. Autoregressive models generate left to right, one token at a time, conditioning each next token on previous output. Masked diffusion language models begin with masked tokens and fill them in an arbitrary order. In natural language, Wang says, whether masked diffusion is the right choice is still debated. For molecular generation, he sees a clearer motivation: molecules are not naturally ordered objects, so any fixed left-to-right order is somewhat artificial.

FRIGID keeps the broad spectrum-to-fingerprint-to-structure decomposition. An experimental spectrum and chemical formula are encoded into a fingerprint-like representation, and a conditioned masked diffusion language model decodes from that representation into a SAFE string or molecular structure. The scalable part is the fingerprint-to-SAFE decoder. Wang says FRIGID scales training to one billion structure–fingerprint pairs.

The reason only part of the pipeline is scaled becomes explicit in a Q&A exchange with Carles Domingo-Enrich. Enrich asks why the spectra-to-fingerprint model was not scaled. Wang answers that this part requires supervised labels: from spectrum to fingerprint, the model needs paired spectral data. By contrast, from structure to fingerprint is easy; for every structure or SAFE string, one can use RDKit to generate the fingerprint quickly. The second half can therefore use large amounts of unlabeled structure data, while the first half remains constrained by labeled spectrum–structure pairs.

This design also explains a recurring phrase in Wang’s talk: “scale the right things.” FRIGID does not remove the supervised bottleneck. It isolates it, then invests scale where data are abundant.

Inference-time scaling uses a forward model to find the fragments worth rethinking

FRIGID’s second scaling mechanism operates at inference time. Wang’s key claim is that masked diffusion enables a kind of targeted revision that is natural for molecules and difficult to reproduce with standard autoregressive decoding.

The process uses ICEBERG, a forward MS model developed in parallel by the group. ICEBERG takes a candidate molecule and predicts or simulates its spectrum. FRIGID can then compare the simulated spectrum to the experimental spectrum. Where peaks are unmatched, the system treats the mismatch as evidence that some generated fragments may be inconsistent with the observed data.

Wang describes the loop as follows. Start with a generated SAFE sequence from round zero. Feed the corresponding molecule into ICEBERG. Use ICEBERG’s mismatch signal to identify parts of the molecule believed to be inconsistent. Aggregate those consistency scores back to atoms, then from atoms to tokens. Convert token scores into a renoising probability, randomly remask selected tokens, and feed the partially masked sequence back into the FRIGID base model. The model then redencodes or re-denoises those regions and produces new candidate sequences for the next round.

In Wang’s words, the forward model is used to tell the generator: “maybe think again on these parts of the molecule, because the spectrum are not matched.” This is not a generic best-of-n sampling strategy. It is a cycle-consistent denoising–renoising loop guided by a forward simulation oracle.

The method also works, in principle, for DiffMS because DiffMS is also a diffusion model. Wang reports that adding ICEBERG-guided inference to DiffMS improves top-1 accuracy by about five percentage points on the shown setting. But DiffMS is much slower. On the ablation slide, DiffMS takes 131.21 seconds per sample, while DiffMS plus ICEBERG takes 1423.30 seconds per sample. FRIGID-base takes 6.58 seconds per sample, and FRIGID with the guided inference loop takes 86.49 seconds per sample.

Model	Top-1 accuracy	Top-10 accuracy	Inference compute
DiffMS	8.34%	15.44%	131.21 sec./sample
DiffMS + ICEBERG	15.33%	23.29%	1423.30 sec./sample
FRIGID-base	19.80%	24.41%	6.58 sec./sample
FRIGID	25.03%	33.37%	86.49 sec./sample

Wang reports that ICEBERG-guided inference improves diffusion models, but FRIGID is far more efficient than graph diffusion under the same style of refinement.

Wang’s architectural preference follows from this combination of training and inference constraints. Sequences are GPU-friendly, benefiting from optimized language-model infrastructure. Diffusion, as opposed to autoregression, permits partial targeted renoising and re-denoising. Wang says he has thought about whether a similar inference-time correction loop could be made to work for autoregressive models, but he has not found a successful idea. For FRIGID, the ability to revise fragments is not incidental; it is one reason for choosing masked diffusion.

The benchmark jump is large, but Wang still treats the forward models as more robust for applications

FRIGID’s benchmark results are substantially higher than DiffMS and other baselines shown in the talk. On NPLIB1, FRIGID-base reaches 19.80% top-1 accuracy, while FRIGID with inference-time refinement reaches 25.03%. DiffMS, by comparison, is shown at 8.34%. For top-10 accuracy, FRIGID reaches 33.37% versus DiffMS at 15.44%.

On MassSpecGym, a harder benchmark Wang uses as evidence of out-of-distribution generalization, FRIGID-base reaches 10.46% top-1 accuracy and FRIGID reaches 18.25%. DiffMS is shown at 2.06% top-1, while MS-BART is shown at 1.07% and MADGEN at 1.11%.

Dataset	Model	Top-1 accuracy	Top-10 accuracy	Top-1 Tanimoto	Top-10 Tanimoto
NPLIB1	DiffMS	8.34%	15.44%	0.35	0.47
NPLIB1	FRIGID-base	19.80%	24.41%	0.54	0.61
NPLIB1	FRIGID	25.03%	33.37%	0.58	0.64
MassSpecGym	DiffMS	2.06%	5.66%	0.36	0.44
MassSpecGym	FRIGID-base	10.46%	12.51%	0.39	0.41
MassSpecGym	FRIGID	18.25%	23.00%	0.45	0.47

FRIGID raises exact-match accuracy on both the in-distribution and harder benchmark settings reported by Wang.

Wang characterizes the NPLIB1 result as more than a threefold top-1 improvement over DiffMS. On MassSpecGym, he says the previous state of the art was around 10% top-1 and that FRIGID pushes it to about 18%.

At the same time, he does not claim de novo generation has become the most dependable path for every real-world metabolomics problem. He closes by separating two use cases. The de novo diffusion models are scientifically and methodologically interesting at the intersection of generative modeling and structure elucidation, and he sees substantial room for improvement. But the group’s forward models — including Marason, ICEBERG, Glacier, ICICLE, and FOAM — are described as more robust for downstream applications, especially where a candidate set exists.

That distinction is important. The generative inverse models are no longer just toy demonstrations: DiffMS is presented as the first method to recover correct molecules on the MassSpecGym benchmark under the spectrum-to-fingerprint-to-molecule framework, and FRIGID substantially improves the numbers. But Wang’s applied recommendation remains conditional. For a real-world case study with candidates, forward prediction and retrieval may still be the more reliable tool.

Fingerprints are useful precisely because they are imperfect compressions

The Q&A sharpens one of the central abstractions: the molecular fingerprint. Enrich asks how a full structure can be represented from a sequence, and Wang clarifies a related but broader point: the fingerprint does not encode the full structure without loss. From structure to fingerprint, information is lost. Wang compares it to a hashing function.

The fingerprint tells the model which motifs or substructures are present and which are absent. That can be rich enough to guide reconstruction: if the model knows that certain fragments exist, it can attempt to “glue them together” into a full molecule. But the fixed-length vector can also collide. A bit may correspond to one substructure, but it may also be activated by another. Increasing fingerprint length could reduce collisions and improve expressiveness. Wang says that with an infinite-length fingerprint, in theory, one should be able to recover the original molecule because it would capture everything.

This imperfection is not a side note; it is the reason the two-stage pipeline is both powerful and constrained. The first stage learns to infer an imperfect but informative molecular summary from spectral evidence. The second stage learns, at much larger scale, to reconstruct structures from that summary. Scaling can improve the decoder, but the upstream fingerprint prediction and the inherent ambiguity of spectra still matter.

It also clarifies why accuracy is measured structurally. Enrich asks whether reported values are computed over fingerprints or over structures. Wang answers that they are evaluated at the structure level. The model may pass through a fingerprint representation, but the task remains molecular structure recovery.

Data and Training Evals and Benchmarks AI Research Methods AI in Healthcare and Life Sciences