Hard Constraints Steer Generative AI Toward Chemically Valid Materials

Mouyang Cheng · Microsoft Research · Thursday, June 4, 2026 · 16 min read

MIT PhD student Mouyang Cheng argues that generative models for materials discovery need explicit scientific constraints, not just larger diffusion models. In a Microsoft Research seminar, he describes two approaches: diffusion inpainting that forces generated crystals to contain target structural motifs, and CrysVCD, a valence-constrained framework that generates charge-balanced formulas before predicting structures. His case is that constraints such as motifs, valence and stability screens make generative materials design more useful in a field where data are sparse and chemically invalid samples are easy to produce.

Materials generation needs constraints because the data are too small and the rules are too hard

Mouyang Cheng framed materials inverse design as a problem where modern generative models are useful but not sufficient on their own. The objective is not simply to produce plausible-looking crystals. A generated material must satisfy basic chemistry and physics before it is worth handing to an experimentalist or spending expensive simulation time on it.

In Cheng’s account, the computational representation of a crystal has three main parts: atom types, fractional coordinates, and a lattice. The atom types, A, are the elements assigned to the sites; the fractional coordinates, X, locate the atoms inside the periodic unit cell; and the lattice, L, defines the shape of the cell. Once those are specified, the model has a candidate material.
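The three-part representation can be written down as a small data structure. The sketch below is illustrative only: the class name `Crystal`, the row-vector lattice convention, and the toy cell values are assumptions, not details from the seminar.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class Crystal:
    """Hypothetical container for the (A, X, L) representation."""
    atom_types: list          # A: one element symbol per site
    frac_coords: np.ndarray   # X: shape (N, 3), fractional coordinates
    lattice: np.ndarray       # L: shape (3, 3), rows are cell vectors in angstrom

    def cartesian_coords(self) -> np.ndarray:
        # Row-vector convention: each Cartesian position is x @ L.
        return self.frac_coords @ self.lattice


# Toy cubic cell with a CsCl-like arrangement. The 4.0 angstrom edge is an
# illustrative number, not a fitted lattice constant.
toy = Crystal(
    atom_types=["Cs", "Cl"],
    frac_coords=np.array([[0.0, 0.0, 0.0], [0.5, 0.5, 0.5]]),
    lattice=4.0 * np.eye(3),
)
print(toy.cartesian_coords())
```

Keeping coordinates fractional means the same X describes the structure under any rescaling or shearing of L, which is one reason generators tend to treat the two separately.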

The standard of a “good” material is more demanding. Cheng emphasized two requirements. First, it should be stable enough to be synthesizable and to persist outside the calculation. A common loose screen is energy above hull: if a proposed compound sits too far above the convex hull formed by known low-energy phases in its chemical system, it tends to decompose into neighboring phases. Cheng used the conventional threshold of less than 0.1 eV per atom as a minimal stability criterion. Second, the material should have a desired physical property: conductivity, magnetism, mechanical stiffness, thermal conductivity, dielectric behavior, or another target.

<0.1 eV/atom: typical energy-above-hull threshold Cheng used for a stable material candidate
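To make the energy-above-hull screen concrete, here is a minimal sketch for a binary A-B system. The function names and the reference energies are hypothetical, and real workflows use full phase-diagram tools over multi-component systems rather than this pairwise interpolation.

```python
def hull_energy(x, phases):
    """Lowest energy reachable at composition x by mixing two known phases.

    phases: list of (fraction_of_B, formation_energy_per_atom) tuples.
    """
    best = float("inf")
    for xi, ei in phases:
        for xj, ej in phases:
            if xi <= x <= xj:
                if xi == xj:
                    e = min(ei, ej)
                else:
                    t = (x - xi) / (xj - xi)
                    e = (1 - t) * ei + t * ej  # tie-line between the two phases
                best = min(best, e)
    return best


def energy_above_hull(x, energy, phases):
    return energy - hull_energy(x, phases)


# Hypothetical reference data: pure A, pure B, and one stable AB compound.
known = [(0.0, 0.0), (0.5, -1.0), (1.0, 0.0)]

for candidate_energy in (-0.45, -0.30):
    e_hull = energy_above_hull(0.25, candidate_energy, known)
    print(candidate_energy, round(e_hull, 3), e_hull < 0.1)
```

In this toy case the hull at x = 0.25 sits at -0.5 eV/atom, so the first candidate passes the 0.1 eV/atom screen and the second does not.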

The difficulty is that inorganic materials discovery has both a vast search space and a small usable dataset. Cheng contrasted the enormous number of potential materials with the much smaller number of known entries in resources such as the Materials Project. The slide he used put the gap starkly: 10¹⁰–10¹² potential materials, 10⁵–10⁶ known materials, 10,000–100,000 candidates after screening, and five to ten materials for the lab. It also showed 153,235 materials in The Materials Project.

Stage or reference point	Scale shown in Cheng’s slide
Potential materials	10¹⁰–10¹²
Known materials	10⁵–10⁶
Candidate materials after screening	10k–100k
Materials sent to lab	5–10
The Materials Project	153,235 materials

Cheng’s slide used these counts to illustrate the gap between possible inorganic materials and experimentally actionable candidates.

In his words, “if you do deep learning, this is not really a large, large dataset. But actually, this is what we have.” The bottleneck is not just computational. Traditional screening begins with many candidates, applies successive property filters, and eventually sends a handful to the lab. Cheng joked that a principal investigator can “grab a PhD student” to synthesize one, which may take three-plus years.

That setup motivates inverse design: instead of starting from thousands of candidates and filtering downward, start from a desired property and generate candidate structures directly. Diffusion models and other deep generative methods offer a way to learn a high-dimensional distribution of materials and sample from it under conditions. But Cheng’s central claim was that, for materials, the usual machine-learning preference for scale and minimal human priors is not enough. The “bitter lesson” helps when data and compute can scale. Materials discovery is different because useful crystal data are sparse, biased, expensive to label, and tightly constrained by chemistry and physics.

Cheng’s formulation was blunt: “We have to add constraints where science already knows the answer.”

The practical compromise Cheng argued for is to keep scalable generators but add hard constraints where the field already has reliable knowledge: structural motifs, valence balance, stability screens, symmetry, and composition rules. He was explicit that this might not be permanent. If the field eventually has much more materials data, some of these constraints may become unnecessary. “In the future, if there are much more data, then all these slides will become trash,” he said. For now, he argued, constraints are not a philosophical preference; they are a response to the data regime.

Diffusion models can generate crystals, but unconstrained sampling misses obvious structure

The technical background for Cheng’s work is the adaptation of diffusion models from images to periodic crystals. In images, diffusion is conceptually straightforward: add Gaussian noise pixel by pixel, then train a neural network to reverse the noising process. A material is not a rectangular grid of pixels. It is a periodic graph-like structure with discrete atom identities, continuous fractional coordinates, and a continuous lattice.

Cheng described the basic move as joint diffusion over the crystal representation. Fractional coordinates and lattice variables can be treated with Gaussian-style denoising. Atom types, being discrete labels over the periodic table, require a discrete denoising process analogous to diffusion language models. The model must learn how to reverse corruption over A, X, and L so that a noisy or random configuration becomes a plausible crystal.
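The forward (corruption) half of that process can be sketched in a few lines. This is a simplified illustration, not the noise schedule of DiffCSP or MatterGen: the `MASK` token, the single shared `sigma`, and the masking probability are all assumptions made for the example.

```python
import numpy as np

MASK = -1  # hypothetical placeholder for a corrupted atom type


def corrupt(atom_types, frac_coords, lattice, sigma, mask_prob, rng):
    """Apply one level of corruption to (A, X, L)."""
    # Continuous parts: Gaussian noise; coordinates are wrapped back into the cell.
    noisy_x = (frac_coords + sigma * rng.standard_normal(frac_coords.shape)) % 1.0
    noisy_l = lattice + sigma * rng.standard_normal(lattice.shape)
    # Discrete part: each atom type is independently replaced by a mask token.
    hit = rng.random(atom_types.shape) < mask_prob
    noisy_a = np.where(hit, MASK, atom_types)
    return noisy_a, noisy_x, noisy_l


rng = np.random.default_rng(0)
A = np.array([11, 17])                      # atomic numbers for Na, Cl
X = np.array([[0.0, 0.0, 0.0], [0.5, 0.5, 0.5]])
L = 4.0 * np.eye(3)                         # illustrative cubic cell
print(corrupt(A, X, L, sigma=0.1, mask_prob=0.5, rng=rng))
```

A trained model would learn to undo this corruption step by step; the wrap with `% 1.0` reflects the periodicity that distinguishes crystals from images.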

DiffCSP was Cheng’s first milestone example. It handles a subtask called crystal structure prediction: the composition is given, and the model denoises the lattice and fractional coordinates. Cheng’s intuitive description was that the colors of the balls are known, but the model must determine the box and where the balls sit inside it. His slide described DiffCSP as a “diffusion model on periodic cells” and cited NeurIPS 36 (2023): 17464–17497.

MatterGen, which Cheng described as a later Microsoft-led state-of-the-art model, extends this idea by generating atom types, fractional coordinates, and lattice jointly from scratch. It also supports conditional generation. A pretrained score network learns the broad structure distribution; an adapter module can be fine-tuned with labeled data for a desired condition, such as hardness, toxicity, color, or other target properties. Cheng’s slide identified MatterGen as a Nature article, “A generative model for inorganic materials design,” published January 14, 2025, and summarized its capabilities as joint generation of A, X, and L and generation with desired properties.

Model or framework	How Cheng positioned it	Reference shown on slide
DiffCSP	Crystal structure prediction with fixed composition; denoises lattice and fractional coordinates	NeurIPS 36 (2023): 17464–17497
MatterGen	State-of-the-art joint generation of atom types, coordinates, and lattice, with property conditioning	Nature 639, 624–632 (2025)
SCIGEN	Adds explicit structural motif and lattice-type constraints	R. Okabe, M. Cheng et al., Nat. Mater. 25, 223–230 (2026)
CrysVCD	Adds explicit valence constraints on atomic composition	M. Cheng et al., arXiv:2507.19799; Nature Computational Science, in press

The core model lineage and constraint methods as Cheng presented them.

Cheng called MatterGen “kind of the best model even until now” while using it as the backdrop for why additional constraints are still needed. The problem is that models trained broadly on crystal data can make mistakes that are scientifically obvious.

First, generative models may tend to produce low-symmetry structures. Cheng was referring to crystallographic symmetry: a generated structure may not be cubic, trigonal, hexagonal, or otherwise high-symmetry, even though many known crystals have substantial symmetry. That matters because high symmetry and symmetry breaking often carry physical significance, especially in quantum materials.

Second, generated structures may fail to contain target structural motifs, such as triangular, honeycomb, or kagome lattices, even when those motifs are central to the desired physics. Those motifs are not the same category as crystallographic symmetry classes; they are geometric patterns in the atomic arrangement that may be associated with frustrated magnetism or flat electronic bands.

Third, many generated structures violate chemical valence. Cheng’s simplest example was a formula with one sodium and four chlorines: “NaCl4 never exists.” Earlier and contemporary models, he said, could generate formulas such as ZnBO3, TeI9 or TeI6, Li3Al2N, and Li(SnCl3)2. For ionic and covalent compounds under ambient conditions, charge imbalance is not a small modeling imperfection. Cheng described it as implying “infinite Coulomb interaction repulsion energy” and as the kind of error that would immediately destroy trust with experimentalists.
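A charge-balance check of this kind is cheap to express. The sketch below is a minimal illustration under stated assumptions: the oxidation-state table is a small hand-picked subset, and the function names are hypothetical rather than taken from any model's code.

```python
# Hand-picked common oxidation states; an illustrative subset, not a full table.
COMMON_STATES = {
    "Na": [1], "Cl": [-1], "Fe": [2, 3], "O": [-2], "Zn": [2], "B": [3],
}


def achievable_charges(states, count):
    """All total charges that `count` atoms can carry, one state per atom."""
    totals = {0}
    for _ in range(count):
        totals = {t + s for t in totals for s in states}
    return totals


def can_balance(formula):
    """formula: dict mapping element symbol to atom count."""
    totals = {0}
    for element, count in formula.items():
        per_element = achievable_charges(COMMON_STATES[element], count)
        totals = {t + c for t in totals for c in per_element}
    return 0 in totals


for f in ({"Na": 1, "Cl": 1}, {"Na": 1, "Cl": 4},
          {"Fe": 3, "O": 4}, {"Zn": 1, "B": 1, "O": 3}):
    print(f, can_balance(f))
```

Letting each atom pick its own state is what allows mixed-valence formulas such as Fe3O4 to pass, while NaCl4 and ZnBO3 have no assignment in this table that sums to zero.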

These failures define the two directions of Cheng’s work: structural-motif constraints through inpainting, and valence-constrained generation through a composition-first framework called CrysVCD.

Structural inpainting fixes the motif and lets the model fill in the rest

The first constraint Cheng discussed was structural: force the generated crystal to contain a desired motif while allowing the rest of the structure to remain flexible. The motivating application was quantum materials. Some systems with phenomena such as quantum spin liquids or flat electronic bands tend to involve geometrically frustrated lattices, including triangular, honeycomb, and kagome patterns. Cheng described quantum spin liquids as magnetic systems where spins are disordered but have long-range entanglement, unlike an ordinary magnet. Flat-band materials and related structures are also of interest for possible high-temperature superconductivity, though Cheng avoided overclaiming generated candidates as future superconductors.

The key question was not whether a model can sometimes generate such motifs. It was whether the user can demand them: “I just want these. I keep generating triangular materials and see what I like.”

An audience member asked where the variability comes from if the motif is fixed. Cheng’s answer is important: the motif is only one part of the crystal. Other atoms, lattice details, and unconstrained structural degrees of freedom remain free. The goal is to keep a specified template fixed while letting the base generative model fill in the surrounding material in a way that still respects the learned distribution of crystals.

The method borrows from image inpainting, particularly the RePaint algorithm. In image inpainting, a known region is fixed and the unknown region is generated. During denoising, the model repeatedly merges the constrained part with the unconstrained part so that the final image respects the known mask while sampling the missing content. Cheng’s group applied the same principle to crystal graphs. The masked atoms are the structural motif — for example, magnetic atoms forming an Archimedean lattice — while the unmasked atoms are free to move through the diffusion process.

For crystals, the constrained inputs include the lattice template type, such as triangular, honeycomb, or kagome; the atom type at the motif vertices, such as Mn, Fe, Co, Ni, Ru, Rh, Gd, Tb, Dy, or Yb; the bond length of the vertices; and the number of atoms in the unit cell. The base model can be any pretrained crystal diffusion model, including DiffCSP or MatterGen. During inference, the constrained part is pre-sampled with the fixed motif, corrupted to match the relevant diffusion timestep, and repeatedly merged with the unconstrained noisy sample as denoising proceeds.

Cheng stressed that this is training-free. The user does not need to retrain a model for every motif. Once there is a pretrained diffusion model for crystals, the inpainting procedure can impose a template at inference time. “You can draw anything like over there and just let the model do the work,” he said.

A question from the audience raised a subtle point: if only the template atoms are constrained, why do the other atoms in the examples appear to occupy high-symmetry positions? Cheng did not give a definitive quantitative answer. His interpretation was that the base model still preserves learned crystal-structure statistics. If the constrained atoms exhibit a significant symmetry, the model may infer from the data distribution that the remaining atoms should organize compatibly. He characterized that as a good sign that the generator is still respecting the general materials distribution.

Another audience member asked whether “adding” the constrained and unconstrained parts meant literal vector addition. Cheng clarified that it is replacement or merging, not vector addition. As in image inpainting, the masked region is taken from the constrained sample and the remaining region from the unconstrained sample.
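The replace-and-merge step can be sketched as follows. This is a simplified illustration of RePaint-style inpainting on fractional coordinates only: `fake_denoise` and `fake_noise` are stand-ins for a pretrained model and its noise schedule, and the resampling jumps used in RePaint are omitted.

```python
import numpy as np


def merge(known, generated, motif_mask):
    """Take motif sites from `known` and all other sites from `generated`."""
    return np.where(motif_mask[:, None], known, generated)


def inpaint(motif_coords, motif_mask, n_steps, denoise_step, noise_to_level, rng):
    x = rng.random(motif_coords.shape)                # start from random coordinates
    for t in reversed(range(n_steps)):
        x = denoise_step(x, t)                        # unconstrained reverse step
        known = noise_to_level(motif_coords, t, rng)  # motif at the matching noise level
        x = merge(known, x, motif_mask)               # replacement, not vector addition
    return x


# Stand-ins so the sketch runs; a real setup would call a pretrained model.
def fake_denoise(x, t):
    return x


def fake_noise(coords, t, rng):
    return (coords + 0.01 * t * rng.standard_normal(coords.shape)) % 1.0


motif = np.array([[0.0, 0.0, 0.5], [0.5, 0.0, 0.5], [0.0, 0.0, 0.0]])
mask = np.array([True, True, False])  # the third site is left free
out = inpaint(motif, mask, 10, fake_denoise, fake_noise, np.random.default_rng(0))
print(out)
```

Because the motif is re-imposed at every step with noise matched to that step, the free atoms are always denoised in the presence of the constraint, and at the final step the masked sites equal the clean template.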

The generated structures were then relaxed with density functional theory calculations. Cheng said the downstream verification involved quantum-mechanical calculations of energies and forces to test whether the structures could really be stable, and that the stability rate was more than 50%. He also said the method could discover interesting flat-band material candidates, while explicitly cautioning that he did not want to overclaim them as future superconductors.

CrysVCD’s gains combine two changes Cheng had not ablated apart

The second constraint was chemical valence. Cheng’s CrysVCD framework — Crystal generation with Valence-Constrained Design — changes the generation process from fully joint diffusion over atom types, coordinates, and lattice to a two-step workflow. First, generate a chemically valid formula with balanced valence. Second, generate a crystal structure conditioned on that composition.

That design choice matters, but Cheng acknowledged a live attribution problem. A Teams question asked what fraction of CrysVCD’s stability gains came from decomposing generation into composition-first and structure-second stages, versus the valence constraint itself. The architecture changes both. Cheng answered directly: they had not performed the ablation dropping the valence constraint.

The move is still simple and consequential. Cheng argued that a diffusion model should not spend expensive structure-generation steps on formulas that violate elementary charge-balance rules. A language model can generate a short chemical formula in fewer than five steps, while a diffusion model may require roughly 1,000 passes through a large graph neural score network. If the formula is invalid, it can be rejected before the expensive stage.

~1000 steps: diffusion-model inference scale Cheng contrasted with fewer than five formula-generation steps

CrysVCD begins by explicitly annotating chemical valence in the dataset. A formula is tokenized into valence-specific elemental tokens. Sodium chloride becomes a sequence equivalent to start, Na+, count 1, Cl−, count 1, end. Fe3O4 becomes more interesting because the iron atoms can have different valences: Fe2+ and Fe3+ are treated as distinct tokens, with counts corresponding to the formula. Cheng described this as a way to make the model understand that different oxidation states of the same element are chemically different.
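A tokenizer along these lines might look like the sketch below. The token spellings (`<start>`, `<end>`, ASCII `+` and `-`) and the function names are assumptions for illustration; the actual CrysVCD vocabulary is not specified in the talk summary.

```python
def ion_token(element, charge):
    """Format a valence-specific token such as 'Na+', 'Fe3+' or 'O2-'.

    Assumes a nonzero integer charge; neutral species are out of scope here.
    """
    sign = "+" if charge > 0 else "-"
    magnitude = abs(charge)
    return f"{element}{magnitude if magnitude > 1 else ''}{sign}"


def tokenize(species):
    """species: list of (element, charge, count) triples."""
    tokens = ["<start>"]
    for element, charge, count in species:
        tokens += [ion_token(element, charge), str(count)]
    tokens.append("<end>")
    return tokens


nacl = [("Na", 1, 1), ("Cl", -1, 1)]
fe3o4 = [("Fe", 2, 1), ("Fe", 3, 2), ("O", -2, 4)]
print(tokenize(nacl))
print(tokenize(fe3o4))
print(sum(charge * count for _, charge, count in fe3o4))  # net charge
```

With Fe2+ and Fe3+ as separate vocabulary entries, the net charge of a formula is a simple sum over tokens, which is what makes a hard valence check straightforward.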

Because chemical formulas are not ordered sequences in the ordinary linguistic sense, the probability should be permutation invariant: starting with chlorine or sodium should not change the underlying formula probability. Cheng said the group used data augmentation to address this. The tokenization was inspired by electronic-configuration representations from prior work, but his main point was operational: once formulas are discrete tokens, standard autoregressive language-model architectures can generate them.
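The talk summary does not say how the augmentation was implemented. One plausible form, shown here purely as an assumption, is to present each formula to the model in every ordering of its species so that no single ordering is privileged.

```python
from itertools import permutations


def orderings(species):
    """All orderings of a list of (element, charge, count) triples."""
    return [list(p) for p in permutations(species)]


fe3o4 = [("Fe", 2, 1), ("Fe", 3, 2), ("O", -2, 4)]
augmented = orderings(fe3o4)
print(len(augmented))
print(augmented[0])
print(augmented[-1])
```

For formulas with many species, enumerating all orderings grows factorially, so sampling a few random shuffles per training example would be the more practical variant of the same idea.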

Only after a formula passes the valence check does the geometric diffusion model perform crystal structure prediction. In Cheng’s summary, the diffusion model “only performs crystal structure prediction from this given chemical composition.” The framework treats alloys and ionic or covalent compounds separately, and it is designed as a plug-in style enhancement rather than a replacement for state-of-the-art diffusion models.

The claimed benefits are higher efficiency, explicit chemical validity, and more control. A user can generate formulas until one is desirable, then predict structures. The same process can be guided by properties so long as those properties can be computed and verified.

The valence constraint is not free. In response to a Teams question, Cheng addressed the concern that hard valence constraints could exclude genuinely novel chemistry, unusual oxidation states, ambiguous valence assignments, strongly correlated systems, charged defects, or other complex materials. He agreed that the constraint loses some cases. In their original database, the group could assign valence annotations to 98% of entries and dropped the remaining 2%. He described that as a limitation of the constraint, not something to hide.

But he defended valence balance as useful for a large fraction of materials. After discussion with experimentalists, Cheng said, the group concluded that many materials still satisfy this constraint, and that it is worth encoding. He also observed that he has started seeing fewer papers with charge-imbalanced generated materials, though he did not claim to know whether that was because of this work or because models have improved.

Stability screening needs more than distributional similarity

Cheng was critical of evaluations that only ask whether generated structures resemble a benchmark distribution. Earlier works, he said, often focused on validation-set distributional similarity, partly because they were conference papers. But a material-design model should be checked for stability.

CrysVCD therefore adds a more systematic stability workflow. The loose stability metric is energy above hull. The stricter screen is phonon stability. Cheng explained phonon bands as eigenvalues of the Hessian of the energy landscape, evaluated across vibrational modes and momenta. A stable structure should not have negative modes beyond tolerated numerical or minor-instability thresholds. If the Hessian has negative eigenvalues, the structure is at a saddle point or transition state rather than a true stable minimum.
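The minimum-versus-saddle distinction can be illustrated with a toy eigenvalue check. This is a deliberate simplification: a real phonon calculation diagonalizes mass-weighted dynamical matrices at many wavevectors, and the tolerance value below is a hypothetical stand-in for the soft threshold.

```python
import numpy as np


def is_dynamically_stable(hessian, tol=1e-3):
    """True if no eigenvalue of the symmetric Hessian falls below -tol."""
    eigenvalues = np.linalg.eigvalsh(hessian)
    return bool(eigenvalues.min() >= -tol)


minimum = np.array([[2.0, 0.0], [0.0, 1.0]])   # curvature is positive in every direction
saddle = np.array([[2.0, 0.0], [0.0, -1.0]])   # one direction lowers the energy
nearly_flat = np.array([[2.0, 0.0], [0.0, -1e-5]])  # within the soft threshold

for name, h in (("minimum", minimum), ("saddle", saddle), ("nearly flat", nearly_flat)):
    print(name, is_dynamically_stable(h))
```

The tolerance is what separates a strict greater-than-zero rule from a soft one: tiny negative eigenvalues are treated as numerical noise or minor instability rather than as a failed structure.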

Computing these quantities directly with density functional theory or density-functional perturbation theory can be costly. CrysVCD instead uses a machine-learning interatomic potential, MatterSim, for on-the-fly evaluation of energy, forces, energy above hull, and phonon stability. Cheng described MatterSim as a Microsoft model that can take a structure and quickly return energies and forces with much lower computational cost than DFT. His slide cited benchmarks on energy, forces, and phonons.

CrysVCD also uses conditional generation to improve stability rates. Cheng noted that materials databases mostly contain stable materials; they do not provide a rich set of negative examples. His analogy was exam practice: it is not enough to know correct answers; one also benefits from knowing wrong answers and how they go wrong. The group generated materials with unconditional CrysVCD, evaluated them, labeled them as stable or unstable, and then fine-tuned a conditional model with both positive and negative examples. The conditional model can then be guided toward stable regions.

The results Cheng presented showed CrysVCD significantly outperforming DiffCSP on stability metrics, with an 85% rate for energy above hull below 0.1 eV per atom and a 68% rate for phonon stability.

Stability metric	CrysVCD result Cheng reported
Energy above hull below 0.1 eV/atom	85%
Phonon stability	68%

The stability results shown on Cheng’s CrysVCD evaluation slide.

He said later work compared against MatterGen and performed better, but he immediately qualified the comparison as unfair because MatterGen was trained on a much larger proprietary database that his group did not have access to.

An audience member pressed on whether the ML interatomic potential may be out of distribution for novel generated compounds, producing errors in phonon dispersion. Cheng agreed that it “definitely” could. The group did not fully check the issue because of cost. His defense was comparative: even if the model has systematic error, the same evaluation can compare different generative models under the same screening procedure, and CrysVCD remains most stable under that metric. He also said modern MLIPs have benchmarks suggesting reliability for phonon-related calculations, and that the group ran a crude internal benchmark that looked reliable, though it was not in the paper.

Another question asked how phonon stability differs from regular crystal energy stability. Cheng’s answer separated thermodynamic and dynamical checks. Energy above hull asks whether a material is low enough in energy relative to decomposition products. Phonon stability asks whether the proposed local structure is dynamically stable under vibrations. In conventional terms, one expects three zero modes from translational degrees of freedom and the remaining vibrational modes to be nonnegative. The group did not evaluate only the Gamma point or a path interpolation; Cheng said they used a full 3D grid, with a soft threshold rather than a strict greater-than-zero rule to allow minor instabilities, especially in layered materials. He added that the work did not consider pressure-stabilized layered cases and that most generated materials were bulk.

Property guidance turns the constrained generator back toward discovery

Cheng’s final application was not just making valid and stable samples, but using CrysVCD for functional materials discovery. Once the model can generate chemically feasible and more stable candidates, it can be guided toward target properties using classifier-free guidance, provided those properties can be labeled or screened.
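Classifier-free guidance combines a conditional and an unconditional prediction from the same network at each denoising step. The sketch below uses one common convention for the guidance weight `w`; other papers parameterize it differently, and the arrays here are placeholder numbers rather than real model outputs.

```python
import numpy as np


def guided_prediction(uncond, cond, w):
    """Classifier-free guidance.

    w = 0 gives the unconditional prediction, w = 1 the conditional one,
    and w > 1 extrapolates further toward the condition.
    """
    return uncond + w * (cond - uncond)


uncond = np.array([1.0, 0.0])   # placeholder prediction without the property label
cond = np.array([3.0, -1.0])    # placeholder prediction with the property label

for w in (0.0, 1.0, 2.0):
    print(w, guided_prediction(uncond, cond, w))
```

Larger weights push samples harder toward the target property at the cost of diversity, which is why guided candidates still need the downstream stability and property checks.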

The examples Cheng gave were high thermal conductivity and high dielectric constants. The workflow is to label data, build or use a fast surrogate model for screening generated materials, and then verify promising candidates with the same labeling method, including DFT. A slide for the thermal-conductivity application showed filtering toward binary or ternary, light-mass, non-metal candidates, with 368 candidates in the depicted flow. Cheng said the candidates found by the model were verified by the labeling method and matched “pretty nice.”

The broader point was that constraints are not meant to narrow generation into a hand-coded search over known materials. They are meant to remove failures that should never have survived in the first place, so the model’s sampling budget can be spent on chemically and physically credible regions. Cheng described CrysVCD as “designed not to replace SOTA models, but more like a plug-in module to enhance performance of all diffusion models.” In his comparison to other state-of-the-art models besides MatterGen, he emphasized explicit chemistry and physics-inspired knowledge, two-step generation, higher fidelity, and greater control over outputs.

During Q&A, Cheng also gave practical scale details. The dataset used for model training was about 40,000 to 50,000 structures including validation. The trained models were on the order of hundreds of thousands of parameters to about one million, and training took roughly one to two days on one A100 GPU once the setup worked. He characterized the size as similar in spirit to MatterSim, though not the same architecture or exact scale.

The final technical message was that diffusion can be applied to periodic crystals by representing them through atom types, fractional coordinates, and lattice; that property-guided generation can use classifier guidance, classifier-free guidance, and other guidance methods; and that, in the current materials-data regime, human prior knowledge remains valuable. Cheng’s view was not that constraints replace generative modeling. It was that they make generative modeling usable for a domain where invalid samples are cheap to produce but expensive to take seriously.
