
AI Evaluations Give Philanthropy a Lever Over What Developers Optimize

B Cavello · The Aspen Institute · Thursday, May 7, 2026 · 10 min read

Aspen Digital’s B Cavello argues that philanthropy should treat AI evaluations as a way to shape the AI ecosystem, not merely as technical measurements or benchmark leaderboards. In a briefing for philanthropic leaders convened with Siegel Family Endowment, Cavello says funders can influence what AI developers optimize for, support outside accountability through audits and related tools, and help users judge when systems are appropriate for their needs.

Evaluation is a philanthropic lever, not just a technical measurement

B Cavello frames AI evaluations as more than technical scorekeeping. For philanthropy, the relevant question is how evaluation can shape the AI ecosystem: what goals developers pursue, who has tools to hold systems accountable, and whether users can understand when an AI system is appropriate for their context.

Cavello distills the philanthropic role into three levers: goal definition and field-shaping, accountability and oversight, and appropriate-use education.

The first lever is to fund mechanisms that define what the field tries to achieve. Benchmarks are central here because they can translate an aspiration into something measurable and visible. Cavello uses DeepMind’s AlphaFold as an example of work that responded to a benchmark, challenge, or competition around protein structure prediction. In Cavello’s account, the benchmark helped identify a problem as a worthy target.

We can define, through the language of evaluations like benchmarks, what we actually aspire toward with AI.
B Cavello · Source

That is the field-shaping function Cavello wants funders to notice. If developers define their goals in order to pursue a benchmark or climb a leaderboard, then benchmarks are not merely after-the-fact measurements. They become part of the machinery that tells builders what counts as progress.

The second lever is accountability and oversight. Philanthropy can fund people and institutions trying to hold AI developers and deployers accountable, including through audits, red-teaming, investigative work, interpretability research, and other evaluation practices that help outside actors understand whether systems are working as intended and whether they “work for all of us.”

The third lever is appropriate-use education. Evaluations can help users decide whether a system is likely to work for their needs. Cavello mentions the WEVAL team at Collective Intelligence Project as an example of work oriented toward user evaluation: helping people understand which models are right for them. In this framing, evaluations are not only instruments of regulation or technical competition. They can also be tools for education and informed choice.

Cavello gives funders and investors a distinct role because they can “turn up the volume” on different stakeholders. Philanthropy, in that account, is not merely a potential consumer of evaluation results. It can shape which evaluations are built, whose questions get answered, and which parts of the AI ecosystem receive scrutiny.

Benchmarks are visible, but they are only one kind of evaluation

Cavello defines AI evaluations as measurements used for three broad purposes: making comparisons, defining thresholds, and managing AI systems. A benchmark can help answer whether one model is “better,” “faster,” or “stronger” than another. A regulator might use an evaluation to create a measurable cutoff between acceptable and unacceptable options. A builder or deployer might use evaluations to monitor whether a system is behaving as expected.

That breadth is the point. Cavello emphasizes that “evals” mean different things depending on the stakeholder. Their roots are in academic AI research, but the practice has moved into industry as companies have brought more AI R&D in-house. Evaluations also matter for regulators, journalists, users, investors, and funders. Each group asks different questions of the same broad category of tools.

For academia and industry, evaluations may be a way to compare technical progress or improve systems. For oversight actors, they may be a way to inspect whether systems are behaving safely or fairly. For journalists, they may be a way to investigate how companies and their models behave in the world. For users, they may be a way to make informed choices about which model is appropriate for their own needs.

The important distinction is that benchmarks are one kind of evaluation, not the whole evaluation system. Cavello’s frame is “benchmarks and beyond”: benchmarks have become the visible part of the discussion, but they sit within a broader ecosystem of measurements, audits, tests, studies, and monitoring practices.
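To make the comparison, threshold, and monitoring purposes concrete, here is a minimal Python sketch. The model names, scores, and the 0.90 cutoff are invented for illustration and are not drawn from Cavello’s briefing.

```python
# Purely illustrative sketch of three ways an evaluation score might be used,
# mirroring the comparison / threshold / management framing above.
# All names and numbers are hypothetical.

scores = {"model_a": 0.87, "model_b": 0.91}  # hypothetical benchmark scores

# 1. Comparison: which model scores higher on this benchmark?
best = max(scores, key=scores.get)
print(f"Higher-scoring model on this benchmark: {best}")

# 2. Threshold: a regulator or buyer might set a measurable cutoff.
ACCEPTANCE_THRESHOLD = 0.90
acceptable = {m: s >= ACCEPTANCE_THRESHOLD for m, s in scores.items()}
print(f"Meets the 0.90 cutoff: {acceptable}")

# 3. Management: a deployer might track the same metric over time
#    and flag drift once the system is in production.
weekly_scores = [0.91, 0.90, 0.86]  # hypothetical monitoring readings
if weekly_scores[-1] < ACCEPTANCE_THRESHOLD:
    print("Monitoring alert: latest score fell below the agreed cutoff")
```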

Benchmarks have moved from research infrastructure into public discourse

Benchmarks have broken out of their usual place in AI research communities. Cavello points to a trends chart for “AI benchmark” over roughly five years, from July 2020 through January 2025, that rises from what they describe as “basically nothing” to a sudden spike. Their reading of the chart is that benchmarks have become “a hot topic in the public discourse” and a more salient idea for people outside the research community.

5 years
period shown in Cavello’s “AI benchmark” trends chart, from July 2020 to January 2025

That mainstreaming matters because benchmarks can influence what builders try to create. If developers set goals in order to climb a leaderboard, then a benchmark can affect what counts as progress.

Cavello situates the current evaluation problem inside a compressed history of AI: expert systems, deep learning, and attention-based transformer models. Expert systems tried to encode rules directly: what makes a cat, what defines grammar, what a machine-interpretable description of the world should contain. Deep learning changed the approach by allowing systems to learn patterns from large amounts of example data rather than requiring explicit rules. Transformer models, beginning in the late 2010s in Cavello’s account, are described as a further rethinking or extension of deep learning that has produced tremendous changes in how AI systems work and feel to end users.

The evaluation ecosystem, in Cavello’s framing, is rooted in the deep learning era but is now being reinterpreted for the transformer era. That distinction matters because evaluation practices developed for one mode of AI may not automatically answer the questions raised by newer systems and deployments.

The useful map is the AI lifecycle, not a single leaderboard

B Cavello uses the AI lifecycle to show that evaluations can occur at many points, not only after a model has been built. The lifecycle begins with defining a goal, then gathering data, selecting a model, building the model, deploying it, and monitoring it in the world. Feedback loops connect the later stages back to earlier decisions.

The map matters because it places different kinds of evaluation at different points of leverage. Some evaluations require internal access to the model or data. Others can be conducted after a system is already deployed. Some are useful to developers; others are useful to regulators, integrators, users, or outside oversight actors. The slide Cavello uses makes this explicit by pairing each lifecycle stage with example evaluations and the stakeholders who can use them.

Lifecycle stage | Evaluation examples | Stakeholders named
Define goal | Benchmarks; potentially broader benchmark-and-goal processes | Developers; Cavello argues integrators, oversight actors, and users should also be involved
Gather data | Data analysis, including analysis of synthetic data | Developers, integrators, oversight
Select model | Formal methods analysis; model efficiency testing | Developers, specific integrators and users
Build model | Benchmarks; red-teaming; interpretability analysis | Developers, integrators, users, oversight
Deployment | Integration testing; staged release data | Developers, integrators
Monitoring | Longitudinal impact studies; user data; audits | Developers, integrators, users, oversight
Cavello’s lifecycle map places evaluations across the development and deployment process, not only at the model benchmark stage.

The bottom of the stack — monitoring — is especially important for actors without internal access to an AI-building institution. Once a system is already “out in the wild,” researchers, regulators, users, and oversight groups may still be able to study its effects. Cavello names longitudinal impact studies, audits, and user-data analysis as examples. Monitoring can focus on the system itself, but it can also focus on broader social impacts.

Cavello contrasts this with the kind of monitoring available to major AI companies. In their example, “the OpenAIs of the world” are not simply waiting to observe downstream effects indirectly; they can monitor user behavior daily, including changes in how people prompt systems. That kind of internal access changes which evaluations are possible.
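As a rough illustration of what that internal access enables, the sketch below computes a simple day-over-day prompting metric from a usage log. The log records and the metric are hypothetical; real internal monitoring would be far richer than this.

```python
# Hypothetical sketch of the kind of internal monitoring only a developer
# with access to usage logs could run: tracking how prompting behavior
# shifts day to day. The log records are invented for illustration.
from collections import defaultdict
from statistics import mean

prompt_log = [  # hypothetical (date, prompt) records
    ("2025-01-01", "summarize this memo"),
    ("2025-01-01", "write a haiku about rain"),
    ("2025-01-02", "draft a detailed project plan for migrating our billing system"),
    ("2025-01-02", "explain transformers to a ten year old"),
]

words_per_day = defaultdict(list)
for date, prompt in prompt_log:
    words_per_day[date].append(len(prompt.split()))

for date in sorted(words_per_day):
    print(date, round(mean(words_per_day[date]), 1), "avg words per prompt")
```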

Deployment-stage evaluations include integration testing and staged release data. Cavello describes staged release as a kind of limited preview before a model is released to the public, often for AI safety or responsible AI reasons. Sora is Cavello’s example of a system previewed to a smaller set of actors before broader release. For developers and integrators, information from staged releases can inform whether and how to incorporate a model into a product.

The model-building stage is where the most familiar evaluation practices sit: benchmarks, red-teaming, and interpretability analysis. Cavello describes this as a stage where “everyone is invited to the party.” Developers use the information, but so do integrators, users, and oversight actors. Regulators and other oversight bodies may use model-stage evaluations to “poke at and prod and test” models and ask whether they do what society wants.

Further upstream, model selection may involve formal methods analysis or efficiency testing. Efficiency can matter when choosing among models because compute and energy use may be relevant constraints. Formal guarantees or constraints may matter for particular users, especially in safety-critical contexts. Cavello gives the Air Force and safety-critical industries as examples of settings where a user might be less interested in general-purpose large language model capabilities and more interested in models with particular guarantees of performance.
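One narrow slice of efficiency testing can be sketched in code: timing repeated calls to a model and reporting average latency. The predict_fn below is a placeholder standing in for a real model, so the numbers it produces only illustrate the mechanics, not any actual system.

```python
# Minimal latency-measurement sketch for model efficiency testing.
# predict_fn is a stand-in; in practice it would wrap a real model call.
import time

def predict_fn(text: str) -> str:
    # Placeholder for a real model call; sleeps to simulate inference time.
    time.sleep(0.01)
    return "label"

def mean_latency_ms(fn, inputs, runs: int = 3) -> float:
    timings = []
    for _ in range(runs):
        for x in inputs:
            start = time.perf_counter()
            fn(x)
            timings.append((time.perf_counter() - start) * 1000)
    return sum(timings) / len(timings)

sample_inputs = ["short query", "a somewhat longer query about model efficiency"]
print(f"mean latency: {mean_latency_ms(predict_fn, sample_inputs):.1f} ms")
```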

At the data-gathering stage, evaluations can ask what data went into the system. Cavello points to concern about “really problematic content” entering training datasets and notes that synthetic data is increasingly used for training. That makes dataset composition itself an object of evaluation: what was sourced from the wild, what was generated, and what is actually present in the data used to build the system.
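A dataset-composition check of this kind can be quite simple in outline. The sketch below, with invented records and a hypothetical blocklist, reports the share of synthetic examples and counts records matching flagged terms; real pipelines would operate at far larger scale and with more careful detection methods.

```python
# Illustrative sketch of treating dataset composition as an object of
# evaluation: how much of a training set is synthetic, and how many
# records match a (hypothetical) blocklist of problematic terms.
records = [  # invented examples; real pipelines would stream from storage
    {"text": "a recipe for lentil soup", "source": "web_crawl"},
    {"text": "generated dialogue about chess", "source": "synthetic"},
    {"text": "forum post containing slur_x", "source": "web_crawl"},
]
BLOCKLIST = {"slur_x"}  # hypothetical flagged terms

synthetic_share = sum(r["source"] == "synthetic" for r in records) / len(records)
flagged = [r for r in records if any(term in r["text"] for term in BLOCKLIST)]

print(f"synthetic share: {synthetic_share:.0%}")
print(f"records matching blocklist: {len(flagged)}")
```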

The most upstream stage is goal definition. Cavello acknowledges that there are not many conventional evaluations for assessing the goal itself. But benchmarks still influence this stage because developers may define goals around benchmark performance. That is the point at which Cavello most directly challenges the developer-centric view of AI building: deciding what an AI system is for should not be treated as solely the developer’s domain.

When benchmarks define the target, goal-setting becomes a societal question

For Cavello, the key philanthropic opportunity is not just to fund more tests of already-built systems. It is to help shape the goals toward which AI development is optimized.

Benchmarks can function as field-shaping instruments because they make some goals measurable and visible. Cavello’s AlphaFold example is meant to show how a benchmark, challenge, or competition can identify a worthy problem and draw effort toward it. The same mechanism can operate more broadly: if a benchmark defines what “good” looks like, it can influence what is built.

What is this AI even being built to do?

B Cavello

That is why Cavello argues more people should be involved in goal definition. Their work on “community-aligned AI benchmarks” is presented as part of that effort. The underlying question is not merely whether a system performs well on a test, but “what do we actually want to achieve here?” and “what is this AI even being built to do?”

This is where philanthropy enters as more than a funder of technical infrastructure. Funders can amplify stakeholders who would otherwise have less influence over what AI systems are built to do. The mechanism is evaluation: supporting benchmarks, oversight tools, and user-facing measurements that reflect a broader set of questions than developer performance alone.

The right evaluation depends on what can be known, when, and by whom

B Cavello identifies four practical factors that determine which evaluation makes sense: cost, time, access, and goals. Each constraint changes what can be known, how quickly it can be known, and who can know it.

Cost is the most immediate constraint. Some evaluations are comparatively lightweight; others require sustained research. Cavello gives the example of a longitudinal study on how AI use changes workers’ perceptions of themselves or their relationships with employers. That kind of study may require surveying people over time and is likely to be expensive. Cavello notes that people are trying to develop new mechanisms for conducting this work, but says the ecosystem has to be realistic about constraints.

Time is closely related. Some feedback loops are too slow to guide near-term decisions. Cavello points to the lingering effects of the COVID-19 pandemic on education as an example of how long it can take to understand the impact of a large social disruption. Technology impacts may take years or decades to become clear. Because of that, evaluators may sometimes choose a less perfect measure that shortens the feedback loop.

Access also determines what can be evaluated. An AI developer may be able to inspect far more of a system than an outside investigator. Cavello contrasts internal access with the position of an investigative journalist probing a system to investigate its potential for bias. The journalist may only have access to the end-user product. They may not know the model weights, the training data, or much about the underlying system. That limits which evaluations are possible.
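The sketch below illustrates one black-box technique available with only end-user access: sending matched prompts that differ in a single attribute and comparing the responses. The query_model function is a hypothetical stand-in for whatever product interface the investigator can actually reach, not any real API.

```python
# Sketch of a black-box probe an outside investigator might run with only
# end-user access: send paired prompts that differ in a single attribute
# and compare responses for divergence.
def query_model(prompt: str) -> str:
    # Placeholder: in a real audit this would call the deployed product.
    return "approved" if "Group A" in prompt else "needs further review"

PAIRS = [
    ("Evaluate this loan application from a Group A applicant.",
     "Evaluate this loan application from a Group B applicant."),
]

for prompt_a, prompt_b in PAIRS:
    out_a, out_b = query_model(prompt_a), query_model(prompt_b)
    if out_a != out_b:
        print("Divergent responses for a matched pair:")
        print("  A:", out_a)
        print("  B:", out_b)
```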

Goals determine the appropriate tool. An investigative journalist, a product manager, a regulator, and a user are not necessarily trying to answer the same question. A company monitoring whether a product is performing as intended may need a different evaluation from a regulator trying to define a threshold or a user trying to decide whether a tool is right for a particular task.

This is Cavello’s argument against treating “evals” as a single category with a single best method. The right evaluation depends on the question, the available access, the acceptable cost, and the time horizon.
