A 4B Model Beat Qwen3 235B by Learning Tool Discipline

Kobie Crawford · AI Engineer · Wednesday, June 10, 2026 · 9 min read

Kobie Crawford of Snorkel argues that some enterprise AI failures are less about model size than about whether models behave correctly inside constrained tool environments. In Snorkel’s FinQA work with UC Berkeley’s rLLM/Agentica, a 235B Qwen model hallucinated a financial answer after failed SQL calls, while a 4B model fine-tuned with reinforcement learning learned to inspect tables, correct errors and calculate from retrieved data. Crawford presents the result as evidence that targeted RL, structured evals and behavior-specific training can outperform simply moving to a larger model for this class of financial analysis task.

The failure was not reasoning depth; it was tool discipline

Kobie Crawford framed the work as a challenge to a common enterprise pattern: when a model underperforms on a production task, teams often reach for a larger model on the assumption that more general capability and deeper reasoning will solve the problem. In financial analysis, he argued, that can be the wrong intervention. The task may not require a broadly more capable model; it may require a model that behaves correctly inside a constrained tool environment.

The specific target was financial question answering that requires tool use: discovering available tables, inspecting schemas, executing SQL across financial data, retrieving the right numbers, and then performing arithmetic to produce a verifiable answer. Crawford’s claim was not that large models are intrinsically unnecessary. It was narrower: for this class of enterprise task, the bottleneck can be procedural behavior rather than core language knowledge or mathematical intelligence.

That distinction matters because the production constraints are not academic. Crawford pointed to cost, speed, security, control, and the path from proof of concept to deployment. A team may prototype with a large capable model, then face the practical question of how to run the system in production. Smaller models are cheaper at inference time, require less compute, can be faster, and are easier to deploy on-premises or inside an enterprise boundary where financial or healthcare data cannot be casually sent to an external inference API.

The Snorkel and UC Berkeley rLLM/Agentica work tested whether reinforcement learning could make a small model behave well enough in this environment to match or beat a far larger one. Crawford described RL as the right phase of training because the desired change was behavioral: use the available tools in the right order, inspect before querying, recover from errors, and calculate from retrieved facts rather than guessing.

He used the rLLM team’s phrase “the Terence Tao effect” to describe the mismatch. A very large model may have the equivalent of extreme general mathematical capability, but a financial analyst’s tool-use task does not require “all the kinds of math.” It may require disciplined SQL, schema inspection, and arithmetic. A sledgehammer can crack a walnut, but it is not necessarily the best tool for the job.

A 235B model guessed tables, failed twice, then hallucinated

The motivating failure case was a question: “What is the YoY growth rate of YouTube ads revenue from 2023 to 2024?” In the Snorkel FinQA environment, Qwen3-235B was given tools it could have used to discover tables and inspect schemas. According to the trace Crawford showed, it did not begin by calling the table-discovery tool. It guessed a table name, queried revenue_breakdown, and received an error that the table could not be found.

It then guessed again, querying us_gaap_RevenueTable for a YouTube segment. That table also could not be found. After two failed tool calls, the model produced an answer anyway: “YouTube ads revenue grew approximately 20-25% year-over-year in 2024...” The displayed trace noted that the actual value was about 14.7%.

The important point in Crawford’s telling was not merely that the model was wrong. It was wrong in a way that exposed an operational failure mode. The model skipped schema inspection, failed to learn table structures and column names before querying, executed poorly formed SQL, retrieved no usable data, and then fabricated a response. As one slide put it: “Greater reasoning capability does not guarantee better task performance when tool use is undisciplined.”

Crawford contrasted this with what the task actually required. The model did not need more abstract reasoning to know how to compute year-over-year growth once it had the numbers. It needed to use the environment correctly: discover the table, inspect its columns, query the correct row and year columns, then calculate the percentage change.

The training run was small, structured, and behavior-specific

Snorkel’s approach began with a curated question-answer dataset built for the financial analysis environment. Crawford emphasized Snorkel’s data-quality process: expert contributors are brought into the loop, including domain experts such as financial analysts, PhD-level specialists, and experienced industry practitioners. The objective was not just to generate examples, but to verify that the tasks were answerable and correctly specified.

The dataset included both single-table and multi-table examples. The single-table set contained 4,030 training examples, 522 validation examples, and 558 test examples. The multi-table set contained 991 training examples, 126 validation examples, and 131 test examples. Multi-table tasks required using two to five tables and computing 10 to 20 metrics.

The verification process checked whether referenced columns and rows existed in the source table, whether query results were correct, and whether mathematical calculations were correct. Crawford presented that verification as central to the result: the model needed examples that taught the behavior the researchers wanted, not just more data in the abstract.
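As a rough illustration of those three checks, the sketch below validates one year-over-year growth example against a toy table. The record layout (`row`, `columns`, `values`, `answer`), the function name, and the tolerance are assumptions made for this example, not Snorkel's actual pipeline. The figures are the YouTube ads values from the trace discussed in this article.

```python
# Toy source table: row label -> {column name -> value}. The values come from
# the YouTube ads trace in this article; the layout itself is hypothetical.
TABLE = {"youtube ads": {"2023": 31510, "2024": 36147}}


def verify_growth_example(example, table, tolerance=0.05):
    """Return a list of problems; an empty list means the example passed.

    Assumes the example references exactly two year columns (previous, current).
    """
    # Check 1: the referenced row and columns must exist in the source table.
    row = table.get(example["row"])
    if row is None:
        return [f"row not found: {example['row']}"]
    missing = [col for col in example["columns"] if col not in row]
    if missing:
        return [f"columns not found: {missing}"]

    problems = []
    # Check 2: the recorded query result must match what the table holds.
    previous, current = (row[col] for col in example["columns"])
    if [previous, current] != example["values"]:
        problems.append("recorded values do not match the table")

    # Check 3: the recorded answer must match a fresh calculation.
    recomputed = (current - previous) / previous * 100
    if abs(recomputed - example["answer"]) > tolerance:
        problems.append(f"answer {example['answer']} != recomputed {recomputed:.2f}")
    return problems


good = {"row": "youtube ads", "columns": ["2023", "2024"],
        "values": [31510, 36147], "answer": 14.7}
bad = dict(good, columns=["revenue", "2024"])  # references a missing column

print(verify_growth_example(good, TABLE))
print(verify_growth_example(bad, TABLE))
```

Returning a list of problems rather than a single boolean keeps the reason for a rejection visible, which is what a human reviewer would need to repair the example.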

Dataset	Train	Validation	Test	Task shape
Single table	4,030	522	558	Queries over individual tables
Multi-table	991	126	131	Requires 2–5 tables and 10–20 computed metrics

Snorkel’s FinQA training data was split between single-table and multi-table financial QA tasks.

The RL setup was similarly concrete. The base model was Qwen3-4B-Instruct-2507. Training used GRPO with a binary reward: 1 for a correct answer, 0 for an incorrect one. The framework was rLLM, using Snorkel’s FinQA environment. The run used 1,024 concurrent environments and an LLM judge, gpt-4o-mini, to evaluate correctness. Compute was 8 H100s for about 21 hours, at a cost under $500 per run.
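To make the binary reward concrete, here is a minimal sketch of how a GRPO-style update can turn 0/1 rewards into per-rollout advantages by comparing each rollout with the others sampled for the same question. It follows the commonly described mean-and-standard-deviation normalization; the article does not say which variant rLLM uses, and the function name, group size, and `eps` value are assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its own group (same question)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Eight hypothetical rollouts for one question: 1 = judged correct, 0 = incorrect.
rewards = [1, 0, 0, 1, 0, 0, 0, 0]
advantages = group_relative_advantages(rewards)

# Correct rollouts get a positive advantage, incorrect ones a negative advantage.
for reward, advantage in zip(rewards, advantages):
    print(reward, round(advantage, 3))

# A group where every rollout gets the same reward carries no learning signal.
print(group_relative_advantages([0, 0, 0, 0]))
```

Under this scheme a question the model always gets right or always gets wrong contributes nothing to the update, so the signal comes from questions where some rollouts succeed and others fail.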

Crawford’s broader point was that reinforcement learning for useful model behavior does not have to be reserved for extremely expensive frontier-scale training. For teams already considering smaller hosted models, on-prem deployment, or controlled production systems, he presented this as evidence that targeted RL can be tractable.

The environment forced the model to solve the problem through tools

The FinQA environment Crawford described was built around four specialized tools. get_table_names discovers available tables for a specific company. get_table_info inspects schemas, columns, and sample values. sql_query executes SQL over financial tables. calculator evaluates mathematical expressions.

This structure matters because the environment makes correctness dependent on a chain of observable actions. A model is evaluated on whether it can identify the table, understand its schema, retrieve the relevant values, and compute the answer, rather than simply producing a plausible response from latent knowledge.

The environment also included benchmarks. Snorkel FinQA had 290 samples. FinQA-Reasoning, described as harder and requiring multi-table reasoning, had 79 samples. Crawford said the environment was self-contained, with no dependency on remote external data once deployed. He also said it had been published and was available through Prime Intellect infrastructure, OpenEnv, GitHub, and Hugging Face Spaces.

Tool	Purpose
get_table_names	Discover available tables for a specific company
get_table_info	Inspect schemas, columns, and sample values
sql_query	Execute SQL queries over financial tables
calculator	Evaluate mathematical expressions

The FinQA environment constrained financial QA into discover, inspect, query, and calculate steps.
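The four-step flow can be sketched with Python's standard library. This is a toy stand-in built on an in-memory SQLite database, not the published FinQA environment: only the four tool names come from the talk, while the function signatures, return shapes, and error messages are assumptions. The table holds only the two yearly values quoted in this article.

```python
import ast
import operator
import sqlite3

# Hypothetical in-memory stand-in for the environment's financial tables.
TABLE = "us_gaap_DisaggregationOfRevenueTableTextBlock"
conn = sqlite3.connect(":memory:")
conn.execute(f'CREATE TABLE {TABLE} (revenue_type TEXT, "2023" REAL, "2024" REAL)')
conn.execute(f"INSERT INTO {TABLE} VALUES ('youtube ads', 31510, 36147)")


def get_table_names(company_name):
    """Discover tables. This toy database holds one company, so the name is unused."""
    rows = conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
    return [name for (name,) in rows]


def get_table_info(table_name):
    """Return column names and one sample row, or an error for unknown tables."""
    if table_name not in get_table_names(company_name=None):
        return {"error": f"table not found: {table_name}"}
    columns = [row[1] for row in conn.execute(f'PRAGMA table_info("{table_name}")')]
    sample = conn.execute(f'SELECT * FROM "{table_name}" LIMIT 1').fetchone()
    return {"columns": columns, "sample": sample}


def sql_query(sql):
    """Run SQL and report failures as data so the caller can react to them."""
    try:
        return {"rows": conn.execute(sql).fetchall()}
    except sqlite3.Error as exc:
        return {"error": str(exc)}


_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}


def calculator(expression):
    """Evaluate +, -, *, / arithmetic without calling eval()."""
    def evaluate(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](evaluate(node.left), evaluate(node.right))
        raise ValueError("unsupported expression")
    return evaluate(ast.parse(expression, mode="eval").body)


# Discover, inspect, query, calculate.
tables = get_table_names("youtube")
info = get_table_info(tables[0])
result = sql_query(
    f'SELECT "2023", "2024" FROM {tables[0]} WHERE revenue_type = \'youtube ads\''
)
previous, current = result["rows"][0]
growth = calculator(f"({current} - {previous}) / {previous} * 100")
print(info["columns"], round(growth, 1))
```

In this sketch a guessed table name such as `revenue_breakdown` comes back from `sql_query` as an error dictionary rather than an exception, which mirrors the idea that a failed call is an observation the model is expected to act on.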

Crawford compared this kind of structured environment to other rollout environments developers may have used, and argued that getting started with RL is becoming easier as these environments become more accessible.

The fine-tuned 4B model won by inspecting, correcting, and calculating

The headline result was that RL fine-tuning raised Pass@1 on the Snorkel FinQA benchmark from 27.9% for the base Qwen3-4B-Instruct-2507 to 59.7% for the fine-tuned model, called rLLM-FinQA-4B. The 235B model, Qwen3-235B-A22B, scored 51.4%. Crawford summarized the result as a well-trained 4B model beating a 235B model once tool use was fixed.

Model	Size	Pass@1 on Snorkel FinQA
Qwen3-4B-Instruct-2507	4B	27.9%
Qwen3-235B-A22B	235B	51.4%
rLLM-FinQA-4B	4B	59.7%

The RL-trained 4B model outperformed the larger 235B model on the Snorkel FinQA benchmark.

The more revealing result was the behavioral trace on the same YouTube ads revenue question. Unlike the 235B model, rLLM-FinQA-4B began by calling get_table_names(company_name="youtube"). It received a list of available tables, including us_gaap_DisaggregationOfRevenueTableTextBlock. It then called get_table_info on that table and saw a schema with revenue_type and year columns for 2022, 2023, and 2024. The sample values included “youtube ads.”

The model then made a mistake. It issued a SQL query asking for a revenue column, but the table did not have that column; the years were columns. The environment returned an error: the column was not found, and the available columns included revenue_type, 2022, 2023, and 2024.

The trained model did not stop or hallucinate. It corrected the query, selecting "2023" and "2024" from the table where revenue_type = 'youtube ads'. The returned values were 31,510 for 2023 and 36,147 for 2024. It then called the calculator with (36147 - 31510) / 31510 * 100, receiving 14.71, and answered 14.7%.

That trace was the core of Crawford’s argument. The model was not flawless; it still made a SQL mistake. But the learned behavior was robust enough to recover. It discovered the available tables, inspected the schema, observed the error, corrected the query, and used a calculator rather than improvising a number.
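The recovery step in that trace can be replayed as a short script. This is a scripted imitation of the behavior, not the model or the real environment: SQLite stands in for the FinQA tables, the `run` helper is hypothetical, and the table contains only the two values quoted above.

```python
import sqlite3

table = "us_gaap_DisaggregationOfRevenueTableTextBlock"
conn = sqlite3.connect(":memory:")
conn.execute(f'CREATE TABLE {table} (revenue_type TEXT, "2023" REAL, "2024" REAL)')
conn.execute(f"INSERT INTO {table} VALUES ('youtube ads', 31510, 36147)")


def run(sql):
    """Return (rows, error) so a failed query becomes an observation, not a crash."""
    try:
        return conn.execute(sql).fetchall(), None
    except sqlite3.Error as exc:
        return None, str(exc)


# First attempt assumes a 'revenue' column that does not exist.
rows, first_error = run(
    f"SELECT revenue FROM {table} WHERE revenue_type = 'youtube ads'"
)

if first_error is not None:
    # Recover by reading the real column names, then query the year columns.
    columns = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
    print("error:", first_error, "| available columns:", columns)
    rows, _ = run(
        f'SELECT "2023", "2024" FROM {table} WHERE revenue_type = \'youtube ads\''
    )

# Calculate from the retrieved values instead of estimating.
previous, current = rows[0]
growth_pct = (current - previous) / previous * 100
print(f"{growth_pct:.1f}%")  # prints 14.7%
```

The point of the sketch is the branch after the failed query: the error message and the schema are treated as inputs to the next action, and the final number is computed only from values that were actually returned.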

Single-table training transferred to harder multi-table reasoning

The ablation result Crawford called surprising was that training only on single-table data performed best. The researchers compared three recipes: single-table only, single plus multi-table, and curriculum learning that began with single-table tasks and progressively added multi-table tasks. On the internal Pass@1 test, single-table only reached 66.3%, compared with 61.6% for mixed single-plus-multi training and 64.8% for the curriculum approach.

Training recipe	Internal Pass@1 test
Single table only	66.3%
Single + multi-table	61.6%
Curriculum: single → multi	64.8%

In the ablation Crawford showed, single-table-only training produced the strongest internal Pass@1 result.

The second surprise was that the single-table-trained model also improved on the harder multi-table FinQA-Reasoning benchmark. The base Qwen3-4B-Instruct-2507 scored 13.9%. Qwen3-235B-A22B scored 18.9%. rLLM-FinQA-4B scored 26.6%.

Model	Size	Pass@1 on FinQA-Reasoning
Qwen3-4B-Instruct-2507	4B	13.9%
Qwen3-235B-A22B	235B	18.9%
rLLM-FinQA-4B	4B	26.6%

The 4B model trained for tool discipline on simpler examples generalized to the harder multi-table reasoning benchmark.

Crawford interpreted this as evidence that the core bottleneck was not multi-step reasoning in the abstract. The model’s larger failure was poor tool discipline: over-querying, bad SQL, inefficient tool use, and failure to inspect the environment before acting. Training on single-step tool correctness fixed enough of that core behavior to generalize to more complex financial reasoning tasks.

The result does not imply that multi-table reasoning is easy or that larger models never matter. Crawford’s narrower conclusion was that teams should identify the specific behavior causing failure before assuming the answer is model size. In this case, disciplined tool use moved the smaller model past the larger one.

Rubric-based evals tell teams what data to write

Crawford closed by connecting the result to a broader data and evaluation practice at Snorkel: break evals into rubrics that expose which behavior failed. A final yes-or-no correctness score is useful for reinforcement learning, especially when GRPO consumes a single reward value. But a single score does not tell a team what dataset to build next.

A richer rubric can decompose the model’s response into separate questions. Did it inspect the schema? Did it use the right table? Did it reference existing columns? Did it form valid SQL? Did it retrieve the right values? Did it calculate correctly? Did it hallucinate after an error? By looking at those dimensions separately, Crawford said teams can identify the behavior to target with training data.
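A rubric of this kind can be sketched as a function that scores a recorded tool trace along several dimensions at once. The `ToolCall` record, the rubric keys, and the two traces below are assumptions for illustration, loosely modeled on the two YouTube ads traces described earlier. They cover only a subset of the questions listed above and are not Snorkel's rubric.

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    tool: str  # one of the four environment tool names
    ok: bool   # False when the environment returned an error


def score_trace(calls, final_answer_correct):
    """Break one episode into per-behavior checks instead of a single 0/1."""
    tools = [call.tool for call in calls]
    # Everything the model did before its first SQL query.
    first_sql = tools.index("sql_query") if "sql_query" in tools else len(tools)
    before_sql = tools[:first_sql]
    return {
        "discovered_tables_first": "get_table_names" in before_sql,
        "inspected_schema_first": "get_table_info" in before_sql,
        "retrieved_data": any(c.tool == "sql_query" and c.ok for c in calls),
        "used_calculator": "calculator" in tools,
        "final_answer_correct": final_answer_correct,
    }


# Two guessed queries that both failed, followed by a wrong answer.
large_model_trace = [ToolCall("sql_query", False), ToolCall("sql_query", False)]

# Discover, inspect, one failed query, a corrected query, then the calculator.
small_model_trace = [
    ToolCall("get_table_names", True),
    ToolCall("get_table_info", True),
    ToolCall("sql_query", False),
    ToolCall("sql_query", True),
    ToolCall("calculator", True),
]

print(score_trace(large_model_trace, final_answer_correct=False))
print(score_trace(small_model_trace, final_answer_correct=True))
```

Aggregating these dictionaries over a benchmark would show which behavior fails most often, which is the information a single correctness score hides.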

That is the practical lesson he drew from the FinQA work. The researchers did not simply ask whether the small model was weaker than the large model. They looked at the failure mode, found that tool use was the bottleneck, and trained the smaller model on data that reinforced the right behavior.

The resulting recommendation was to stop treating large monolithic models as the only path to better performance. Crawford argued for four things: modular, role-based systems; smaller models trained for specific behaviors such as tool use; RL combined with structured environments to enforce correctness; and richer benchmarks that reflect real-world complexity.
