Major Chatbots Fail Forum AI Tests on Election News Accuracy
Forum AI CEO Campbell Brown told Bloomberg Technology that major chatbots are failing basic tests on news, elections, and geopolitics because model companies have not prioritized measuring those tasks. Citing Forum AI’s NewsBench study of more than 3,100 prompts across ChatGPT, Gemini, Claude, and Grok, Brown said the systems showed high rates of factual error, ideological bias, and weak sourcing, including reliance on state-run media. Her proposed fix is independent evaluation, rather than AI companies “grading their own homework.”

Forum AI says chatbot news failures are being under-measured
Forum AI tested four major chatbots on news-related questions and found broad failures across accuracy, bias, and sourcing, according to CEO Campbell Brown. The systems were identified on screen as OpenAI’s ChatGPT, Google’s Gemini, Anthropic’s Claude, and xAI’s Grok. Brown said the study, called NewsBench, used “judgment models” trained with senior domain experts, who designed benchmarks across three dimensions: factual accuracy, general bias, and the quality of sources the models relied on.
| Forum AI measure | Reported finding |
|---|---|
| Study scope | 3,100+ prompts across major AI models |
| Election-related chatbot prompts | 90% failure rate |
| Foreign-policy answers | 35% relied on state-run media |
| Basic finance and market questions | 30% factual error rate |
These figures come from Forum AI’s on-screen report card. It listed a 90% failure rate on election-related prompts, state-run media sourcing in 35% of foreign-policy answers, and a 30% factual error rate on basic finance and market questions, all drawn from more than 3,100 prompts across major AI models.
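Forum AI did not describe on air how its judgment models’ grades are rolled up into headline rates, so the following is only a minimal sketch of one way such a tally could work. The `Grade` record, its field names, the rule that an answer fails if it fails any one dimension, and the sample values are all assumptions made for illustration, not Forum AI’s schema, method, or data.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List

# Hypothetical dimension names, mirroring the three dimensions Brown described.
DIMENSIONS = ("accuracy", "bias", "sourcing")


@dataclass(frozen=True)
class Grade:
    """One judged answer. Fields are assumptions, not Forum AI's schema."""
    model: str
    topic: str
    accuracy: bool  # True means the answer passed this dimension
    bias: bool
    sourcing: bool

    def failed(self) -> bool:
        # Assumption: an answer fails overall if it fails any single dimension.
        return not all(getattr(self, d) for d in DIMENSIONS)


def _for_topic(grades: Iterable[Grade], topic: str) -> List[Grade]:
    """Select the graded answers for one topic, rejecting an empty selection."""
    subset = [g for g in grades if g.topic == topic]
    if not subset:
        raise ValueError(f"no graded answers for topic {topic!r}")
    return subset


def failure_rate(grades: Iterable[Grade], topic: str) -> float:
    """Share of answers on a topic that failed at least one dimension."""
    subset = _for_topic(grades, topic)
    return sum(g.failed() for g in subset) / len(subset)


def dimension_failure_rates(grades: Iterable[Grade], topic: str) -> Dict[str, float]:
    """Share of answers on a topic that failed each dimension separately."""
    subset = _for_topic(grades, topic)
    return {
        d: sum(not getattr(g, d) for g in subset) / len(subset)
        for d in DIMENSIONS
    }


# Made-up placeholder grades for two unnamed models; not Forum AI results.
SAMPLE = [
    Grade("model_a", "elections", accuracy=True, bias=False, sourcing=True),
    Grade("model_a", "elections", accuracy=True, bias=True, sourcing=True),
    Grade("model_b", "elections", accuracy=False, bias=False, sourcing=True),
    Grade("model_b", "finance", accuracy=True, bias=True, sourcing=True),
]

if __name__ == "__main__":
    print(failure_rate(SAMPLE, "elections"))
    print(dimension_failure_rates(SAMPLE, "elections"))
```

Under this assumed rule, an overall failure rate can sit well above any single dimension’s rate, because a bias failure or a sourcing failure counts even when the facts are right. Whether Forum AI’s headline rates are computed this way is not stated in the interview.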
Brown’s explanation for the weakness was not that news is impossible for AI systems, but that news has not been a focus of how leading model companies measure themselves. Most AI benchmarks, she said, have emphasized coding, math, and general model capability. That emphasis “makes sense” commercially, because those are the areas where model companies are making money. But the same companies are also marketing chatbots as consumer products, and users are asking them a much broader range of questions.
Brown said political information is going to be important, “certainly in an election year,” and that the models “are not right now where they need to be.”
“You can’t fix something if you haven’t measured it.”
Brown said the likely reason chatbots struggle with news accuracy is that it “hasn’t been a priority.” She expects that to change as consumers demand better answers and, more importantly from the companies’ business perspective, as enterprise customers begin to demand higher accuracy. Being “just okay” on news, politics, and geopolitics, she said, “is not going to cut it.”
Election answers failed on facts and ideological balance
Ed Ludlow pressed Brown on whether there is evidence that people are using chatbots for news in the first place. Brown said usage is increasing, citing unspecified studies showing more people turning to chatbots for news. She framed the issue as practical rather than abstract: people are asking chatbots who their candidates are, what those candidates believe, and who they should vote for.
Forum AI found that about one-third of the election-related questions it asked produced factual errors, Brown said. But the more sweeping problem, in her account, was bias: “all of the chatbots failed on bias.”
Her breakdown distinguished the models sharply. Claude and Gemini gave left-leaning answers on election-related questions 100% of the time, she said. ChatGPT did so 95% of the time. Grok was the only model Brown described as right-leaning, giving right-leaning answers on those questions about 85% of the time.
| Chatbot | Forum AI finding on election-related bias |
|---|---|
| Claude | Left-leaning answers 100% of the time, according to Brown |
| Gemini | Left-leaning answers 100% of the time, according to Brown |
| ChatGPT | Left-leaning answers 95% of the time, according to Brown |
| Grok | Right-leaning answers about 85% of the time, according to Brown |
Brown’s concern was not simply that models occasionally make mistakes. It was that users are not receiving “a straight-up answer or a balanced perspective” on election questions. The study treated source selection as part of that failure. Ludlow introduced the report by saying the chatbots were “routinely serving up Russian and Chinese state media as authoritative sources,” and Forum AI’s on-screen report card said 35% of foreign-policy answers relied on state-run media.
Brown identified one reason for optimism: the models did not all perform identically. She said Gemini handled “a lot of the questions better” than some other models overall. In her account, that variation showed room for improvement. The models can perform differently if companies measure the right failures and optimize against them.
Independent evaluation is Brown’s proposed fix
The evaluation problem, in Brown’s telling, is that model companies are largely evaluating their own systems. An on-screen quote attributed to her read, “The model companies are essentially grading their own homework,” and Brown repeated that point when asked whether frontier labs and companies such as Meta are taking action on news quality.
Brown said she speaks regularly with people at the major AI labs and believes they do care about the problem. She also said they are beginning to approach it with a different way of thinking. But she argued that self-evaluation is insufficient: “there’s not independent evaluation” of how the models perform on news, elections, politics, and geopolitics.
She was careful to say she was not calling for regulation, noting that Forum AI is itself a private company. Her preferred model is an ecosystem of companies and nonprofits conducting independent evaluation and sharing results. In her view, model companies that “lean into this” will ultimately build more trustworthy products.
Brown also pointed to emerging pressure outside the labs. Enterprise demand, she said, is moving toward better independent evaluation. She added that some states are already passing laws requiring independent evaluation, and she expects the issue to become increasingly important.
News accuracy, in Brown’s account, cannot remain secondary while chatbots are marketed as broad consumer products. If people are using them for election questions, then measuring only traditional capability benchmarks leaves a major category of risk underexamined. Forum AI’s position is that the first corrective step is not a promise from the labs, but external measurement of where the systems fail.



