Orply.

Financial AI Moves From Chatbots To Governed Workflow Infrastructure

Applied AIMonday, June 8, 20261h 3m to watch16 min read

OpenAI finance, LSEG, Erste, Allica, Codex, and Arize all point to the same shift: AI is becoming more consequential as it is embedded into regulated workflows rather than used as a sidecar. The practical burden moves to trusted data, permissions, human review, telemetry, evaluation, and governance that can make delegated work usable in production.

AI is becoming workflow infrastructure, not a sidecar

The financial-services AI discussion has moved past the first adoption question. The consequential question is no longer whether a bank, market-infrastructure provider, or finance function can find uses for ChatGPT, Codex, agents, or model APIs. It is what has to be rebuilt around those systems before they can become part of production work.

Across the financial-services examples, the pattern is consistent: AI becomes more consequential as it moves closer to the workflow. In OpenAI’s finance organization, Stacie Faggioli describes a finance function redesigned around AI-native workflows, embedded engineering, and human-reviewed automation. At LSEG, Emily Prince argues that frontier AI only becomes useful when it is grounded in trusted data, deterministic financial models, evaluation frameworks, and governance. At Erste and Allica, bank executives describe adoption not as a chatbot rollout but as an operating-model decision: which work can be decentralized, which customer-data systems must be governed centrally, how product teams should be structured, and where agents can handle messy work that traditional software does not handle well.

The common thread is that “agentic” work inside finance is not just about an AI system taking actions. It is about the surrounding stack that makes those actions usable in a regulated environment: trusted inputs, permissions, review points, audit trails, domain-specific evaluation, security controls, and a clear answer to who is accountable when the system acts.

OpenAI’s internal finance case is the sharpest version of the speed argument. Faggioli says a PwC assessment found OpenAI’s finance team was about one-fifth the size of comparable technology peers, and she attributes that leverage to an operating model rather than a single tool. Finance experts use ChatGPT, ChatGPT for Excel, Codex, custom agents, connectors, and MCPs, while engineers sit inside finance through an Enterprise FinTech pillar instead of waiting outside as a distant IT function.

20%
OpenAI finance team size relative to comparable technology peers, according to Faggioli’s description of a PwC assessment

That number is useful less as a benchmark than as a forcing function. If a finance team can run with materially fewer people relative to output, then the institution has not merely adopted software. It has changed its production system. Faggioli’s examples make that point concrete: investor-relations agents draft diligence responses; ChatGPT for Excel builds inspectable models; Codex turns marketing data into dashboards and recommendations; agents deflect procurement questions, create credit-risk scores, flag contract terms, and compile vendor-risk reports.

The same examples also show why finance cannot treat outputs as casual suggestions. The investor-relations agent is instructed to use only publicly shared information when answering investor questions. Excel outputs remain inspectable through formulas and assumptions. Finance staff still review compute-margin slides, QA checks, and data that must pass the team’s “sniff test.” The contract-review agent flags issues, but accounting and deal-desk consequences still matter.

Placed beside Prince’s, Lee Spacagna’s, Maurizio Poletto’s, Ravneet Shah’s, Conor Spicer’s, and Dat Ngo’s arguments, Faggioli’s account points to a practical frontier in finance that is broader than one model or one assistant: a governed workflow stack.

That distinction matters for skeptical operators. If AI is a sidecar, adoption can be measured by licenses, usage, and scattered productivity gains. If AI is workflow infrastructure, adoption has to be measured by whether core work changes safely: fewer handoffs, faster decisions, better analyst preparation, reviewed code, controlled customer-data access, and observable agent behavior.

OpenAI Finance Runs at 20% of Peer Headcount With AI-Native WorkflowsOpenAI

Trusted context is the foundation, not an enhancement

LSEG’s version of the argument begins with a constraint that applies across finance: a model’s usefulness depends on the information and controls it can work with. Prince frames LSEG’s AI strategy around trusted financial information, not generic answers. The company’s value, as she describes it, sits in data, pricing, market models, risk models, clearing, exchange infrastructure, and the services that regulated financial work already depends on.

That is why “LSEG Everywhere” is not just a distribution slogan. Prince describes it as an attempt to put LSEG’s information into the tools where customers and employees already work, including ChatGPT-style environments, APIs, and Model Context Protocol. In her account, the institution should not have to choose between modern AI interfaces and the trusted data layer that financial decisions require.

33+ petabytes
of data in LSEG’s data and analytics business, according to Prince

The number sharpens the point. A general-purpose model can produce fluent answers, but LSEG’s argument is that financial-services work needs current, high-quality, workflow-specific context. Analysts, portfolio managers, bankers, finance teams, and product teams are not just asking for prose. They need data provenance, accepted models, policy context, and outputs that fit the standards of their role.

Prince’s most useful distinction is between AI as an interface over generic intelligence and AI as an interface over the actual resources a job requires. LSEG’s “no regret” investments, in her framing, are APIs and MCP because they determine whether AI tools can sit on top of trusted information without forcing each team through a long bespoke onboarding process. The model can change; the institution still needs a harness around models and applications so different tools can work over trusted data consistently.

That creates a different view of competitive advantage. In Prince’s account, the model is not the entire product. In regulated financial services, the scarce asset is often the controlled delivery of trusted context into workflows. LSEG’s approach is one version of a wider pattern also visible in the bank and agent examples: the data layer and control layer become strategic infrastructure when AI systems are delegated meaningful work and their inputs, permissions, and evaluation criteria have to be reliable enough for the task.

LayerWhat the financial-services examples require
Trusted contextFinancial data, internal documents, customer records, market models, policies, and source provenance that the workflow can rely on
Access and permissionsConnections to tools such as email, calendars, SharePoint, Teams, Salesforce, CRMs, ERPs, portals, and systems of record
Workflow integrationScheduled runs, role-specific skills, output into team channels, dashboards, spreadsheets, pull requests, and approval systems
Human reviewInspection of formulas, diligence answers, code changes, compliance drafts, risk scores, and customer-facing actions
Evaluation and telemetryTraces, spans, sessions, deterministic checks, LLM judges, golden datasets, human feedback, and business metrics
GovernanceRules for customer data, model choice, centralized versus local build rights, auditability, security, and regulatory evidence
The emerging workflow stack described across the financial-services cases

Prince is careful not to present AI as a cure for long-running finance problems. Some frontier capabilities, she says, do not apply directly until teams “go that extra mile” to make them useful inside financial-services constraints. That phrase matters because it separates the demo from production. A spreadsheet filled generatively with trusted LSEG content may save hours, but only if the system preserves source quality and institutional standards. An analyst may use more structured and unstructured data, but only if the data is reliable enough to expand the work rather than pollute it.

Evaluation frameworks are the connective tissue. Prince says she now talks about evaluation frameworks multiple times a day because scaling AI requires defining success for many persona groups: finance teams, marketing teams, product teams, engineering teams, investment bankers, portfolio managers, and research analysts. The evaluation question changes by role because the work changes by role. That is the bridge from trusted data to governed workflow: a system has to know not only what data it can access, but what “good” means in the context where the output will be used.

LSEG Grounds AI Strategy in Trusted Financial Data and ControlsOpenAI

Role-specific agents move AI from prompting to delegated work

The next layer above trusted data is workflow delegation. Spacagna’s account of ChatGPT workspace agents makes the operational distinction clear: a chatbot answers a user inside a conversation; a role-specific agent is configured to do recurring work across the systems where a team already operates.

The Chief of Staff example is legible because it is ordinary. The agent is scheduled to run at 9:00 AM, inspect calendar, email, chat, notes, and documents, prepare a concise daily brief, and post it to a Microsoft Teams channel after permission is granted. It can be extended with SharePoint for company information and shared notes, Salesforce for CRM context, and skills that encode team conventions for meeting preparation.

9:00 AM
scheduled daily run time used for the Chief of Staff agent in the demonstration

The value in that example is not simply that the agent can reason. It is that it has access, scheduling, instructions, skills, permissions, and a governed output channel. The agent knows where to look, when to run, what format to use, and where to place the result. It also asks for approval before posting into Teams on first run. Those details are the difference between a clever assistant and a piece of workflow infrastructure.

Spacagna’s broader financial-services examples show why generic assistants are unlikely to be the final form. He lists agents for KYC onboarding analysts, AML investigations analysts, commercial-banking relationship-manager associates, portfolio-performance analysts, claims service representatives, third-party risk analysts, and regulatory-reporting analysts. Those jobs do not merely require summarization. They require role-specific context, repeatable team practices, connections to internal systems, and controls over what the agent can do with the information it sees.

That is where the management problem appears. If one agent is useful, a large institution will want many. If many agents exist, the institution needs a control plane: a way to manage execution, context, evaluation, optimization, systems-of-record connections, and permissions across thousands of delegated workflows. Spacagna presents Frontier as that platform layer, sitting beneath ChatGPT Enterprise, Codex, business applications, customer-built agents, OpenAI agents, and third-party agents.

This connects directly back to LSEG’s data argument. Agents need trusted context; trusted context has to be delivered into the workflows where agents and employees operate. A KYC analyst agent is only as useful as its access to customer data, onboarding documents, policy requirements, risk flags, case histories, and review processes. An AML investigations agent has a different context and a different evaluation problem. A portfolio-performance analyst agent needs market data, portfolio holdings, attribution logic, and presentation standards. Treating all of those as “chat” misses the operational layer.

Model Context Protocol appears in several of the examples because it represents one answer to the same delivery problem: how to make institutional context available to AI systems without rebuilding every integration from scratch. But the protocol is not the full answer. Once agents can access systems, the harder questions become who grants permission, how skills are maintained, how outputs are reviewed, how failures are detected, and how the institution retires or updates agents that no longer match policy or practice.

Role-Specific Agents Move AI From Prompting Into Financial Services WorkflowsOpenAI

Banks are splitting fast experimentation from governed customer systems

Erste and Allica show the bank version of the same operating-model problem. Both institutions are adopting AI, but neither presents the work as a simple tool rollout. They are deciding where experimentation belongs, where central control is necessary, how customer trust is protected, and how product-engineering work itself should change.

Poletto’s account of Erste is the cleaner governance story. Erste allows countries to move quickly on employee productivity tools: internal chatbots, document work, policies, working instructions, and other employee-facing uses where local teams can learn without making the group function a bottleneck. Customer-facing AI is treated differently. If AI touches customer data or appears inside the bank’s digital customer environment, Poletto says it has to be centralized and done properly.

The line is not only regulatory. Poletto describes banking as an industry built on trust. Customer data is among the most sensitive information a bank holds, and a poor experience or misuse of data can damage trust in ways that are hard to recover. Erste chose to connect customer data early, even though Poletto calls that the hard way and acknowledges mistakes. His argument is that if customer-data integration is ultimately necessary, the institution may be better off learning that hard problem directly rather than avoiding it through easier public-content use cases.

Erste’s customer-facing work also complicates the assumption that conversational banking is automatically better banking. Poletto says many customers like AI but do not know what to ask it. They may be comfortable using George, Erste’s digital banking app, as a functional tool for balances, transfers, and tasks, while not yet having a mental model for conversing with the bank. Erste has therefore tested both reactive patterns, where the customer initiates the conversation, and proactive patterns, where the bank nudges the customer with prompts that show what kinds of conversations are possible.

20%
of Erste customers Poletto said currently receive financial advice through branch engagement

That statistic reframes the customer opportunity. Poletto’s strategic interest is not conversation for its own sake. He says Erste currently provides financial advice to roughly the customers who come into branches, while the majority use the bank’s tools functionally. AI is useful if it extends useful advice to customers who do not receive it today, not if it merely replaces a well-designed button with a chat bubble.

Allica’s account, from Shah, is more explicitly about operating-model redesign. Shah says the UK SME bank’s AI program has moved beyond isolated use cases toward adoption across the organization and changes in how product teams are structured. Allica moved from roughly 25% adoption last year to a median workday adoption rate of 77%, but Shah distinguishes adoption from strategy. His three layers are broad organizational adoption, product-engineering operating model, and AI in the product layer itself.

77%
median workday AI adoption rate at Allica, according to Shah

The engineering changes are concrete. Allica has moved away from a fixed Spotify-style squad model toward smaller “squadlets,” fewer handoffs, merged roles, and experiments with product engineers who can bridge product and engineering work. Backend, frontend, and SDET roles have been combined in some contexts. Product-owner and product-analyst responsibilities are being merged in others. Shah says product and design colleagues should increasingly be able to ship code to production.

The bank’s lending example shows why agents matter in workflows that traditional software struggles to discipline. Allica receives asset-finance applications through brokers and introducers, often by email, often incomplete, despite having portals. Rather than force brokers into the bank’s ideal channel, Allica uses an agent to inspect incoming information, identify what is missing, and ask brokers for additional information before feeding the application into the portal. Shah says some applications have seen time to decision reduced to less than seven or twelve minutes.

Operating choiceErsteAllica
Fast experimentationLocal country teams can move quickly on employee productivity tools and internal document use casesCompany-wide adoption is pushed across operations, distribution, technology, product, finance, and other functions
Customer-data boundaryCustomer-facing AI that touches customer data is centralized and slowerAgentic workflows operate inside regulated lending processes with harnesses and a mix of deterministic and nondeterministic agents
Customer interfaceConversational banking is tested, but Poletto questions whether conversation is always better than existing app interactionsRelationship managers remain the human interface; AI prepares context and insights behind them
Product-engineering modelAI is built as a platform within the existing digital banking platform, with rebuilds expectedSmaller squadlets, fewer handoffs, merged roles, and more direct production participation from product and design
Success metricBetter access to advice and customer-grade digital experience under banking constraintsUseful product increments for customers and internal functions, not deployment count alone
Erste and Allica describe different versions of the regulated-bank operating split

Allica’s relationship-manager example resists the easy “replace the banker with a bot” narrative. Shah says the bank does not want to replace relationship managers with AI bots. It wants AI to prepare them with better context so customer conversations are higher quality. The human relationship remains the interface; AI improves the preparation layer behind it.

The regulated-bank lesson from Poletto and Shah is practical. Productivity tools can often be decentralized. Customer-facing AI, especially when customer data is involved, often needs central governance. Product-engineering teams may need to shrink, merge roles, and remove handoffs to capture speed. Agents are especially useful where messy, incomplete human behavior prevents traditional software from working cleanly. Adoption metrics matter, but the more serious measure is whether AI produces useful product increments without weakening risk, compliance, security, or customer trust.

Erste Builds AI as a Governed Platform Inside Digital BankingOpenAI
Allica Bank Pushes AI Beyond Use Cases Into Operating ModelOpenAI

AI compresses software delivery, but review becomes the bottleneck

The same operating-model pressure appears inside software delivery. Spicer’s Codex-for-banks example is useful because it does not stop at code generation. It follows a customer-facing feature through requirements, implementation, testing, compliance evidence, legacy portal work, pull-request review, and final human approval.

Spicer frames Codex as an agent that changes the unit of software work. It is not only autocomplete and not only a chat surface for developers. In his description, Codex can inspect a codebase, read documentation, use connected tools, run for hours, modify code, run tests, and prepare surrounding artifacts. Developers still prompt, steer, inspect, and approve. The work shifts from typing each line to supervising larger chunks of agent-executed software work.

The fictional Blossom Bank predictive-budgeting feature makes the chain concrete. Customer demand appears as a request for the bank to move beyond retrospective spending dashboards and warn users before they go over budget. Requirements live in SharePoint under a file called “PMO-Monthly_Forecast.” Codex retrieves the approved context, inspects the mobile app architecture, proposes an implementation plan, and waits for approval before changing code. Once authorized, it updates the app, wires in forecast fixtures, swaps the dashboard card UI, and runs typecheck.

  1. Customer demand
    Users ask Blossom Bank for predictive budgeting rather than another historical spending dashboard.
  2. Approved requirement
    The feature request exists in SharePoint as PMO-Monthly_Forecast, giving Codex approved product context.
  3. Implementation plan
    Codex retrieves the requirement, inspects the app architecture, and proposes how to implement the feature.
  4. Code and tests
    After approval, Codex modifies the mobile app, adds deterministic forecast fixtures, updates the UI, and runs checks.
  5. Compliance draft
    Codex uses evidence and browser automation to populate a legacy change-portal draft without submitting it.
  6. Pull-request review
    Codex participates in review, including identifying a potential sensitive-field handling issue missed by human reviewers.

The compliance-portal step is important because it treats institutional friction as part of the work, not an exception to automation. Spicer shows a legacy-looking change portal with fields such as customer segment and risk rating. Codex can use a skill and browser automation to search code, documentation, and evidence, populate the form, validate it, and save a draft. Spicer says Codex will not simply hit submit. The human checkpoint remains at the final decision.

That human checkpoint is not incidental. Higher code volume creates an institutional problem. If AI helps teams ship more, then security, review, audit trails, compliance workflows, and deployment processes have to scale too. Spicer makes that constraint part of the case: Codex can also review pull requests and identify security issues, such as a possible mishandling of sensitive fields that human reviewers missed. But the broader point is that the review system must absorb the speed created by AI-assisted development.

This is the software-engineering analogue to the agent-control problem. A single developer using Codex may move faster. A bank with many teams using Codex will generate more diffs, more review demand, more compliance evidence, more release coordination, and more questions about which changes are safe to ship. The limiting factor shifts from code production to the institution’s ability to review, secure, integrate, audit, and deploy the increased volume of work.

Spicer’s framing avoids the simplistic replacement story. Developers remain in the loop, but the loop changes. They prompt the agent, validate context, inspect plans, review diffs, steer fixes, and approve final actions. The bank gets leverage only if its process can turn that supervisory mode into production-grade delivery rather than a pile of unreviewed generated code.

Banks Can Use AI Agents to Turn Requirements Into Reviewed FeaturesOpenAI

Telemetry becomes the record of what agents actually did

The final constraint cuts across every other layer. If financial institutions rely on AI agents for analyst work, lending intake, finance operations, software delivery, customer support, regulatory reporting, or internal controls, they need a way to know what the system actually did. Ngo’s observability argument supplies the technical language for why ordinary code review is not enough.

Ngo’s central point is that production AI failures may live in the path, not the part. An agent may call the right tools in the wrong order, loop through a bad branch, lose state across a session, or take different routes across runs. A bad output may not come from a broken component. It may come from the trajectory: tool B was called before tool A even though B depended on A; a session drifted; a branch with high latency handled too much traffic; a multi-agent handoff lost required data.

Traces and spans therefore become the audit record for nondeterministic systems. Ngo says code does not audit agents or harnesses; telemetry does. That matters especially in finance, where it is not enough to know that an answer was wrong. A bank, market-data provider, or finance team needs to know which data was accessed, which tools were called, in what order, with what inputs and outputs, under which permissions, and where the workflow branched.

Observability has to operate at several levels. A trace-and-span view shows one run. A session view shows the back-and-forth state of a user interaction over time. A distributional view shows how an agent behaves across many runs: which paths it takes, where it loops, where latency concentrates, and which branches perform poorly. Dashboards still matter, but they are not the whole record. They sit alongside traces, sessions, trajectories, and path views.

Evaluation then has two dimensions: type and scope. Ngo identifies five evaluation signals: LLM-as-judge, human feedback, golden datasets, deterministic checks, and business metrics. The type of evaluation must fit the question. If the system must produce valid JSON, a deterministic schema check may be better than asking another model to judge it. If the system must match a domain expert’s standard, a golden dataset or human annotation may be needed. If the system is supposed to save time, reduce cost, or increase revenue, business metrics belong in the evaluation loop.

Scope is equally important. A span-level eval can assess one model call or tool output. A multi-span eval can test whether components passed information correctly. A trajectory eval can determine whether the agent followed the right sequence to complete a process. A session eval can ask whether the overall interaction satisfied the user or left questions unresolved.

ScopeWhy it matters for financial-services agents
SpanChecks one component, such as whether a generated field, formula, or extracted document value is correct.
Multi-spanTests whether data moved correctly across handoffs, such as from intake email to lending portal or from requirement document to code change.
TrajectoryEvaluates whether the agent took the right sequence of actions, such as calling dependent tools in the proper order before producing a decision or draft.
SessionAssesses the full stateful interaction, such as whether a customer, analyst, or employee completed the intended job across multiple turns.
Ngo’s evaluation-scope distinction applied to regulated financial workflows

This is where the financial-services examples converge. LSEG’s trusted-data strategy requires evaluation frameworks to ensure AI outputs fit each persona’s work. Role-specific agents need monitoring once they connect to calendars, email, CRM, SharePoint, internal apps, and systems of record. Erste’s customer-facing AI needs governance because customer data and trust are on the line. Allica’s lending agents need harnesses because applications arrive messy and incomplete. Codex-assisted software delivery needs review and compliance evidence because more code creates more institutional risk. OpenAI finance’s agents need human review because investor relations, credit risk, contract terms, accounting, and procurement are not casual outputs.

Ngo’s account pushes that logic one layer deeper: governance cannot rely only on written policy or code inspection when agent behavior is nondeterministic. The institution needs telemetry, evaluation, experiments, human feedback, golden datasets, deterministic checks, and business metrics wired into a loop. Otherwise, teams may know that an output failed without knowing whether the failure came from bad context, wrong tool order, poor retrieval, a model judgment error, an incomplete session, or a workflow branch nobody was watching.

Telemetry, Not Code, Audits Nondeterministic AI AgentsAI Engineer

The frontier, in your inbox tomorrow at 08:00.

Sign up free. Pick the industry Briefs you want. Tomorrow morning, they land. No credit card.

Sign up free