Introducing BlueFin: Benchmarking AI Agents on Financial Models
from building, understanding, and editing financial spreadsheets
TL;DR
- BlueFin benchmarks AI agents on finance’s central artifact: the financial model. 131 expert-authored tasks. 3,225 rubric criteria.
- Few workbooks reach a usable standard. The best single model clears the bar on just 16% of manipulation tasks.
- Even taking the best of five models on each task, that bar is cleared only about a third of the time.
- BlueFin grades what practitioners actually check: integration, auditability, structure, and robustness when assumptions change.
- Models look more correct than they behave. They hold up on static formulas, then break when inputs change. Across the held-out benchmark, no frontier model satisfies even half the weighted rubric.
Few workbooks reach a usable standard
Share of held-out manipulation tasks scoring above 80% on the weighted rubric (n = 75).
Overview
Everyone has an AI plugin for Excel. Anthropic has Claude for Excel, OpenAI has ChatGPT for Excel, and a long list of startups have their own, all vying to do for finance what frontier models have done for software engineering.
Yet evaluation has lagged far behind.
Today, we’re introducing BlueFin, a benchmark designed around the central artifact of professional finance: the financial model.
Where existing benchmarks focus on numerical correctness and superficial layout, BlueFin asks whether an agent can build, extend, and reason over a model the way analysts do.
A financial model is not judged only by the values of its cells.
It has to be integrated across tabs, traceable to authoritative inputs, structured to align with professional standards, and robust to changes in assumptions. Structure isn’t cosmetic: layout and formatting are integral to using the model effectively. Firms and analysts may differ in preferred style, but the convention should be internally coherent. Outputs should be easily traceable to their inputs, which means the entire dependency chain (layer upon layer of hardcodes, links, and formulas) needs to be immediately legible to the next person who opens the model.
BlueFin measures all of this.
The benchmark is composed of 131 expert-authored tasks across three families:
Models struggle most on manipulation, the benchmark’s core
Criteria-weighted score (%) by task family. Non-completions count as zero. n = 120 held-out tasks.
Manipulation is the benchmark’s core: 82 of the 131 tasks ask an agent to build on an existing workbook. Our goal is to represent the day-to-day work of the investment banking and private equity professionals we worked with, who inherit models and must understand, extend, and repair them while preserving downstream logic.
Critically, outputs are not graded solely on bounded, cell-level factuality. They are evaluated against 3,225 fine-grained rubric criteria that test whether the model is integrated, auditable, professionally structured and formatted, and robust under changing deal scenarios and assumptions.
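To make the idea of a criteria-weighted score concrete, here is a minimal sketch in Python. The `Criterion` type, the weights, and the example criteria are hypothetical; the post does not specify how BlueFin assigns weights, so treat this as one plausible reading in which a task's score is the weight of passed criteria divided by the total weight, and a non-completion scores zero.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    description: str  # what a reviewer would check (illustrative wording)
    weight: float     # relative importance (hypothetical values)
    passed: bool      # whether the submitted workbook satisfied it


def weighted_score(criteria, completed=True):
    """Return a criteria-weighted score as a percentage in [0, 100].

    Assumption: a non-completed task scores zero, mirroring the
    "non-completions count as zero" rule stated in the figure caption.
    """
    if not completed or not criteria:
        return 0.0
    total = sum(c.weight for c in criteria)
    if total == 0:
        return 0.0
    earned = sum(c.weight for c in criteria if c.passed)
    return 100.0 * earned / total


# Hypothetical rubric for a single task.
rubric = [
    Criterion("Revenue build links to the assumptions tab", 3.0, True),
    Criterion("Balance sheet balances in every period", 5.0, True),
    Criterion("Outputs update when the case selector changes", 2.0, False),
]

score = weighted_score(rubric)
print(f"{score:.1f}%")
```

Under this reading, a heavily weighted robustness criterion can pull a task below a usability bar even when most cell values are right.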
Where BlueFin Fits
Existing spreadsheet benchmarks answer adjacent questions. SpreadsheetBench v2 evaluates end-to-end spreadsheet workflows against golden outputs. BankerToolBench (BTB) evaluates broader investment-banking workflows spanning spreadsheets, decks, and memos. BlueFin goes deep on one artifact: the financial model itself.
That depth compounds. Compared with BTB’s public golden Excel outputs, BlueFin’s workbooks contain on average approximately 9x more populated cells, 12x more formulas, and 400x more cross-sheet formulas. This lets BlueFin test the dense, multi-tab dependencies that make professional financial models difficult not only to build but also to verify.
But the difference doesn’t stop at workbook complexity. BlueFin evaluates whether a model’s internal dependencies are correct, whether its outputs can be validated, and whether it continues to work when its assumptions change.
BlueFin goes deeper than prior benchmarks
BankerToolBench spreads across decks, memos, and spreadsheets. BlueFin goes deep on one: the financial model itself, where the gap compounds.
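The post does not say how populated cells, formulas, and cross-sheet formulas were counted. As a rough illustration of what a "cross-sheet formula" could mean, the sketch below counts formulas that reference at least one sheet other than their own. The workbook is a hypothetical in-memory dict rather than a real file, and the regular expression is a simplification that ignores string literals and external-workbook references.

```python
import re

# Matches a sheet prefix such as Assumptions! or 'Debt Schedule'!
SHEET_REF = re.compile(r"(?:'((?:[^']|'')+)'|([A-Za-z_][A-Za-z0-9_.]*))!")


def referenced_sheets(formula):
    """Return the set of sheet names that a formula string refers to."""
    names = set()
    for quoted, bare in SHEET_REF.findall(formula):
        # Inside quoted sheet names, a doubled quote stands for one quote.
        names.add(quoted.replace("''", "'") if quoted else bare)
    return names


def count_formulas(workbook):
    """Return (total formulas, formulas that reference another sheet)."""
    total = cross = 0
    for sheet, cells in workbook.items():
        for value in cells.values():
            if isinstance(value, str) and value.startswith("="):
                total += 1
                if referenced_sheets(value) - {sheet}:
                    cross += 1
    return total, cross


# Hypothetical toy workbook: sheet name -> {cell address: value or formula}.
toy_workbook = {
    "Assumptions": {"B4": 0.10, "B5": 0.03},
    "DCF": {
        "C5": "=Assumptions!B4",          # cross-sheet
        "C6": "=C5*(1+Assumptions!B5)",   # cross-sheet
        "C7": "=SUM(C5:C6)",              # same sheet
        "C8": "=DCF!C7*2",                # own sheet named explicitly
    },
    "Debt Schedule": {"D10": "='DCF'!C7-Assumptions!B4"},  # cross-sheet
}

total, cross = count_formulas(toy_workbook)
print(total, cross)
```

A count like this says nothing about whether the links are correct; it only indicates how much multi-tab dependency there is to verify.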
Models look more correct than they behave
We evaluated five frontier models on the 120-task held-out set. No model clears 50% on the weighted rubric: none satisfies even half of what a finance practitioner would check.
The shortfalls cluster in one area. The best models often build formulas that evaluate to the correct values in a vacuum: Formula Correctness and Presentation land in the high-30s to high-60s. But change an assumption (WACC, SOFR, case selector) and the model fails to update correctly. In the rubric, this shows up as a drop to 20–48% in Output Validation and 15–37% in Perturbation. The drop reveals the central failure mode: the workbooks are not robust by construction. Hardcoded intermediates, overwritten formulas, and incorrectly linked rows stay invisible until the case changes.
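A toy example makes the failure mode easier to see. The sketch below is not BlueFin's grader; it is a hypothetical three-year discounted cash flow in which one "workbook" computes its discount factors from WACC and another has them pasted in as values at the base case. All names and numbers are illustrative.

```python
BASE_WACC = 0.10
CASH_FLOWS = [100.0, 110.0, 121.0]  # hypothetical year 1 to 3 cash flows


def reference_npv(wacc):
    """Ground truth: discount each cash flow at the given WACC."""
    return sum(cf / (1 + wacc) ** t for t, cf in enumerate(CASH_FLOWS, start=1))


# Hardcoded intermediates: discount factors frozen at the base-case WACC.
HARDCODED_FACTORS = [1 / (1 + BASE_WACC) ** t for t in (1, 2, 3)]


def linked_workbook(wacc):
    """Discount factors are recomputed from the WACC input."""
    factors = [1 / (1 + wacc) ** t for t in (1, 2, 3)]
    return sum(cf * f for cf, f in zip(CASH_FLOWS, factors))


def hardcoded_workbook(wacc):
    """Ignores the WACC input because the factors were pasted as values."""
    return sum(cf * f for cf, f in zip(CASH_FLOWS, HARDCODED_FACTORS))


def passes(workbook, wacc, tol=1e-6):
    """Check a workbook's output against the reference at a given WACC."""
    return abs(workbook(wacc) - reference_npv(wacc)) <= tol


workbooks = {"linked": linked_workbook, "hardcoded": hardcoded_workbook}

# Static check: both workbooks agree with the reference at the base case.
static_results = {name: passes(wb, BASE_WACC) for name, wb in workbooks.items()}

# Perturbation check: move WACC and see which workbook still agrees.
perturbed_results = {name: passes(wb, 0.12) for name, wb in workbooks.items()}

print(static_results)
print(perturbed_results)
```

Both versions look identical to a static check; only the perturbed run separates them, which is the gap between the static and Perturbation sections described above.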
Strongest on static criteria, weakest when inputs change
Per-section pass rate (%) across the 75 held-out manipulation tasks.
But a rubric score doesn’t directly map to usability. A 50% rubric score isn’t a 50% usable model. When you can’t tell which half is wrong without auditing the whole thing, you may be better off building the entire model from scratch.
The best frontier model scores above 80% on only 12 of 75 (16%) manipulation tasks. Across all five frontier models, only 25 tasks have even one model clearing the bar; on the other 50, not a single model produces a workbook a reviewer would use.
The artifact looks like a spreadsheet; it doesn’t behave like a financial model. That’s the gap BlueFin measures.
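The two headline numbers above are threshold statistics: the share of tasks where a model scores above 80%, and the share where at least one of several models does. A minimal sketch, using hypothetical per-task scores for three made-up models, followed by the arithmetic on the counts reported above:

```python
THRESHOLD = 80.0  # usability bar on the weighted rubric, in percent


def pass_rate(scores, threshold=THRESHOLD):
    """Share of tasks whose score is strictly above the threshold."""
    return sum(s > threshold for s in scores) / len(scores)


def any_model_pass_rate(scores_by_model, threshold=THRESHOLD):
    """Share of tasks where at least one model scores above the threshold."""
    per_task = zip(*scores_by_model.values())
    best = [max(task_scores) for task_scores in per_task]
    return pass_rate(best, threshold)


# Hypothetical toy data: three models, four tasks (not BlueFin results).
toy = {
    "model_a": [85.0, 40.0, 62.0, 10.0],
    "model_b": [70.0, 81.0, 55.0, 0.0],
    "model_c": [30.0, 20.0, 79.9, 0.0],
}

print(pass_rate(toy["model_a"]))   # single-model pass rate
print(any_model_pass_rate(toy))    # best-of-models pass rate

# Counts reported in the post for the 75 held-out manipulation tasks.
best_single = 12 / 75
any_of_five = 25 / 75
print(f"{best_single:.0%}, {any_of_five:.0%}")
```

The best-of-models figure is an upper bound on what any one model achieves, since it assumes a reviewer could pick the right workbook for each task in advance.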
Why this matters
AI benchmark coverage is badly out of proportion with the actual distribution of economically valuable work. Coding is heavily overrepresented relative to the rest of the digital economy. For finance in particular, part of the gap is explained by the dearth of high-quality artifacts from real business environments.
Coding also provides a map for where this goes. Early benchmarks like SWE-bench asked whether an agent could resolve a GitHub issue. SWE-bench Pro pushed further: given a long-horizon, enterprise-grade problem in a real, maintained codebase, could an agent write correct code? And now that agents write a majority of code at many software companies, even that is not enough. Correctness is table stakes. The question has become:
Would a maintainer merge the patch?
Finance is running a compressed version of the same arc. SpreadsheetBench v1 asked whether a model could write a correct formula. v2 raises the bar to the whole workbook: build the multi-sheet model, match the golden answer. BlueFin asks finance’s version of the maintainer question:
Would this model survive a deal process?
Today, the answer is no. Not because the models can’t write formulas, but because they can’t build models that stay correct when the inputs move. The result is akin to sycophancy in chat models: it gives users false confidence. Only here, mistakes carry real dollar costs in high-stakes business settings.
If you’re working on spreadsheet reasoning, post-training for financial workflows, or evaluation methodology in hard-to-verify domains, we’d love for you to build on this.