BlueFin
n = 131 tasks · 3,225 criteria from building, understanding, and editing financial spreadsheets
Frontier models, graded the way analysts are: on financial models that must survive a deal process. Updated as new models ship.
Last updated 06 / 09 / 26 · LM-judge graded · α = 0.826 vs. expert consensus
Strongest: Presentation 66.9 · Weakest: Pitfalls 20.9. Code-driven writes are harder to self-verify. Pitfalls is a penalty section: a higher pass rate means fewer detected errors.
Strongest: Presentation 64.6 · Weakest: Pitfalls 26.1. Best synthesis builder in the field.
Strongest: Formula Correctness 68.2 · Weakest: Perturbation 33.5. Fewest error-penalty triggers among frontier labs.
Strongest: Presentation 61.5 · Weakest: Perturbation 24.4. Returns read-only trajectories on ~8% of tasks.
Strongest: Pitfalls 48.8 · Weakest: Perturbation 14.8. Produces less content overall, so fewer pitfall triggers.