Meridian · Leaderboards

BlueFin

n = 131 tasks · 3,225 criteria, drawn from building, understanding, and editing financial spreadsheets

Frontier models, graded the way analysts are: on financial models that must survive a deal process. Updated as new models ship.

Last updated 06 / 09 / 26 · LM-judge graded · α = 0.826 vs. expert consensus
Held-out set · n=120 tasks
#
Model
Synthesis (n=9) 59.9 · Manipulation (n=75) 48.2 · Interrogation (n=36) 50.0 · Cost per task $8.85
Formula Correctness 62.7 · Model Integration 48.8 · Output Validation 43.3 · Perturbation 37.3 · Presentation 66.9 · Pitfalls 20.9

Strongest: Presentation 66.9 · Weakest: Pitfalls 20.9. Code-driven writes are harder to self-verify. Pitfalls is a penalty section — a higher pass rate means fewer detected errors.

Synthesis (n=9) 66.7 · Manipulation (n=75) 49.4 · Interrogation (n=36) 44.4 · Cost per task $49.21
Formula Correctness 59.5 · Model Integration 53.6 · Output Validation 48.1 · Perturbation 32.9 · Presentation 64.6 · Pitfalls 26.1

Strongest: Presentation 64.6 · Weakest: Pitfalls 26.1. Best synthesis builder in the field.

Synthesis (n=9) 50.3 · Manipulation (n=75) 46.1 · Interrogation (n=36) 44.4 · Cost per task $5.78
Formula Correctness 68.2 · Model Integration 61.1 · Output Validation 42.9 · Perturbation 33.5 · Presentation 54.8 · Pitfalls 45.3

Strongest: Formula Correctness 68.2 · Weakest: Perturbation 33.5. Fewest error-penalty triggers among frontier labs.

Synthesis (n=9) 53.2 · Manipulation (n=75) 43.0 · Interrogation (n=36) 40.3 · Cost per task $21.00
Formula Correctness 60.8 · Model Integration 58.3 · Output Validation 43.2 · Perturbation 24.4 · Presentation 61.5 · Pitfalls 40.9

Strongest: Presentation 61.5 · Weakest: Perturbation 24.4. Returns read-only trajectories on ~8% of tasks.

Synthesis (n=9) 46.9 · Manipulation (n=75) 27.8 · Interrogation (n=36) 34.7 · Cost per task $4.10
Formula Correctness 48.6 · Model Integration 50.0 · Output Validation 20.3 · Perturbation 14.8 · Presentation 38.1 · Pitfalls 48.8

Strongest: Pitfalls 48.8 · Weakest: Perturbation 14.8. Produces less content overall, so fewer pitfall triggers.
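
The page reports per-category scores with unequal task counts (n=9, n=75, n=36) and treats Pitfalls as a penalty section, but it does not say how, or whether, these are rolled up into one number. The sketch below shows one plausible reading and should not be taken as BlueFin's actual scoring: a task-weighted mean across the three categories, and a penalty pass rate defined as the share of pitfall checks that were not triggered. The names `CategoryScore`, `task_weighted_mean`, and `penalty_pass_rate` are hypothetical, and the pitfall counts in the last call are illustrative, not leaderboard data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CategoryScore:
    name: str
    n_tasks: int
    score: float  # percentage, 0-100


def task_weighted_mean(categories):
    """Average category scores, weighting each by its task count.

    Assumption: the leaderboard does not state an aggregation rule;
    task weighting is just one reasonable choice.
    """
    total_tasks = sum(c.n_tasks for c in categories)
    if total_tasks == 0:
        raise ValueError("no tasks to average over")
    return sum(c.n_tasks * c.score for c in categories) / total_tasks


def penalty_pass_rate(triggered, total):
    """Pass rate for a penalty section, as a percentage.

    Assumption: a pitfall check "passes" when the error it looks for
    is NOT detected, so fewer triggers means a higher pass rate.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= triggered <= total:
        raise ValueError("triggered must be between 0 and total")
    return 100.0 * (total - triggered) / total


# Category scores from the first listed row of the held-out table.
first_row = [
    CategoryScore("Synthesis", 9, 59.9),
    CategoryScore("Manipulation", 75, 48.2),
    CategoryScore("Interrogation", 36, 50.0),
]

print(f"task-weighted mean: {task_weighted_mean(first_row):.1f}")

# Illustrative counts only: 3 of 4 hypothetical pitfall checks triggered.
print(f"penalty pass rate: {penalty_pass_rate(3, 4):.1f}")
```

Because Manipulation holds 75 of the 120 held-out tasks, a task-weighted figure tracks that category closely; an unweighted mean of the three scores would instead give the nine Synthesis tasks the same influence as the other two categories.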
