BlueFin

Name: BlueFin
Creator: Meridian

n = 131 tasks · 3,225 criteria from building, understanding, and editing financial spreadsheets

Frontier models, graded the way analysts are: on financial models that must survive a deal process. Updated as new models ship.


Last updated 06 / 09 / 26 · LM-judge graded · α = 0.826 vs. expert consensus

=RANK(C1,$C$1:$C$5,0)

Held-out set · n=120 tasks

Models ranked by overall score, highest first

GPT-5.5 · OpenAI

Overall 49.6 · Synthesis (n=9) 59.9 · Manipulation (n=75) 48.2 · Interrogation (n=36) 50.0 · Cost per task $8.85

Formula Correctness 62.7 · Model Integration 48.8 · Output Validation 43.3 · Perturbation 37.3 · Presentation 66.9 · Pitfalls 20.9

Strongest: Presentation 66.9 · Weakest: Pitfalls 20.9. Code-driven writes are harder to self-verify. Pitfalls is a penalty section: a higher pass rate means fewer detected errors.

Opus 4.7 · Anthropic

Overall 49.2 · Synthesis (n=9) 66.7 · Manipulation (n=75) 49.4 · Interrogation (n=36) 44.4 · Cost per task $49.21

Formula Correctness 59.5 · Model Integration 53.6 · Output Validation 48.1 · Perturbation 32.9 · Presentation 64.6 · Pitfalls 26.1

Strongest: Presentation 64.6 · Weakest: Pitfalls 26.1. Best synthesis builder in the field.

Gemini 3.1 Pro · Google

Overall 45.9 · Synthesis (n=9) 50.3 · Manipulation (n=75) 46.1 · Interrogation (n=36) 44.4 · Cost per task $5.78

Formula Correctness 68.2 · Model Integration 61.1 · Output Validation 42.9 · Perturbation 33.5 · Presentation 54.8 · Pitfalls 45.3

Strongest: Formula Correctness 68.2 · Weakest: Perturbation 33.5. Fewest error-penalty triggers among frontier labs.

Sonnet 4.6 · Anthropic

Overall 42.9 · Synthesis (n=9) 53.2 · Manipulation (n=75) 43.0 · Interrogation (n=36) 40.3 · Cost per task $21.00

Formula Correctness 60.8 · Model Integration 58.3 · Output Validation 43.2 · Perturbation 24.4 · Presentation 61.5 · Pitfalls 40.9

Strongest: Presentation 61.5 · Weakest: Perturbation 24.4. Returns read-only trajectories on ~8% of tasks.

Grok 4.20 · xAI

Overall 31.3 · Synthesis (n=9) 46.9 · Manipulation (n=75) 27.8 · Interrogation (n=36) 34.7 · Cost per task $4.10

Formula Correctness 48.6 · Model Integration 50.0 · Output Validation 20.3 · Perturbation 14.8 · Presentation 38.1 · Pitfalls 48.8

Strongest: Pitfalls 48.8 · Weakest: Perturbation 14.8. Produces less content overall, so fewer pitfall triggers.
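
The page does not say how the first score listed for each model (for example, 49.6 for GPT-5.5) is derived from the three category scores. One reading that fits the published numbers is a task-weighted mean over the held-out set (9 + 75 + 36 = 120 tasks). The Python sketch below checks that assumption against the figures shown above. The names `TASK_COUNTS`, `LEADERBOARD`, and `task_weighted_mean` are illustrative and are not part of any BlueFin harness.

```python
# Assumption: the headline score is a task-weighted mean of the category scores.
# Task counts per category, as labeled on the leaderboard (held-out set, n=120).
TASK_COUNTS = {"Synthesis": 9, "Manipulation": 75, "Interrogation": 36}

# Category scores and the published headline score, copied from the leaderboard.
LEADERBOARD = {
    "GPT-5.5":        {"Synthesis": 59.9, "Manipulation": 48.2, "Interrogation": 50.0, "Published": 49.6},
    "Opus 4.7":       {"Synthesis": 66.7, "Manipulation": 49.4, "Interrogation": 44.4, "Published": 49.2},
    "Gemini 3.1 Pro": {"Synthesis": 50.3, "Manipulation": 46.1, "Interrogation": 44.4, "Published": 45.9},
    "Sonnet 4.6":     {"Synthesis": 53.2, "Manipulation": 43.0, "Interrogation": 40.3, "Published": 42.9},
    "Grok 4.20":      {"Synthesis": 46.9, "Manipulation": 27.8, "Interrogation": 34.7, "Published": 31.3},
}


def task_weighted_mean(scores, counts=TASK_COUNTS):
    """Average the category scores, weighting each by its number of tasks."""
    total_tasks = sum(counts.values())
    weighted_sum = sum(scores[category] * n for category, n in counts.items())
    return weighted_sum / total_tasks


if __name__ == "__main__":
    for model, row in LEADERBOARD.items():
        recomputed = task_weighted_mean(row)
        gap = abs(recomputed - row["Published"])
        print(f"{model:<15} published {row['Published']:.1f}  recomputed {recomputed:.2f}  gap {gap:.2f}")
```

Under this assumption, each recomputed value lands within 0.1 points of the published score. Because the category scores are themselves rounded to one decimal, a small gap is expected even if the weighting is exactly right, so the match is suggestive rather than conclusive.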
