Did They Nerf Claude?

GPQA Diamond benchmark tracking across Claude models, updated twice daily. Source on GitHub.

Why this exists: Model providers sometimes ship regressions. Anthropic has acknowledged it, and Stanford researchers measured it in GPT-4. The only way to know for sure is to keep running the same benchmark on the same models, automatically.

This dashboard tracks Claude models on GPQA Diamond (50 expert-level science questions per run) via the Claude Code CLI — using a regular Claude subscription, no API keys needed.

Note on the data: 27 rows have been removed from this history to keep the experiment consistent. Six early runs used 200 samples instead of 50 (a different config), and 21 rows were affected by Claude subscription weekly usage-cap hits or partial backend failures in which generations did not complete. Only completed n=50, 1-epoch runs remain. The trend is still informative, but not exhaustive.
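The filtering rule above (completed runs only, 50 samples, 1 epoch) can be sketched as a simple predicate over the run history. The field names and example rows here are illustrative assumptions, not the dashboard's actual schema:

```python
# Hypothetical run-history filter: keep only completed runs with the
# standard config (50 samples, 1 epoch). Field names are illustrative.

def keep_run(run: dict) -> bool:
    """True if the run matches the consistent-experiment config."""
    return (
        run.get("status") == "completed"  # drop usage-cap hits / backend failures
        and run.get("samples") == 50      # drop the early 200-sample runs
        and run.get("epochs") == 1
    )

history = [
    {"status": "completed", "samples": 50,  "epochs": 1, "accuracy": 0.62},
    {"status": "completed", "samples": 200, "epochs": 1, "accuracy": 0.60},
    {"status": "failed",    "samples": 50,  "epochs": 1, "accuracy": None},
]
clean = [r for r in history if keep_run(r)]  # only the first row survives
```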

Run it yourself with your Claude subscription
# Requires: Claude Code CLI (npm i -g @anthropic-ai/claude-code)
uv pip install eval-claude

# Run 50 GPQA Diamond questions with Sonnet
uvx --from eval-claude inspect eval inspect_evals/gpqa_diamond \
  --model claude-code/claude-sonnet-4-6 \
  --limit 50 \
  -T epochs=1 \
  -M max_connections=10
[Dashboard charts: Accuracy, Token Usage, and Estimated Cost (API pricing), with a smoothing control.]
Cost estimates are based on Anthropic API pricing via LiteLLM. Actual runs use a Claude subscription.

Run history table columns: Model, Accuracy (95% CI), Status, Tokens, Est. Cost, Duration, Date.
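The table reports accuracy with a 95% confidence interval. With only 50 questions per run that interval is wide, which is worth keeping in mind when reading the trend. Here is how such an interval might be computed with the normal (Wald) approximation to the binomial; this is a generic sketch, not necessarily the dashboard's exact method:

```python
# 95% CI on accuracy for an n-question run, normal approximation
# to the binomial. Generic sketch, not the dashboard's exact method.
import math

def accuracy_ci(correct: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wald interval for the true accuracy, clipped to [0, 1]."""
    p = correct / n
    half = z * math.sqrt(p * (1 - p) / n)
    return max(0.0, p - half), min(1.0, p + half)

lo, hi = accuracy_ci(31, 50)  # 62% observed accuracy on 50 questions
# interval spans roughly 0.49 to 0.75 -- wide, as expected at n=50
```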