Did They Nerf Claude?

The full GPQA Diamond history across Claude models, with run-capture status shown for missing benchmark data.

Why this exists: model providers sometimes ship regressions. Anthropic has acknowledged it, and Stanford measured it in GPT-4. The only way to know for sure is to keep rerunning the same benchmark on the same models and keep the capture history visible.

This dashboard tracks Claude models on GPQA Diamond (50 expert-level science questions per run) via the Claude Code CLI, using a regular Claude subscription with no API keys needed.

Run it yourself with your Claude subscription
```sh
# Requires: Claude Code CLI (npm i -g @anthropic-ai/claude-code)
uv pip install eval-claude

# Run 50 GPQA Diamond questions with Sonnet
uvx --from eval-claude inspect eval inspect_evals/gpqa_diamond \
  --model claude-code/claude-sonnet-4-6 \
  --limit 50 \
  -T epochs=1 \
  -M max_connections=10
```

Charts: Accuracy · Token Usage · Estimated Cost (API pricing)

Cost estimates based on Anthropic API pricing via LiteLLM. Actual runs use a Claude subscription.
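The cost estimate itself is straightforward arithmetic: token counts times per-million-token prices. A minimal sketch with assumed prices (the real figures come from LiteLLM's pricing tables and may differ; both the price values and the pricing dict here are illustrative assumptions):

```python
# Assumed USD prices per million tokens -- illustrative only; check
# Anthropic's current pricing (or LiteLLM's tables) before relying on them.
ASSUMED_PRICING = {
    "claude-sonnet-4-6": {"input": 3.00, "output": 15.00},
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """API-equivalent cost in USD for one run's token usage."""
    p = ASSUMED_PRICING[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# A run that consumed 500k input and 120k output tokens
print(f"${estimate_cost('claude-sonnet-4-6', 500_000, 120_000):.2f}")
```

Since runs go through a Claude subscription, this number is what the same tokens *would* have cost via the API, not money actually spent.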

| Model | Accuracy (95% CI) | Status | Tokens | Est. Cost | Duration | Date |
|---|---|---|---|---|---|---|
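With only 50 questions per run, the confidence interval matters more than the point estimate. The 95% CI can be reproduced from the raw correct/total counts; here is a minimal sketch using the Wilson score interval (an assumption on my part; the dashboard may compute its interval differently):

```python
import math

def wilson_ci(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial accuracy estimate."""
    if total == 0:
        return (0.0, 0.0)
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2))
    return (center - half, center + half)

# e.g. 30/50 correct on one GPQA Diamond run
lo, hi = wilson_ci(30, 50)
print(f"60.0% (95% CI {lo:.1%}-{hi:.1%})")
```

At n=50 the interval spans roughly 25 percentage points, which is why single-run drops should be read against the CI rather than as regressions on their own.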

Run capture history

| Run | Started | Workflow | Capture | Rows | Samples | Artifacts | Notes |
|---|---|---|---|---|---|---|---|