Did They Nerf Claude?

GPQA Diamond benchmark tracking across Claude models, updated twice daily.

Why this exists: model providers sometimes ship regressions. Anthropic has acknowledged one in Claude, and Stanford researchers measured GPT-4's behavior drifting over time. The only way to know for sure is to keep running the same benchmark against the same models, automatically.

This dashboard tracks Claude models on GPQA Diamond (50 expert-level science questions per run) via the Claude Code CLI — using a regular Claude subscription, no API keys needed.

Run it yourself with your Claude subscription
# Requires: Claude Code CLI (npm i -g @anthropic-ai/claude-code)
uv pip install eval-claude

# Run 50 GPQA Diamond questions with Sonnet
uvx --from eval-claude inspect eval inspect_evals/gpqa_diamond \
  --model claude-code/claude-sonnet-4-6 \
  --limit 50 \
  -T epochs=1 \
  -M max_connections=10

Accuracy

Token Usage

Estimated Cost (API pricing)

Cost estimates are based on Anthropic API pricing, via LiteLLM's pricing data. Actual runs use a Claude subscription, not the API.
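The estimate itself is just token counts multiplied by per-token prices. A minimal sketch, with illustrative per-million-token prices as a stand-in for the real rates that LiteLLM's pricing table provides:

```python
# Estimate the API-equivalent cost of a benchmark run from its token counts.
# These prices are illustrative assumptions, not Anthropic's actual rates;
# the dashboard looks up real per-token prices via LiteLLM.
PRICE_PER_MTOK = {"input": 3.00, "output": 15.00}  # assumed USD per million tokens

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """API-pricing cost estimate in USD for one run."""
    return (input_tokens * PRICE_PER_MTOK["input"]
            + output_tokens * PRICE_PER_MTOK["output"]) / 1_000_000

# e.g. a run that consumed 400k input and 60k output tokens:
print(f"${estimate_cost(400_000, 60_000):.2f}")  # $2.10 at the assumed rates
```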

Model | Accuracy (95% CI) | Status | Tokens | Est. Cost | Duration | Date
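With only 50 questions per run, the confidence interval in the accuracy column matters: a single score is compatible with a wide range of true accuracies. A sketch of how such an interval can be computed, using the Wilson score method (an assumption; the dashboard may use a different interval):

```python
import math

def wilson_ci(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (95% by default)."""
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    margin = (z / denom) * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2))
    return center - margin, center + margin

# e.g. 38/50 correct (76%): the 95% interval spans roughly 63% to 86%,
# so run-to-run swings of several points are expected noise at n=50.
lo, hi = wilson_ci(38, 50)
print(f"95% CI: [{lo:.1%}, {hi:.1%}]")
```

This is why the dashboard tracks the trend across many runs rather than reacting to any single score.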