GPQA Diamond benchmark tracking across Claude models, updated twice daily.
Why this exists: model providers sometimes ship regressions. Anthropic has acknowledged as much, and Stanford researchers measured it in GPT-4. The only way to know for sure is to keep running the same benchmark on the same models, automatically.
This dashboard tracks Claude models on GPQA Diamond (50 expert-level science questions per run) via the Claude Code CLI, using a regular Claude subscription with no API keys needed.
```sh
# Requires: Claude Code CLI (npm i -g @anthropic-ai/claude-code)
uv pip install eval-claude

# Run 50 GPQA Diamond questions with Sonnet
uvx --from eval-claude inspect eval inspect_evals/gpqa_diamond \
  --model claude-code/claude-sonnet-4-6 \
  --limit 50 \
  -T epochs=1 \
  -M max_connections=10
```
Cost estimates are based on Anthropic API pricing via LiteLLM; actual runs are billed through a Claude subscription.
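The cost column is a simple function of token counts and per-token list prices. A minimal sketch, assuming illustrative Sonnet-class rates of $3 / $15 per million input / output tokens (the dashboard itself pulls current prices from LiteLLM's pricing tables, so treat these numbers and the `estimate_cost` helper as hypothetical):

```python
# Assumed $/1M-token (input, output) list prices -- illustrative only.
PRICING = {
    "claude-sonnet": (3.00, 15.00),
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost of a run from its token counts."""
    in_rate, out_rate = PRICING[model]
    return input_tokens / 1e6 * in_rate + output_tokens / 1e6 * out_rate

# e.g. a run that consumed 400k input and 60k output tokens:
print(f"${estimate_cost('claude-sonnet', 400_000, 60_000):.2f}")  # → $2.10
```

Because a subscription flat-rates the actual usage, the dollar figure is only a comparison metric across runs, not a bill.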
| Model | Accuracy (95% CI) | Status | Tokens | Est. Cost (USD) | Duration | Date |
|---|---|---|---|---|---|---|
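With only 50 questions per run, the confidence interval matters more than the point estimate. One standard way to compute the 95% CI for an accuracy column is a Wilson score interval; the sketch below assumes that method (the dashboard's exact procedure isn't specified here):

```python
import math

def wilson_ci(correct: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for accuracy = correct / n."""
    p = correct / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

# e.g. 35/50 correct: the interval spans roughly 56%-81%,
# so single-run swings of a few points are expected noise.
lo, hi = wilson_ci(35, 50)
print(f"70% (95% CI {lo:.0%}-{hi:.0%})")
```

This is why a genuine regression only shows up as a sustained shift across many runs, not as one bad score.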