Final GPQA Diamond history across Claude models, with run-capture status for missing benchmark data.
Why this exists: model providers sometimes ship regressions. Anthropic has acknowledged doing so, and Stanford measured it in GPT-4. The only way to know for sure is to keep rerunning the same benchmark against the same models and keep the capture history visible.
This dashboard tracks Claude models on GPQA Diamond (50 expert-level science questions per run) via the Claude Code CLI, using a regular Claude subscription with no API keys needed.
```shell
# Requires: Claude Code CLI (npm i -g @anthropic-ai/claude-code)
uv pip install eval-claude

# Run 50 GPQA Diamond questions with Sonnet
uvx --from eval-claude inspect eval inspect_evals/gpqa_diamond \
  --model claude-code/claude-sonnet-4-6 \
  --limit 50 \
  -T epochs=1 \
  -M max_connections=10
```
Cost estimates are computed from Anthropic API pricing via LiteLLM. Actual runs bill against a Claude subscription, so the Est. Cost column reflects equivalent API cost, not money spent.
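As an illustration, converting token counts to an estimated API-equivalent cost might look like the sketch below. The per-token rates and the `estimate_cost` helper are assumptions for illustration, not part of eval-claude or LiteLLM; check current Anthropic pricing before relying on any numbers.

```python
# Sketch: estimated API-equivalent cost from token counts.
# The rates below are ASSUMED placeholders, not authoritative pricing;
# LiteLLM maintains up-to-date per-model cost tables.
INPUT_RATE_PER_TOKEN = 3.00 / 1_000_000    # assumed USD per input token
OUTPUT_RATE_PER_TOKEN = 15.00 / 1_000_000  # assumed USD per output token

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated cost in USD for one eval run."""
    return (input_tokens * INPUT_RATE_PER_TOKEN
            + output_tokens * OUTPUT_RATE_PER_TOKEN)

print(f"${estimate_cost(1_000_000, 100_000):.2f}")  # → $4.50 at the assumed rates
```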
| Model | Accuracy (95% CI) | Status | Tokens | Est. Cost | Duration | Date |
|---|---|---|---|---|---|---|
| Run | Started | Workflow | Capture | Rows | Samples | Artifacts | Notes |
|---|---|---|---|---|---|---|---|
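For context on the Accuracy (95% CI) column: with only 50 questions per run, the interval is necessarily wide. Below is a minimal sketch using the Wilson score interval, one common choice for a binomial proportion; the dashboard's actual CI method may differ.

```python
from math import sqrt

def wilson_ci(correct: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 → ~95%)."""
    p = correct / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

# Hypothetical run: 40 of 50 correct gives roughly a 67%-89% interval.
low, high = wilson_ci(40, 50)
print(f"80.0% (CI {low:.1%}-{high:.1%})")
```

A 20-point-wide interval on a single 50-question run is why the history matters more than any one row.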