Metrics and Reporting#
inspect-coco reports two scorer results in every eval run. Each scorer produces its own metrics that appear in the eval summary and log viewer.
Scorers#
verification#
Runs the test command (tests/test.sh or a custom test_cmd) inside the Docker sandbox after the agent completes. Exit code 0 means pass, non-zero means fail.
| Metric | Type | Meaning |
|---|---|---|
passed |
count | Number of epochs where the test passed |
total |
count | Total number of epochs scored |
The pass rate across epochs is your pass@k consistency signal.
Reading pass@k
With epochs: 3:
passed=3 total=3means the agent succeeded every time. Reliable.passed=1 total=3means flaky behavior. Likely a vague instruction.passed=0 total=3means broken. Check your test script first.
idd_quality#
Scores the instruction quality using the IDD rubric. This runs once per sample and does not depend on sandbox execution. It reports how well-structured the prompt is before the agent even starts.
| Metric | Type | Meaning |
|---|---|---|
idd_score |
float (0.0 to 1.0) | Average IDD quality score |
The score metadata also includes a per-dimension breakdown:
| Dimension | What it checks |
|---|---|
idd_goal |
Presence and clarity of a Goal section |
idd_requirements |
Presence of intent-based requirements |
idd_constraints |
Presence of scope and safety constraints |
idd_output |
Presence of verifiable success criteria |
Interpreting results#
| verification | idd_quality | Diagnosis |
|---|---|---|
| 3/3 | >= 0.8 | Healthy eval. Instruction is clear and agent is consistent. |
| 1/3 or 2/3 | >= 0.8 | Agent issue. Instruction is good but agent behavior varies. |
| 1/3 or 2/3 | < 0.6 | Instruction issue. Improve the prompt using IDD template. |
| 0/3 | any | Broken test or unsolvable task. Check test.sh logic first. |
The hypothesis
High IDD score predicts high pass@k. Over time, you will see that tasks scoring above 0.8 on IDD pass consistently (3/3), while tasks below 0.5 are flaky (1/3 or 2/3).
Viewing results#
After running inspect-coco run, use Inspect's built-in tools:
# Open the web viewer (serves all logs)
inspect view
# List recent eval logs
inspect log list
# Dump a specific log as JSON
inspect log dump logs/<log-file>.eval
The log viewer shows both scorers side by side with full conversation transcripts, token usage, and timing data.
Configuration#
Control epochs in task.toml:
[metadata]
epochs = 3 # number of repetitions for pass@k
idd_threshold = 0.6 # minimum IDD score before warning
Or override via CLI: