Skip to content

Metrics and Reporting#

inspect-coco reports two scorer results in every eval run. Each scorer produces its own metrics that appear in the eval summary and log viewer.

Scorers#

verification#

Runs the test command (tests/test.sh or a custom test_cmd) inside the Docker sandbox after the agent completes. Exit code 0 means pass, non-zero means fail.

Metric Type Meaning
passed count Number of epochs where the test passed
total count Total number of epochs scored

The pass rate across epochs is your pass@k consistency signal.

Reading pass@k

With epochs: 3:

  • passed=3 total=3 means the agent succeeded every time. Reliable.
  • passed=1 total=3 means flaky behavior. Likely a vague instruction.
  • passed=0 total=3 means broken. Check your test script first.

idd_quality#

Scores the instruction quality using the IDD rubric. This runs once per sample and does not depend on sandbox execution. It reports how well-structured the prompt is before the agent even starts.

Metric Type Meaning
idd_score float (0.0 to 1.0) Average IDD quality score

The score metadata also includes a per-dimension breakdown:

Dimension What it checks
idd_goal Presence and clarity of a Goal section
idd_requirements Presence of intent-based requirements
idd_constraints Presence of scope and safety constraints
idd_output Presence of verifiable success criteria

Interpreting results#

verification idd_quality Diagnosis
3/3 >= 0.8 Healthy eval. Instruction is clear and agent is consistent.
1/3 or 2/3 >= 0.8 Agent issue. Instruction is good but agent behavior varies.
1/3 or 2/3 < 0.6 Instruction issue. Improve the prompt using IDD template.
0/3 any Broken test or unsolvable task. Check test.sh logic first.

The hypothesis

High IDD score predicts high pass@k. Over time, you will see that tasks scoring above 0.8 on IDD pass consistently (3/3), while tasks below 0.5 are flaky (1/3 or 2/3).

Viewing results#

After running inspect-coco run, use Inspect's built-in tools:

# Open the web viewer (serves all logs)
inspect view

# List recent eval logs
inspect log list

# Dump a specific log as JSON
inspect log dump logs/<log-file>.eval

The log viewer shows both scorers side by side with full conversation transcripts, token usage, and timing data.

Configuration#

Control epochs in task.toml:

task.toml
[metadata]
epochs = 3          # number of repetitions for pass@k
idd_threshold = 0.6 # minimum IDD score before warning

Or override via CLI:

inspect-coco run examples/ --epochs 5