Scorers
Scorers run after the agent completes and produce Score objects with metrics.
Verification
Runs test commands in the sandbox and reports pass/fail.
inspect_coco.scorers.verification
Verification scorer: runs a test command in the sandbox and reports pass/fail.
verification(test_cmd='bash /workspace/tests/test.sh', timeout=300)
Score by running a test command in the sandbox.
Executes the test command inside the Docker sandbox after the agent completes. An exit code of 0 scores as a pass (1.0); any non-zero exit code scores as a fail (0.0).
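The exit-code rule above can be sketched in plain Python. This is an illustration only, not the code in `verification.py`: it runs the command with a local `subprocess` call instead of the Docker sandbox, the helper names `score_from_exit_code` and `run_and_score` are hypothetical, and treating a timeout as a fail is an assumption the source does not state.

```python
import subprocess
import sys


def score_from_exit_code(returncode: int) -> float:
    """Map a process exit code to a score: 0 is a pass (1.0), anything else a fail (0.0)."""
    return 1.0 if returncode == 0 else 0.0


def run_and_score(cmd: list[str], timeout: int = 300) -> float:
    """Run `cmd` locally (a stand-in for the sandbox) and score its exit code."""
    try:
        result = subprocess.run(cmd, capture_output=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        # Assumption: a command that exceeds the timeout counts as a fail.
        return 0.0
    return score_from_exit_code(result.returncode)


if __name__ == "__main__":
    # Two tiny commands that exit with known codes.
    print(run_and_score([sys.executable, "-c", "raise SystemExit(0)"]))
    print(run_and_score([sys.executable, "-c", "raise SystemExit(3)"]))
```

Keeping the exit-code mapping in its own function makes the pass/fail rule easy to check without running any command.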
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `test_cmd` | `str` | Shell command to run in the sandbox. | `'bash /workspace/tests/test.sh'` |
| `timeout` | `int` | Maximum seconds for test execution. | `300` |
Source code in src/inspect_coco/scorers/verification.py
IDD Quality
Reports instruction quality as a score in the eval summary.
inspect_coco.scorers.idd_quality
IDD quality scorer: surfaces the instruction quality score in eval results.
idd_quality(instruction, threshold=0.6)
Score the IDD quality of the instruction.
This scorer runs once per sample and reports the IDD quality score. It does not depend on sandbox execution.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `instruction` | `str` | The raw instruction text to score. | required |
| `threshold` | `float` | Pass threshold for IDD score. | `0.6` |
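As a rough illustration of how a pass threshold like the one in the table might be applied, the sketch below compares a score against `threshold`. The helper name `passes_threshold` is hypothetical, and the inclusive comparison (`>=`) is an assumption; the source does not say whether a score exactly equal to the threshold passes.

```python
def passes_threshold(idd_score: float, threshold: float = 0.6) -> bool:
    """Return True when the IDD score meets the pass threshold.

    Assumption: the comparison is inclusive, so a score equal to the
    threshold counts as a pass.
    """
    return idd_score >= threshold


if __name__ == "__main__":
    for value in (0.45, 0.6, 0.82):
        print(value, passes_threshold(value))
```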
Source code in src/inspect_coco/scorers/idd_quality.py
idd_score()
Average IDD quality score across samples.
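Averaging across samples can be sketched as a plain arithmetic mean. This is not the `idd_score()` implementation: the function name `mean_idd_score` is hypothetical, it takes a bare list of floats where the real metric would receive score objects, and returning 0.0 for an empty list is an assumption.

```python
from statistics import fmean
from typing import Sequence


def mean_idd_score(sample_scores: Sequence[float]) -> float:
    """Arithmetic mean of per-sample IDD quality scores."""
    if not sample_scores:
        # Assumption: with no samples, report 0.0 instead of raising.
        return 0.0
    return fmean(sample_scores)


if __name__ == "__main__":
    print(mean_idd_score([0.5, 0.7, 0.9]))
```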