Writing Evals#
How to create eval tasks for your CoCo skills.
Quick path: use the CoCo skill#
The skill walks you through creating a task with guided prompts.
Manual path#
Create a directory with three files: instruction.md, tests/test.sh, and task.toml.
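As a rough sketch, the layout described in the steps below could be created by hand like this. The task name `create-config-file` and the `evals/` parent directory are borrowed from the examples later on this page and are only placeholders; the files are created empty and filled in during Steps 1 to 3.

```shell
#!/bin/bash
set -euo pipefail

# Placeholder names: adjust the parent directory and task name to your repository.
EVALS_DIR="${EVALS_DIR:-evals}"
TASK_NAME="${TASK_NAME:-create-config-file}"
TASK_DIR="$EVALS_DIR/$TASK_NAME"

# Create the task directory and its tests/ subdirectory in one call.
mkdir -p "$TASK_DIR/tests"

# Create empty placeholders for the three files covered in Steps 1-3.
touch "$TASK_DIR/instruction.md" "$TASK_DIR/task.toml" "$TASK_DIR/tests/test.sh"

# Mark the test script executable (harmless if it is invoked through bash).
chmod +x "$TASK_DIR/tests/test.sh"

echo "Created $TASK_DIR"
```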
Step 1: Write instruction.md#
Use the IDD template for consistent, deterministic results:
## Goal
<What should exist after the agent runs>
## Requirements
- <Intent statement 1>
- <Intent statement 2>
## Constraints
- Do not modify files outside /workspace
- <Your constraints>
## Output
Success criteria:
- <Verifiable condition 1>
- <Verifiable condition 2>
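Before handing an instruction to any scorer, it can help to confirm that the four template headings are present. The function below is a hypothetical local check, not part of CoCo or the IDD scorer; it only verifies structure and says nothing about instruction quality.

```shell
#!/bin/bash
set -euo pipefail

# Return 0 only if the file contains all four template headings.
# Hypothetical helper: a structural check only, not the IDD scorer.
has_idd_sections() {
  local file="$1" heading
  for heading in "## Goal" "## Requirements" "## Constraints" "## Output"; do
    # -x matches the whole line, -F treats the heading as a literal string.
    if ! grep -qxF -- "$heading" "$file"; then
      echo "missing section: $heading" >&2
      return 1
    fi
  done
}
```

Calling `has_idd_sections instruction.md` prints the first missing heading to stderr and returns non-zero, so it can be used in a pre-commit hook.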
Validate before committing
Run the IDD scorer on your instruction to catch quality issues early:
Step 2: Write tests/test.sh#
The test script runs inside the Docker sandbox after the agent finishes. Exit code 0 means pass, non-zero means fail.
#!/bin/bash
set -e
# Check the file exists
test -f /workspace/output.txt
# Check the content
grep -q "expected string" /workspace/output.txt
echo "PASS"
Common test patterns
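A few checks that tend to recur in test scripts are sketched below as small bash functions. The function names are illustrative, not part of CoCo, and the `WORKSPACE` override exists only so the helpers can be tried outside the sandbox; inside the sandbox it defaults to /workspace.

```shell
#!/bin/bash
set -euo pipefail

# Defaults to the sandbox path; override only for dry runs outside the sandbox.
WORKSPACE="${WORKSPACE:-/workspace}"

# Pattern 1: a file exists and is not empty.
check_nonempty_file() {
  test -s "$WORKSPACE/$1"
}

# Pattern 2: a file contains a fixed string (-F disables regex interpretation).
check_contains() {
  grep -qF -- "$2" "$WORKSPACE/$1"
}

# Pattern 3: a file does NOT contain a string, e.g. a leftover marker.
check_not_contains() {
  if grep -qF -- "$2" "$WORKSPACE/$1"; then
    echo "FAIL: found unexpected text '$2' in $1" >&2
    return 1
  fi
}

# Pattern 4: a file has exactly the expected number of lines.
check_line_count() {
  [ "$(wc -l < "$WORKSPACE/$1")" -eq "$2" ]
}
```

In a test.sh these would be called one per line, for example `check_contains output.txt "expected string"`. Because of `set -e`, the first failing check ends the script with a non-zero exit code, which the harness reads as a failure.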
Step 3: Write task.toml#
Minimal configuration:
See Task Configuration for all options.
Step 4: Run#
Design principles#
One eval, one behavior#
Each eval tests exactly one thing. Do not combine "create a file AND fix a bug AND run tests" into a single task. Split them:
evals/
├── create-config-file/ # tests: can agent create config?
├── fix-import-error/ # tests: can agent fix imports?
└── run-test-suite/ # tests: can agent run pytest?
This gives you:
- Clear failure signals (you know which behavior broke)
- Parallel execution (faster feedback)
- Easy addition and removal of individual tests
Starter files#
If the agent needs existing files to work with (code to fix, config to modify), place them in a starter/ directory:
fix-import-error/
├── task.toml
├── instruction.md
├── starter/
│   └── app.py # (1)!
└── tests/
    └── test.sh
- This broken file gets copied to /workspace/app.py before the agent runs.
Files in starter/ are copied to /workspace/ before the agent starts.
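To inspect what the agent will see, the pre-run workspace can be reproduced locally. The sketch below assumes the harness does a plain recursive copy of starter/ into the workspace; `stage_starter` is a hypothetical local stand-in, not the harness's own code.

```shell
#!/bin/bash
set -euo pipefail

# Reproduce the pre-run workspace for a task locally.
# Assumption: the harness performs a plain recursive copy of starter/.
stage_starter() {
  local task_dir="$1" workspace="$2"
  mkdir -p "$workspace"
  if [ -d "$task_dir/starter" ]; then
    # The trailing "/." copies the directory's contents, including dotfiles,
    # instead of nesting a starter/ directory inside the workspace.
    cp -R "$task_dir/starter/." "$workspace/"
  fi
}
```

For example, `stage_starter fix-import-error /tmp/workspace` would leave the broken app.py at /tmp/workspace/app.py, which is handy for checking that tests/test.sh fails before the agent has done anything.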
Custom Docker environment#
If your eval needs specific tools or services, add a compose.yaml:
services:
  default: # (1)!
    build:
      context: ../../src/inspect_coco/sandbox
      dockerfile: Dockerfile
    init: true
    command: ["tail", "-f", "/dev/null"]
    environment:
      - DATABASE_URL=postgres://localhost/test
- The service must be named default. Inspect requires this convention.
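If a test depends on configuration from compose.yaml, it may be worth failing early with a readable message when a variable is missing. The helper below is a hypothetical sketch; it assumes test.sh runs in the `default` service and therefore sees that service's environment.

```shell
#!/bin/bash
set -euo pipefail

# Fail with a readable message if a required environment variable is unset or empty.
# Hypothetical helper, not part of CoCo or Inspect.
require_env() {
  local name="$1"
  # ${!name:-} is bash indirect expansion: the value of the variable named by $name,
  # or an empty string if that variable is unset.
  if [ -z "${!name:-}" ]; then
    echo "FAIL: $name is not set" >&2
    return 1
  fi
}
```

A test.sh could call `require_env DATABASE_URL` before its other checks, so a misconfigured environment is not mistaken for an agent failure.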
Scaffolding from a plugin#
If you have a CoCo plugin with skills, the scaffold command auto-generates evals:
It reads your .cortex-plugin/plugin.json, finds leaf skills (skipping routers), and creates one eval per skill with IDD-structured instructions.