task.toml Reference

Every eval task has a task.toml file that controls how it runs.

Minimal Example

version = "1.0"

[metadata]
name = "my-task"

[agent]
timeout_sec = 900

That's it. Everything else has sensible defaults.

Full Example (annotated)

version = "1.0"

[metadata]
name = "my-task"
description = "What this eval tests"

# Run the task 3 times to measure consistency (pass@k)
epochs = 3

# Instruction quality threshold. Below this = warning before running.
idd_threshold = 0.6

# Set to true to block execution when instruction quality is low
idd_strict = false

[agent]
# How long cortex exec can run (seconds)
timeout_sec = 900

# Maximum tool-use turns before stopping
max_turns = 30

# Override model (default: CoCo auto mode picks the best model)
# model = "claude-sonnet-4-5"

# Snowflake connection name from your connections.toml
# connection = "default"

# Working directory inside the Docker container
workdir = "/workspace"

# Disable specific bundled skills during this eval
# remove_skills = ["developing-with-streamlit-in-snowflake"]

[environment]
# Custom test command (default: bash /workspace/tests/test.sh)
# test_cmd = "pytest /workspace/tests -v"

# How long the test script can run (seconds)
test_timeout = 300

# Custom Docker Compose file (relative to task directory)
# compose = "compose.yaml"

Sections

`[metadata]`

| Field | Type | Default | Description |
|---|---|---|---|
| `name` | string | directory name | Identifier shown in results |
| `description` | string | - | Human-readable explanation |
| `epochs` | int | 3 | Number of times the task is run, to measure consistency |
| `idd_threshold` | float | 0.6 | Minimum instruction quality score |
| `idd_strict` | bool | false | Fail (not just warn) on low score |
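
The interaction between `idd_threshold` and `idd_strict` can be sketched as a small decision function. This is an illustration of the documented behaviour, not the tool's actual code: the function name `idd_gate` and the return labels are hypothetical, and treating a score exactly equal to the threshold as passing is an assumption.

```python
def idd_gate(score: float, threshold: float = 0.6, strict: bool = False) -> str:
    """Return 'run', 'warn', or 'block' for an instruction quality score.

    Hypothetical helper: names and labels are illustrative only.
    """
    if score >= threshold:
        return "run"
    # Below the threshold: strict mode blocks execution, otherwise only warn.
    return "block" if strict else "warn"


print(idd_gate(0.72))               # run
print(idd_gate(0.45))               # warn
print(idd_gate(0.45, strict=True))  # block
```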

`[agent]`

| Field | Type | Default | Description |
|---|---|---|---|
| `timeout_sec` | int | 900 | Agent execution timeout (seconds) |
| `max_turns` | int | - | Cap on tool-use turns |
| `model` | string | - | Model override (omit for auto) |
| `connection` | string | - | Named Snowflake connection from connections.toml |
| `workdir` | string | /workspace | Container working directory |
| `remove_skills` | list | - | Bundled skills to disable |

`[environment]`

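| Field | Type | Default | Description |
|---|---|---|---|
| `test_cmd` | string | `bash /workspace/tests/test.sh` | Verification command |
| `test_timeout` | int | 300 | Test execution timeout (seconds) |
| `compose` | string | - | Custom compose file path (relative to the task directory) |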
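
To see how the documented defaults combine with a task file, here is a minimal sketch that overlays a parsed task.toml on the defaults from the tables above. It takes an already-parsed dictionary (for example from `tomllib.load()` on Python 3.11+). The names `DEFAULTS` and `resolve_config` are hypothetical, and the real loader may merge values differently.

```python
from copy import deepcopy

# Defaults as documented in the tables above. Fields whose default is "-"
# are left unset. This dict is illustrative, not the tool's own structure.
DEFAULTS = {
    "metadata": {"epochs": 3, "idd_threshold": 0.6, "idd_strict": False},
    "agent": {"timeout_sec": 900, "workdir": "/workspace"},
    "environment": {
        "test_cmd": "bash /workspace/tests/test.sh",
        "test_timeout": 300,
    },
}


def resolve_config(parsed: dict, task_dir_name: str) -> dict:
    """Overlay a parsed task.toml (a dict of sections) on the documented defaults."""
    config = deepcopy(DEFAULTS)
    for section, values in parsed.items():
        if isinstance(values, dict):
            config.setdefault(section, {}).update(values)
        else:
            config[section] = values  # top-level keys such as `version`
    # `name` falls back to the task directory name.
    config["metadata"].setdefault("name", task_dir_name)
    return config


# Dictionary equivalent of the minimal example.
minimal = {
    "version": "1.0",
    "metadata": {"name": "my-task"},
    "agent": {"timeout_sec": 900},
}
config = resolve_config(minimal, task_dir_name="my-task")
print(config["metadata"]["epochs"])       # 3
print(config["environment"]["test_cmd"])  # bash /workspace/tests/test.sh
```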

Epochs and Consistency

Epochs control how many times the same task runs. This measures pass@k:

epochs = 1 - single run, no consistency data
epochs = 3 - default, basic consistency signal
epochs = 5 - stronger signal, slower

Example: if a task passes 2 out of 3 epochs, the pass rate is about 67%. A well-written instruction (high IDD score) should pass every epoch.
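
The pass rate is simply the number of passing epochs divided by the total. A small sketch, assuming each epoch's outcome is available as a boolean (the function name `pass_rate` is hypothetical):

```python
def pass_rate(results: list[bool]) -> float:
    """Fraction of epochs that passed."""
    if not results:
        raise ValueError("need at least one epoch result")
    return sum(results) / len(results)


results = [True, True, False]       # 2 of 3 epochs passed
print(round(pass_rate(results), 3)) # 0.667
print(all(results))                 # False: the task did not pass every epoch
```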

Important

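Higher epoch counts mean longer total run time, because each epoch runs the full agent + test cycle. A task with a 900s agent timeout and 5 epochs could spend up to 75 minutes in the agent alone (5 × 900s), plus up to 300s of test time per epoch at the default test_timeout.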
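
A rough upper bound on run time can be worked out from the timeouts. The sketch below assumes epochs run one after another and ignores container build and startup time. Both are assumptions, and the helper name `worst_case_minutes` is hypothetical.

```python
def worst_case_minutes(epochs: int, timeout_sec: int, test_timeout: int = 0) -> float:
    """Upper bound in minutes, assuming sequential epochs and no setup overhead."""
    return epochs * (timeout_sec + test_timeout) / 60


print(worst_case_minutes(5, 900))       # 75.0  (agent time only)
print(worst_case_minutes(5, 900, 300))  # 100.0 (agent + default test timeout)
```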

File Structure

my-task/
├── task.toml          # This file
├── instruction.md     # Agent prompt (IDD-structured)
├── tests/
│   └── test.sh        # Verification (exit 0 = pass)
├── starter/           # Optional: files copied to /workspace
└── compose.yaml       # Optional: custom Docker environment
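
A quick pre-flight check of this layout could look like the sketch below. It assumes the three entries not marked optional in the tree are required. In particular, `tests/test.sh` may not be needed if `test_cmd` is overridden. The names `REQUIRED` and `missing_files` are hypothetical.

```python
from pathlib import Path

# Assumption: entries not marked "Optional" in the tree above are required.
REQUIRED = ["task.toml", "instruction.md", "tests/test.sh"]


def missing_files(task_dir: str) -> list[str]:
    """Return the required files that are absent from a task directory."""
    root = Path(task_dir)
    return [rel for rel in REQUIRED if not (root / rel).is_file()]


# An empty list means the layout looks complete.
print(missing_files("my-task"))
```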

Custom Docker Compose

If your eval needs extra services (a database, an API mock) or custom environment variables, add a compose.yaml to the task directory. Inspect auto-discovers it.

services:
  default:
    build:
      context: ../../src/inspect_coco/sandbox
      dockerfile: Dockerfile
    init: true
    command: ["tail", "-f", "/dev/null"]
    environment:
      - MY_CUSTOM_VAR=some-value

Note

The service must be named default. Inspect uses this name to find the primary sandbox.