Inspect CoCo

AI agents are non-deterministic. Test whether yours works reliably.

inspect-coco runs your agent against structured instructions inside isolated Docker containers, verifies output with deterministic tests, and repeats the process to surface flaky behavior.

The core question it answers: does this skill do the right thing every time?

How it works

flowchart TD
    A[instruction.md] --> B[IDD quality check]
    B --> C[Docker sandbox]
    C --> D[cortex exec]
    D --> E[test.sh]
    E --> F{pass/fail}
    F -->|repeat k times| C
    F --> G["pass@k score"]

1. Your instruction (instruction.md) describes what the agent should accomplish, structured with Goal, Requirements, Constraints, and Output sections.
2. The IDD scorer checks instruction quality before anything expensive runs.
3. A Docker sandbox provides a clean, isolated environment for each run.
4. The CoCo agent executes the instruction via cortex exec.
5. A verification script (test.sh) checks whether the agent produced the correct result.
6. Epochs repeat steps 3 through 5 k times to measure consistency, which is reported as a pass@k score.
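The repeat-and-score step can be sketched in a few lines of Python. This is an illustration under assumptions, not inspect-coco's implementation: `run_epochs` and `pass_at_k` are hypothetical names, `run_once` stands in for one sandboxed run followed by its verification script, and the formula is the unbiased pass@k estimator commonly used in code-generation evaluation, which may differ from the score inspect-coco reports.

```python
from math import comb
from typing import Callable


def run_epochs(run_once: Callable[[], bool], epochs: int) -> list[bool]:
    """Call run_once `epochs` times and record pass/fail for each run.

    run_once is a hypothetical stand-in for "start a clean sandbox,
    execute the agent, run the verification script".
    """
    return [run_once() for _ in range(epochs)]


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that at least one of k runs passes.

    n: total recorded runs, c: runs that passed, k: sample size (k <= n).
    Uses the common estimator 1 - C(n - c, k) / C(n, k).
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer failures than k: every sample of k runs contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def all_passed(results: list[bool]) -> bool:
    """Strict consistency check: every recorded run passed."""
    return len(results) > 0 and all(results)
```

For example, three epochs with two passes give `pass_at_k(3, 2, 1)` of about 0.67, while `all_passed` returns `False`. The strict check is the one closest to the question "does this skill do the right thing every time?".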

Why Inspect AI?

Agent evaluation requires sandboxed code execution, not text scoring. See "Why Inspect AI?" for the full rationale and a comparison with Promptfoo, DeepEval, Braintrust, LangSmith, and Eleuther.

Why structured instructions matter

Vague instructions produce inconsistent results. When you tell an agent to "set up the project properly," each run takes a different path. Structured instructions (IDD format) narrow the solution space so the agent converges on the same correct behavior across runs.

This is the hypothesis inspect-coco validates: high instruction quality predicts high pass@k consistency.
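To make the contrast concrete, here is a minimal sketch of a structural check that flags an instruction missing any of the four sections. It is not the IDD scorer, which presumably evaluates far more than section presence. The function name `missing_sections` and the assumption that sections appear as Markdown-style `#` headings are illustrative only.

```python
import re

# The four sections named in the IDD format description.
REQUIRED_SECTIONS = ("Goal", "Requirements", "Constraints", "Output")


def missing_sections(instruction: str) -> list[str]:
    """Return the required section names that have no heading in the text.

    Assumption: each section is introduced by a Markdown-style heading
    such as "## Goal". Matching is case-insensitive.
    """
    headings = {
        match.group(1).lower()
        for match in re.finditer(
            r"^#+[ \t]*(.+?)[ \t]*$", instruction, flags=re.MULTILINE
        )
    }
    return [name for name in REQUIRED_SECTIONS if name.lower() not in headings]


vague = "Set up the project properly."

structured = """\
# Goal
Create a Python project skeleton.

## Requirements
- A pyproject.toml file exists.

## Constraints
- Do not install packages globally.

## Output
- A src/ directory with one package.
"""

print(missing_sections(vague))
print(missing_sections(structured))
```

The vague instruction is reported as missing all four sections, while the structured one is missing none. A check like this is cheap, which is why it makes sense to run it before any container starts.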

Quick start

git clone https://github.com/kameshsampath/inspect-coco.git && cd inspect-coco
task quickstart

This installs dependencies, runs the hello-world eval (3 epochs), and opens the results viewer.

First-time setup

See Getting Started for prerequisites (Docker, Task, Cortex Code CLI, Snowflake connection) and a full walkthrough.

What you get

Pass@k consistency scores across repeated runs
IDD quality feedback on your instructions before running expensive evals
Full transcripts of every agent conversation (tool calls, responses, timing)
Scaffolding that generates eval tasks from existing plugin structure
Zero SaaS dependencies -- everything runs locally with Docker