Inspect CoCo#
AI agents are non-deterministic. Test whether yours works reliably.
inspect-coco runs your agent against structured instructions inside isolated Docker containers, verifies output with deterministic tests, and repeats the process to surface flaky behavior.
The core question it answers: does this skill do the right thing every time?
How it works#
flowchart TD
A[instruction.md] --> B[IDD quality check]
B --> C[Docker sandbox]
C --> D[cortex exec]
D --> E[test.sh]
E --> F{pass/fail}
F -->|repeat k times| C
F --> G["pass@k score"]
- Your instruction describes what the agent should accomplish, structured with Goal, Requirements, Constraints, and Output sections.
- The IDD scorer checks instruction quality before running anything expensive.
- A Docker sandbox provides a clean, isolated environment for each run.
- The CoCo agent executes the instruction via
cortex exec. - A verification script checks whether the agent produced the correct result.
- Epochs repeat the process multiple times to measure consistency.
Why Inspect AI?#
Agent evaluation requires sandboxed code execution, not text scoring. See Why Inspect AI? for the full rationale and comparison with Promptfoo, DeepEval, Braintrust, LangSmith, and Eleuther.
Why structured instructions matter#
Vague instructions produce inconsistent results. When you tell an agent to "set up the project properly," each run takes a different path. Structured instructions (IDD format) narrow the solution space so the agent converges on the same correct behavior across runs.
This is the hypothesis inspect-coco validates: high instruction quality predicts high pass@k consistency.
Quick start#
This installs dependencies, runs the hello-world eval (3 epochs), and
opens the results viewer.
First time setup
See Getting Started for prerequisites (Docker, Task, Cortex Code CLI, Snowflake connection) and a full walkthrough.
What you get#
- Pass@k consistency scores across repeated runs
- IDD quality feedback on your instructions before running expensive evals
- Full transcripts of every agent conversation (tool calls, responses, timing)
- Scaffolding that generates eval tasks from existing plugin structure
- Zero SaaS dependencies -- everything runs locally with Docker