Architecture#
This page describes how inspect-coco works internally and how the pieces fit together.
Execution flow#
When you run inspect-coco run, the following sequence occurs:
sequenceDiagram
participant CLI as inspect-coco run
participant Loader as Task Loader
participant IDD as IDD Scorer
participant Docker as Docker Sandbox
participant Agent as cortex exec
participant Test as Verification
CLI->>Loader: Read task.toml + instruction.md
Loader->>IDD: Score instruction quality
IDD-->>Loader: IDD score + metadata
Loader->>Docker: Start container, mount files
loop For each epoch (pass@k)
Docker->>Agent: Run cortex exec with instruction
Agent-->>Docker: Agent output (files, changes)
Docker->>Test: Execute test.sh
Test-->>Docker: Exit code (0=pass, 1=fail)
end
Docker-->>CLI: Eval log with scores
Credential flow#
The agent needs Snowflake credentials to call the Cortex API. The credential flow varies by authentication method:
flowchart TD
A["~/.snowflake/connections.toml"] --> B[resolve_connection]
B --> C{authenticator?}
C -->|OAUTH_AUTHORIZATION_CODE| D[Load from OS keychain]
C -->|SNOWFLAKE_JWT| E[Read PEM key file]
C -->|PROGRAMMATIC_ACCESS_TOKEN| F[Read token]
D --> G{Tokens cached?}
G -->|Yes| H[Start token proxy thread]
G -->|No| I[Browser OAuth flow]
I --> J[Store in keychain]
J --> H
E --> K[deploy_credentials]
F --> K
H --> L["Sandbox calls GET /token via host-gateway"]
K --> M["/root/.snowflake/connections.toml (inside container)"]
L --> M
M --> N[cortex exec uses connection]
| Auth method | Secret storage | What enters Docker | Refresh capability |
|---|---|---|---|
| OAuth (recommended) | OS keychain (encrypted) | Short-lived access token only | Automatic (silent + browser re-auth) |
| JWT | Host filesystem | PEM private key | N/A (key never expires) |
| PAT | Host filesystem / env var | Token value | N/A (long-lived) |
Recommended for local development
Use authenticator = "OAUTH_AUTHORIZATION_CODE" for the strongest security posture. The sandbox never holds any long-lived credential. See the Security Model page for details.
Connection resolution
The resolver checks INSPECT_COCO_SNOWFLAKE_CONNECTION first, then falls back to a connection named default in your TOML file. Run snow connection list to see available connections.
Sandbox environment#
Each eval runs inside an isolated Docker container built from src/inspect_coco/sandbox/Dockerfile:
flowchart LR
subgraph Container
direction TB
CoCo[Cortex Code CLI]
Py[Python 3.12]
Pytest[pytest]
UV[uv]
end
subgraph HostAccess ["Host Access (via extra_hosts)"]
direction TB
TokenProxy["Token Proxy (OAuth)"]
end
subgraph Mounted
direction TB
Creds[Snowflake credentials]
Starter[starter/ files]
Tests[tests/ scripts]
end
Mounted --> Container
HostAccess --> Container
Container --> Result[Exit code + stdout]
The container includes:
- Python 3.12 as the base runtime
- Cortex Code CLI (beta channel by default, controlled via
CORTEX_CHANNELbuild arg) - pytest for running test suites
- uv for package management inside the sandbox
- host-gateway access to the token proxy (OAuth connections only)
Scoring pipeline#
Two scorers run after the agent completes:
flowchart TD
A[Agent completes] --> B[verification scorer]
A --> C[idd_quality scorer]
B --> D["Runs test.sh in sandbox"]
D --> E["Score: 1.0 (pass) or 0.0 (fail)"]
C --> F["Scores instruction text"]
F --> G["Score: 0.0 to 1.0"]
E --> H[Eval log]
G --> H
The verification scorer produces passed and total metrics (how many epochs passed out of how many ran). The idd_quality scorer produces an idd_score metric (instruction quality rating).
Configuration merge order#
When multiple configuration sources exist, they merge with this priority:
flowchart BT
A[Built-in defaults] --> B[suite.yaml defaults]
B --> C[suite.yaml per-task overrides]
C --> D[task.toml values]
D --> E[CLI flags]
E --> F[Final resolved config]
The highest priority source wins for each setting. CLI flags always override everything else.