Getting Started
This guide walks you through running your first eval with inspect-coco.
Prerequisites
You need the following tools installed and working before proceeding:
| Tool | Purpose | Install |
|---|---|---|
| Python 3.12+ | Runtime | python.org |
| Docker 20.10+ | Sandbox execution | docker.com |
| Task | Task runner | brew install go-task or other methods |
| Snowflake CLI (snow) | Connection setup | pip install snowflake-cli |
| Cortex Code CLI | Agent runtime (beta) | docs.snowflake.com |
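A quick preflight check can confirm the tools in the table are reachable before you continue. The following is a minimal sketch, not part of inspect-coco: the executable names are assumptions (`snow` and `cortex` appear in this guide, while `docker` and `task` are the usual binary names for Docker and go-task), and the helper names are hypothetical.

```python
import shutil
import sys

# Executable names are assumptions: "snow" and "cortex" appear in this guide,
# "docker" and "task" are the usual binary names for Docker and go-task.
REQUIRED_TOOLS = ["docker", "task", "snow", "cortex"]
MIN_PYTHON = (3, 12)


def missing_tools(names, which=shutil.which):
    """Return the names that cannot be found on PATH."""
    return [name for name in names if which(name) is None]


def python_ok(version_info=sys.version_info, minimum=MIN_PYTHON):
    """Return True when the interpreter meets the minimum major.minor version."""
    return tuple(version_info[:2]) >= minimum


if __name__ == "__main__":
    if not python_ok():
        print(f"Python {MIN_PYTHON[0]}.{MIN_PYTHON[1]}+ is required")
    for name in missing_tools(REQUIRED_TOOLS):
        print(f"missing: {name}")
```

This only checks that each executable is on `PATH`. It does not verify the Docker 20.10+ requirement or that the Docker daemon is running.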
Authentication
Password authentication is not supported. You must use one of:
- Local OAuth (OAUTH_AUTHORIZATION_CODE): recommended for local development. Browser login, tokens stored in OS keychain, no secrets in Docker.
- Key-pair authentication (JWT) with a PEM private key file
- Programmatic Access Token (PAT)
See the Security Model for a comparison.
Install inspect-coco
The recommended approach is to clone the repo and use the Taskfile:
Alternatively, install as a dependency in another project:
Configure Snowflake connection
inspect-coco reads your existing ~/.snowflake/connections.toml file. If you already use the snow CLI or Cortex Code, you are set.
To create a new connection:
Or edit ~/.snowflake/connections.toml directly:
Non-default connections
If your connection is not named default, set the environment variable:
Or create a .env file in your project root:
Custom config location
Set SNOWFLAKE_HOME if your configuration lives somewhere other than ~/.snowflake.
Run your first eval
If you cloned the repo, you can run everything in one command:
Or step by step:
This does the following:
- Checks instruction quality (IDD score).
- Starts a Docker container with Cortex Code.
- Runs the instruction through cortex exec.
- Executes the test script to verify the result.
- Repeats 3 times (epochs) for consistency measurement.
View results
This opens a browser-based log viewer showing:
- Pass/fail per epoch (pass@k consistency)
- Full conversation transcript (messages, tool calls)
- Token usage and timing
- Scorer output (verification results and IDD quality)
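To make the per-epoch pass/fail idea concrete, the sketch below summarizes repeated runs of a single eval. It is only an illustration of how consistency across epochs can be read: this guide does not describe how the log viewer computes its figures, and the function name and output fields here are hypothetical.

```python
def summarize_epochs(results):
    """Summarize pass/fail outcomes from repeated runs of one eval.

    `results` is a sequence of booleans, one per epoch.
    """
    if not results:
        raise ValueError("at least one epoch result is required")
    passed = sum(1 for outcome in results if outcome)
    return {
        "epochs": len(results),
        "passed": passed,
        "pass_rate": passed / len(results),
        "any_pass": passed > 0,  # at least one epoch succeeded
        "all_pass": passed == len(results),  # fully consistent success
    }


if __name__ == "__main__":
    # Example: three epochs, the second one failed.
    print(summarize_epochs([True, False, True]))
```

An eval that passes in some epochs but not all indicates an inconsistent result, which is what repeating the run is meant to surface.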
Next steps
- Writing Evals: creating your own eval tasks
- IDD Scoring: understanding instruction quality
- Metrics and Reporting: interpreting results
- CLI Reference: all available commands