# Evaluate your codebase agent readiness

{% hint style="warning" %}
This feature is in public beta. Workflows and output formats may change in upcoming releases.
{% endhint %}

This page outlines a toolkit for measuring your codebase's **agent readiness** — how well your context files (skills, rules, documentation) enable an AI agent to complete real tasks on your codebase. It covers scenario definition, running agents with different setups, testing variations, and comparing results.

## TL;DR

Tessl Evals is a toolkit for measuring your codebase's **agent readiness** — specifically, the impact that your agent configuration, model choice, and context files (skills, CLAUDE.md, etc.) have on your agent's ability to work on your codebase.

For example, you might want to know whether the skills and MD files your agent uses are helping or hurting its ability to complete real tasks. Or you might want to test how a different agent or model would perform on your codebase, and what it would cost.

The key idea is that instead of synthetic tests, you base evaluations on real work that's already been done. Here's how it fits together:

* **You identify a realistic task.** By looking at recent commits, you find a meaningful change that was made to your codebase — for example, a commit that added a fraud detection report feature. This becomes the basis for your scenario.
* **You generate a scenario from it.** Tessl analyzes the commit diff and produces a task description (what an agent would be asked to do) and a scoring rubric (how to judge whether the agent did it correctly) — essentially reconstructing the intent of the original change as an agent task.
* **You define the experiment.** This is where you choose what to test. You might run the scenario with your context files stripped out to get a baseline — then run it again with them injected, so you can see the delta. Or you might run the same task against multiple agents or models to compare performance and cost.
* **You read the results.** The scores tell you, concretely, the extent to which your agentic setup is helping or hindering your agent's ability to complete the kinds of tasks that are actually being done on your codebase.

## How it works

Each scenario is solved **twice** by default:

1. **Baseline** — the agent works on the repo with your context files stripped out
2. **With context** — the agent works on the repo with your context files injected

Comparing the two scores gives you a concrete **agent readiness signal** — whether your skills, rules, and documentation are making the agent more effective on real tasks.

## Prerequisites

* Tessl [installed](https://docs.tessl.io/introduction-to-tessl/installation) (latest version)
* Logged into Tessl
* Access to a [workspace](https://docs.tessl.io/reference/workspaces) (you must be at least a [Member](https://docs.tessl.io/reference/roles))
* Your GitHub or GitLab account connected in workspace settings

***

## Shortcut: use the eval skills

If you'd prefer to have an AI agent walk you through the process interactively, three Tessl skills automate the full pipeline:

* [**tessl-labs/eval-setup**](https://tessl.io/registry/tessl-labs/eval-setup) — guides you through commit selection, scenario generation, multi-agent configuration, and your first eval run
* [**tessl-labs/eval-improve**](https://tessl.io/registry/tessl-labs/eval-improve) — analyzes your results, classifies failing criteria, applies targeted fixes to your context files, and re-runs to verify improvement
* [**tessl-labs/scenarios-review**](https://tessl.io/registry/tessl-labs/scenarios-review) — audits your scenario suite for rubric quality, task clarity, coverage gaps, and difficulty spread, and prints a concise quality report

```sh
tessl install tessl-labs/eval-setup
tessl install tessl-labs/eval-improve
tessl install tessl-labs/scenarios-review
```

Once installed, invoke them with `/eval-setup`, `/eval-improve`, or `/scenarios-review` in your AI coding agent.

***

## Step 1: (Optional) Browse commits and pick what to evaluate

A **scenario** defines a task for an agent, the starting state of the codebase, and a scoring rubric to evaluate the agent's output.

Scenarios can be [written by hand](#writing-your-own-scenarios) — skip to [Step 5](#step-5-run-the-eval) if you already have them. If you have specific commit hashes or PR numbers in mind, skip straight to [Step 2](#step-2-optional-generate-scenarios-from-commits-or-prs). Otherwise, use `tessl repo select-commits` to browse recent commits and choose which ones to turn into scenarios. This command is not listed in `tessl --help` but works when invoked directly.

```sh
$ tessl repo select-commits <org/repo>
```

**Useful flags:**

| Flag                  | Example               | Description                       |
| --------------------- | --------------------- | --------------------------------- |
| `--keyword`           | `--keyword=feat`      | Filter by commit message keyword  |
| `--author`            | `--author="Alice"`    | Filter by author name             |
| `--since` / `--until` | `--since=2026-01-01`  | Date range (YYYY-MM-DD)           |
| `--count` / `-n`      | `--count=20`          | Number of commits to show (1–100) |
| `--workspace` / `-w`  | `--workspace=engteam` | Required outside interactive mode |

**Output:** a table of `Hash | Date | Author | Message`. Copy the hashes you want to pass to the next step.

**Prerequisite:** your GitHub or GitLab account must be connected in workspace settings. If it isn't, the error message includes a direct link to the settings page.

***

## Step 2: (Optional) Generate scenarios from commits or PRs

Tessl analyzes the commit diffs and generates a set of task scenarios. You can provide either commit hashes or pull request numbers:

```sh
# From commits
$ tessl scenario generate <org/repo> --commits=<hash1>,<hash2>

# From PRs
$ tessl scenario generate <org/repo> --prs <number1>,<number2>
```

**Flags:**

| Flag                 | Example                   | Description                                   |
| -------------------- | ------------------------- | --------------------------------------------- |
| `--commits`          | `--commits=abc123,def456` | Comma-separated commit hashes                 |
| `--prs`              | `--prs 42,107`            | Comma-separated PR numbers                    |
| `--context`          | `--context="*.mdc,*.md"`  | Glob patterns identifying your context files  |
| `--workspace` / `-w` | `--workspace=engteam`     | Required outside interactive mode             |
| `--json`             |                           | Output generation IDs as JSON without polling |

### The --context flag

`--context` tells Tessl which files in your repo are context files — skills, rules, documentation, etc. These patterns are stored in each generated `scenario.json` as `fixture.exclude` and serve two purposes:

* They are **stripped** from the repo for the baseline run so the agent works without context
* They are **injected back** for the with-context run so you can measure the delta

When omitted, Tessl defaults to: `*.mdc`, `*.md`, `tile.json`, `tessl.json`, `.tessl/`
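For example, passing `--context="*.mdc,docs/**"` (globs chosen here purely for illustration) would be recorded in each generated `scenario.json` like this, driving both the baseline strip and the with-context injection:

```json
{
  "fixture": {
    "exclude": ["*.mdc", "docs/**"]
  }
}
```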

Generation runs server-side. The CLI polls until complete. If you press Ctrl-C, the job keeps running; check on it later with `tessl scenario list`.

### Output

When generation completes, the CLI prints a table with the **Scenario ID** (the ID you pass to `tessl scenario download`) and either the source commit or PR:

```
# From --commits
✔ Generated 3 scenarios
Scenario ID                           Commit ID
019cb3cf-5f2b-776d-aead-5714d74ef567  f9256db
019cb3d1-10fc-755e-af3f-c079a32d8c49  a13eb72
019cb3d3-abcd-1234-beef-cafe00000000  3cd6e7b

# From --prs
✔ Generated 1 scenario
Scenario ID                           PR
019cb9f3-e169-73fd-b0e4-5943e244b915  PR #1234
```

Each commit or PR produces its own generation with its own Scenario ID. Keep these IDs handy — you'll need them in Step 4 if you passed multiple commits or PRs.

***

## Step 3: (Optional) Review the generation

*Applies only if you used `tessl scenario generate` in Step 2.*

Before downloading, you can inspect what was generated.

```sh
# List recent generations
$ tessl scenario list

# Inspect a specific generation
$ tessl scenario view <id>

# Inspect the most recent generation
$ tessl scenario view --last
```

`tessl scenario list` shows a table of ID, Workspace, Status, Created By, and Created. `tessl scenario view` shows metadata and a table of generated scenarios with titles and checklist item counts.

***

## Step 4: (Optional) Download scenarios to disk

*Applies only if you used `tessl scenario generate` in Step 2. If you wrote your own scenarios by hand, skip to* [*Step 5*](#step-5-run-the-eval)*.*

```sh
$ tessl scenario download --last
```

Or with a specific ID:

```sh
$ tessl scenario download <id>
```

{% hint style="info" %}
`--last` downloads scenarios from the single most recent generation. If you passed multiple commits to `tessl scenario generate`, each commit produces its own generation with its own ID — `--last` will only get the most recent one. Use `tessl scenario list` to find the other IDs and download each separately.
{% endhint %}

**Flags:**

| Flag                | Description                                                                               |
| ------------------- | ----------------------------------------------------------------------------------------- |
| `--last`            | Download from the most recent generation                                                  |
| `--output` / `-o`   | Output directory (default: `evals`)                                                       |
| `--strategy` / `-s` | `merge` (default) adds alongside existing scenarios; `replace` clears the directory first |

**What lands on disk:**

```
evals/
  <7-char-hash>-<slug>/
    task.md          ← task brief shown to the agent
    criteria.json    ← weighted checklist rubric
    scenario.json    ← fixture with repo URL, commit ref, and context exclude patterns
```

You can edit `task.md` and `criteria.json` before running — your edits are picked up at run time. See [File formats](#file-formats) below.

***

## Step 5: Run the eval

Run from the parent directory of your scenarios folder:

```sh
$ tessl eval run ./evals/ --workspace=<workspace>
```

The CLI auto-detects that this is a codebase eval from the `scenario.json` fixtures and applies smart defaults:

| Setting         | Default                                | Override with                      |
| --------------- | -------------------------------------- | ---------------------------------- |
| Agent           | `claude:claude-sonnet-4-6`             | `--agent=<agent:model>`            |
| Context pattern | `fixture.exclude` from `scenario.json` | `--context-pattern="<globs>"`      |
| Context ref     | `infer` (same commit as fixture)       | `--context-ref=<infer\|HEAD\|SHA>` |
| Workspace       | *(none — required)*                    | `--workspace=<name>`               |

{% hint style="warning" %}
`--workspace` is required. If any scenarios in the target directory are missing a `fixture.exclude` field (e.g. scenarios downloaded before the `--context` flag was introduced), the command will fail with `"No context patterns available — scenario.json is missing fixture.exclude"`. Regenerate those scenarios with `--context` or move new scenarios into a subdirectory and point the command at that instead.
{% endhint %}

Because the context pattern defaults from the fixture, **baseline vs with-context runs happen automatically** — no extra flags needed.

### Running with a specific agent

By default, evals run using `claude:claude-sonnet-4-6`. You can override this with `--agent`:

```sh
$ tessl eval run ./evals/ --workspace=<workspace> --agent=claude:claude-opus-4-6
```

Each `--agent` creates a separate eval run. To compare models, pass multiple `--agent` flags:

```sh
$ tessl eval run ./evals/ --workspace=<workspace> \
  --agent=claude:claude-sonnet-4-6 \
  --agent=claude:claude-opus-4-6 \
  --agent=claude:claude-haiku-4-5
```

Supported Claude models:

| Model               | Notes                |
| ------------------- | -------------------- |
| `claude-sonnet-4-6` | Default              |
| `claude-opus-4-6`   | Most capable         |
| `claude-sonnet-4-5` |                      |
| `claude-opus-4-5`   |                      |
| `claude-haiku-4-5`  | Fastest, lowest cost |

### Testing updated context against historical scenarios

To test your latest context files against scenarios that were generated from older commits:

```sh
$ tessl eval run ./evals/ --context-ref=HEAD
```

`--context-ref=HEAD` sources context files from the latest commit on the default branch instead of the commit in `fixture.ref`. This lets you measure how context improvements affect performance on historical tasks.

### Output

```
✔ Ran 5 scenarios with 1 agent
  ✔ claude:claude-sonnet-4-6  019c81f5-ac2a-746f-bed1-78a875a0480d

  View in browser: https://tessl.io/eval-runs/019c81f5-ac2a-746f-bed1-78a875a0480d
ℹ Run tessl eval view <id> to see details for a run.
ℹ Run tessl eval compare ./evals/ --workspace=<workspace> to compare results for this batch.
```

Ctrl-C detaches without cancelling — runs continue server-side. The CLI prints each run ID so you can check progress later with `tessl eval view <id>` or `tessl eval list`.

***

## Step 6: Compare results

```sh
$ tessl eval compare ./evals/ --workspace=<workspace>
```

Pass the same scenarios directory used with `eval run`. The CLI fingerprints your local scenarios and fetches matching results from the server. The delta between baseline and with-context scores is your **agent readiness improvement** — how much your context files are moving the needle.

**With context comparison** (the default when `fixture.exclude` is present):

```
org/repo (./evals/)
  2 scenarios with results, 1 agent, 4 runs

  Averages
    claude:claude-sonnet-4-6   baseline: 50%
                               *.mdc,*.md,tile.json,tessl.json,.tessl/ @ infer: 80%   Δ +30pp
                               (4 runs)
```

**Score colors:** 🟢 ≥ 80% 🟡 ≥ 50% 🔴 < 50%

{% hint style="info" %}
**Results can vary between runs.** Because you're evaluating an AI agent, scores are not fully deterministic — the same scenario run twice may produce slightly different results. Treat scores as signals rather than exact measurements, and expect some run-to-run variance when comparing results.
{% endhint %}

Use `--breakdown` to see per-scenario detail:

```sh
$ tessl eval compare ./evals/ --breakdown --workspace=<workspace>
```

***

## Checking status and viewing results

| Command                          | Description                               |
| -------------------------------- | ----------------------------------------- |
| `tessl eval list`                | List all eval runs                        |
| `tessl eval list --mine`         | Only your runs                            |
| `tessl eval list --type project` | Only codebase eval runs                   |
| `tessl eval view <id>`           | Detailed results for a specific run       |
| `tessl eval view --last`         | Detailed results for your most recent run |
| `tessl eval retry <id>`          | Re-run a failed eval                      |

If you lose a run ID, `tessl eval list` will find it.

***

## File formats

### scenario.json

Generated by `tessl scenario download`. Defines the fixture for the eval run.

```json
{
  "type": "coding",
  "fixture": {
    "type": "commit",
    "repoUrl": "https://github.com/org/repo.git",
    "ref": "24829180ba1fafb86b...",
    "exclude": ["*.mdc", "*.md", "tile.json", "tessl.json", ".tessl/"]
  }
}
```

* `fixture.ref` — the parent commit hash (the starting state for the agent)
* `fixture.exclude` — context patterns stripped for baseline; also used as the default `--context-pattern` at run time
* `fixture.repoUrl` — full clone URL

### task.md

Free-form markdown. This is the **only file the agent sees** — it has no access to `criteria.json`. Typically structured with Problem, Expected Behavior, and Acceptance Criteria sections. You can edit this freely before running.
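As a sketch, a generated `task.md` for the fraud detection example earlier on this page might look like the following (contents entirely hypothetical — your generated briefs will reflect your own commits):

```markdown
# Add a fraud detection report

## Problem
Analysts have no way to export a summary of flagged transactions.

## Expected Behavior
A new report endpoint returns flagged transactions grouped by the rule that flagged them.

## Acceptance Criteria
- The report includes every transaction flagged in the selected period
- Existing transaction endpoints are unchanged
```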

### criteria.json

Defines how the agent's solution is scored.

```json
{
  "context": "Brief description of what is being evaluated.",
  "type": "weighted_checklist",
  "checklist": [
    {
      "name": "fixes_syntax_error",
      "description": "The solution adds the missing closing parenthesis...",
      "max_score": 1,
      "category": "INTENT"
    }
  ]
}
```

Required fields: `context`, `type` (`"weighted_checklist"`), `checklist` (array with `name`, `description`, `max_score`).

**Checklist categories:**

| Category      | Description                                                                                            |
| ------------- | ------------------------------------------------------------------------------------------------------ |
| `INTENT`      | Core feature or behavior the change introduces; verifies the solution addresses what the task requests |
| `DESIGN`      | Architectural or structural choices                                                                    |
| `MUST_NOT`    | Things the solution should avoid or never do                                                           |
| `MINIMALITY`  | Appropriate scope of changes — solution does what's needed without overreaching                        |
| `REUSE`       | Leveraging existing utilities or patterns rather than reimplementing                                   |
| `INTEGRATION` | How the solution connects with existing code                                                           |
| `EDGE_CASE`   | Boundary conditions handled correctly                                                                  |

**Scoring:** `(sum of scores / sum of max_scores) × 100`. The LLM grader can award partial credit.
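The scoring arithmetic can be sketched in a few lines (illustrative only — the real grading is done server-side by the LLM grader, and the item names below are made up):

```python
def checklist_score(checklist, awarded):
    """Compute (sum of scores / sum of max_scores) * 100 for a weighted checklist.

    `awarded` maps item name -> score given by the grader (partial credit allowed).
    """
    total_max = sum(item["max_score"] for item in checklist)
    total_awarded = sum(awarded.get(item["name"], 0) for item in checklist)
    return total_awarded / total_max * 100

# Two items worth 1 and 3; full credit on the first,
# partial credit (1.5 of 3) on the second -> 2.5 / 4 = 62.5%
items = [
    {"name": "fixes_syntax_error", "max_score": 1},
    {"name": "adds_report_endpoint", "max_score": 3},
]
print(checklist_score(items, {"fixes_syntax_error": 1, "adds_report_endpoint": 1.5}))  # 62.5
```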

***

## Writing your own scenarios

You can hand-author scenarios without using `tessl scenario generate`:

```
evals/
  my-custom-scenario/
    task.md
    criteria.json
    scenario.json
```

`fixture.ref` should be the parent of the ground truth commit. `fixture.exclude` defines what gets stripped for baseline and serves as the default `--context-pattern`.
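If you hand-author `criteria.json`, a quick sanity check along these lines (a local sketch, not part of the Tessl CLI) can catch missing required fields before you run:

```python
import json

REQUIRED_ITEM_FIELDS = {"name", "description", "max_score"}

def validate_criteria(raw):
    """Return a list of problems with a hand-written criteria.json; empty means OK."""
    problems = []
    data = json.loads(raw)
    # Top-level required fields: context, type, checklist
    for field in ("context", "type", "checklist"):
        if field not in data:
            problems.append(f"missing required field: {field}")
    if data.get("type") not in (None, "weighted_checklist"):
        problems.append('type must be "weighted_checklist"')
    # Each checklist item needs name, description, max_score
    for i, item in enumerate(data.get("checklist", [])):
        missing = REQUIRED_ITEM_FIELDS - item.keys()
        if missing:
            problems.append(f"checklist[{i}] missing: {', '.join(sorted(missing))}")
    return problems

print(validate_criteria('{"context": "x", "type": "weighted_checklist", "checklist": [{"name": "a"}]}'))
# ['checklist[0] missing: description, max_score']
```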

***

## Quick reference

```sh
# Option A: generate scenarios from commits
$ tessl repo select-commits org/repo --keyword=feat --workspace=<workspace>
$ tessl scenario generate org/repo --commits=abc123,def456 --workspace=<workspace>
$ tessl scenario download --last

# Option A (alt): generate scenarios from PRs
$ tessl scenario generate org/repo --prs 42,107 --workspace=<workspace>
$ tessl scenario download --last

# Option B: write your own scenarios by hand
# Create evals/<name>/{task.md,criteria.json,scenario.json} — see "Writing your own scenarios"

# Then run and compare (same either way)
$ tessl eval run ./evals/ --workspace=<workspace>
$ tessl eval compare ./evals/ --breakdown --workspace=<workspace>

# Run with a specific model
$ tessl eval run ./evals/ --workspace=<workspace> --agent=claude:claude-opus-4-6

# Compare multiple Claude models
$ tessl eval run ./evals/ --workspace=<workspace> \
  --agent=claude:claude-sonnet-4-6 \
  --agent=claude:claude-opus-4-6 \
  --agent=claude:claude-haiku-4-5

# Test updated context against historical scenarios
$ tessl eval run ./evals/ --workspace=<workspace> --context-ref=HEAD

# Check status
$ tessl eval list --mine --type project
$ tessl eval view --last
```
