# Evaluate your codebase agent readiness

{% hint style="warning" %}
This feature is in public beta. Workflows and output formats may change in upcoming releases.
{% endhint %}

{% hint style="info" %}
Eval runs require a Tessl project. A Tessl project is the stable home for Tessl in a Git repository. It identifies that repository and acts as the anchor for the eval runs and other repository-connected data attached to it. See [Projects overview](/projects/overview.md) and [Projects for evals](/projects/evals.md).
{% endhint %}

This page outlines a toolkit for measuring your codebase's **agent readiness** — how well your context files (skills, rules, documentation) enable an AI agent to complete real tasks on your codebase. It covers scenario definition, running agents with different setups, testing variations, and comparing results.

## TL;DR

Tessl Evals is a toolkit for measuring your codebase's **agent readiness** — specifically, the impact that your agent configuration, model choice, and context files (skills, CLAUDE.md, etc.) are having on your agent's ability to work on your codebase.

For example, you might want to know whether the skills and MD files your agent uses are helping or hurting its ability to complete real tasks. Or you might want to test how a different agent or model would perform on your codebase, and what it would cost.

The key idea is that instead of synthetic tests, you base evaluations on real work that's already been done. Here's how it fits together:

* **You identify a realistic task.** By looking at recent commits, you find a meaningful change that was made to your codebase, for example a commit that added a fraud detection report feature. This becomes the basis for your scenario.
* **You generate a scenario from it.** Tessl analyses the commit diff and produces a task description (what an agent would be asked to do) and a scoring rubric (how to judge whether the agent did it correctly) — essentially reconstructing the intent of the original change as an agent task.
* **You choose what to test.** By default the run is the baseline — the scenario as authored. Pass `--context` with the files or plugin under test to add a second variant with that context injected, and compare the two to see the delta. Or run the same task against a different agent or model to compare performance and cost.
* **You read the results.** The scores tell you, concretely, the extent to which your agentic setup is helping or hindering your agent's ability to complete the kinds of tasks that are actually being done on your codebase.

{% hint style="info" %}
The Tessl toolkit includes several tools for improving your skills. Refer to the [Overview](/improving-your-skills/overview-improving-skills-and-plugins.md) for more information on the reviews, evals, and skills available.
{% endhint %}

## How it works

A scenario captures a real task: a starting state of the codebase, a task brief, and a scoring rubric. For scenarios generated from a commit, that starting state has your context files (skills, rules, docs) stripped out — that stripping is a `tessl scenario generate` step (the `fixtures.codebase.exclude` patterns in `scenario.json`), covered in [Step 2](#step-2-optional-generate-scenarios-from-commits-or-prs). Running an eval just runs whatever scenarios you point it at.

Each run produces up to two variants:

* **Baseline** — the agent solves the task from the scenario's starting state, as authored.
* **With context** — pass `--context` at run time with the files or plugin under test, and the same task is solved with that context injected on top.

Comparing the two scores is your **agent readiness signal** — whether your skills, rules, and documentation make the agent more effective. With no `--context`, only the baseline runs.
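
To make the comparison concrete, here is a minimal Python sketch of reading the delta between the two variants. The scenario names and scores are hypothetical, and `readiness_deltas` is an illustrative helper rather than part of Tessl; it only shows the arithmetic behind the signal.

```python
# Hypothetical per-scenario scores (0-100) for the two variants of one run.
baseline = {"add-fraud-report": 55.0, "fix-date-parsing": 70.0}
with_context = {"add-fraud-report": 80.0, "fix-date-parsing": 65.0}


def readiness_deltas(baseline, with_context):
    """Return with-context minus baseline, per scenario present in both."""
    return {
        name: with_context[name] - baseline[name]
        for name in baseline
        if name in with_context
    }


deltas = readiness_deltas(baseline, with_context)
for name, delta in deltas.items():
    direction = "helps" if delta > 0 else "hurts" if delta < 0 else "no change"
    print(f"{name}: {delta:+.1f} points ({direction})")

average = sum(deltas.values()) / len(deltas)
print(f"average delta: {average:+.1f} points")
```

A positive delta suggests the context helps on that scenario and a negative one suggests it gets in the way. Looking at scenarios individually, not only at the average, shows where the context is working against the agent.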

## Prerequisites

* Tessl [installed](/introduction-to-tessl/installation.md) (latest version)
* Logged into Tessl
* Access to a [workspace](/reference/workspaces.md) (you must be at least a [Member](/administrators/roles.md))
* Your GitHub or GitLab account connected in workspace settings
* A Tessl project linked to the repository you want to evaluate

***

## Shortcut: use the eval skills

If you'd prefer to have an AI agent walk you through the process interactively, three Tessl skills cover the full pipeline:

* [**tessl-labs/eval-setup**](https://tessl.io/registry/tessl-labs/eval-setup) — guides you through commit selection, scenario generation, multi-agent configuration, and your first eval run
* [**tessl-labs/eval-improve**](https://tessl.io/registry/tessl-labs/eval-improve) — analyzes your results, classifies failing criteria, applies targeted fixes to your context files, and re-runs to verify improvement
* [**tessl-labs/scenarios-review**](https://tessl.io/registry/tessl-labs/scenarios-review) — audits your scenario suite for rubric quality, task clarity, coverage gaps, and difficulty spread, and prints a concise quality report

```sh
tessl install tessl-labs/eval-setup
tessl install tessl-labs/eval-improve
tessl install tessl-labs/scenarios-review
```

Once installed, invoke them with `/eval-setup`, `/eval-improve`, or `/scenarios-review` in your AI coding agent.

***

## Step 1: (Optional) Browse commits and pick what to evaluate

A **scenario** defines a task for an agent, the starting state of the codebase, and a scoring rubric to evaluate the agent's output.

Scenarios can be [written by hand](#writing-your-own-scenarios). If you already have them, skip to [Step 5](#step-5-create-a-project). If you already have specific commit hashes or PR numbers in mind, skip straight to [Step 2](#step-2-optional-generate-scenarios-from-commits-or-prs). Otherwise, use `tessl repo select-commits` to browse recent commits and choose which ones to turn into scenarios. This command is not listed in `tessl --help` but works when invoked directly.

```sh
$ tessl repo select-commits <org/repo>
```

**Useful flags:**

| Flag                  | Example               | Description                       |
| --------------------- | --------------------- | --------------------------------- |
| `--keyword`           | `--keyword feat`      | Filter by commit message keyword  |
| `--author`            | `--author "Alice"`    | Filter by author name             |
| `--since` / `--until` | `--since 2026-01-01`  | Date range (YYYY-MM-DD)           |
| `--count` / `-n`      | `--count 20`          | Number of commits to show (1–100) |
| `--workspace` / `-w`  | `--workspace engteam` | Required outside interactive mode |

**Output:** a table of `Hash | Date | Author | Message`. Copy the hashes you want to pass to the next step.

**Prerequisite:** your GitHub or GitLab account must be connected in workspace settings. If it isn't, the error message includes a direct link to the settings page.

***

## Step 2: (Optional) Generate scenarios from commits or PRs

Tessl analyzes the commit diffs and generates a set of task scenarios. You can provide either commit hashes or pull request numbers:

```sh
# From commits
$ tessl scenario generate <org/repo> --commits <hash1>,<hash2>

# From PRs
$ tessl scenario generate <org/repo> --prs <number1>,<number2>
```

**Flags:**

| Flag                 | Example                   | Description                                   |
| -------------------- | ------------------------- | --------------------------------------------- |
| `--commits`          | `--commits abc123,def456` | Comma-separated commit hashes                 |
| `--prs`              | `--prs 42,107`            | Comma-separated PR numbers                    |
| `--context`          | `--context "*.mdc,*.md"`  | Glob patterns identifying your context files  |
| `--workspace` / `-w` | `--workspace engteam`     | Required outside interactive mode             |
| `--json`             |                           | Output generation IDs as JSON without polling |

### The --context flag

`--context` tells Tessl which files in your repo are context files — skills, rules, documentation, etc. The patterns are stored in each generated `scenario.json` as `fixtures.codebase.exclude`, which **strips those files from the commit** so the scenario's starting state is the repo without them.

To compare with and without that context, pass `tessl eval run --context` with a glob matching the same files when you run the eval (Step 6) — it injects them back so you can measure the delta.

When omitted, Tessl defaults to: `*.mdc`, `*.md`, `tile.json`, `tessl.json`, `.tessl/`, `.tessl-plugin/plugin.json`
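
As a rough way to preview which files the default patterns would strip, the Python sketch below checks repository paths against them. The matching rules are an assumption made for illustration (Tessl's actual glob semantics are not documented on this page), and `is_context_file` is a hypothetical helper.

```python
from fnmatch import fnmatch
from pathlib import PurePosixPath

# Default patterns listed above.
DEFAULT_CONTEXT_PATTERNS = [
    "*.mdc", "*.md", "tile.json", "tessl.json", ".tessl/", ".tessl-plugin/plugin.json",
]


def is_context_file(path, patterns=DEFAULT_CONTEXT_PATTERNS):
    """Approximate check for whether a repo-relative path counts as context.

    Assumed semantics (not confirmed by Tessl): a trailing "/" matches a
    directory anywhere in the path, a pattern containing "/" matches the
    full relative path, and any other pattern matches the file name only.
    """
    p = PurePosixPath(path)
    for pattern in patterns:
        if pattern.endswith("/"):
            if pattern.rstrip("/") in p.parts[:-1]:
                return True
        elif "/" in pattern:
            if fnmatch(path, pattern):
                return True
        elif fnmatch(p.name, pattern):
            return True
    return False


for candidate in ["CLAUDE.md", "docs/guide.md", ".tessl/cache/index.bin", "src/app.py"]:
    print(candidate, is_context_file(candidate))
```

If your context files use other names or extensions, pass an explicit `--context` value instead of relying on the defaults.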

Generation runs server-side. The CLI polls until complete. If you press Ctrl-C, the job keeps running; check on it later with `tessl scenario list`.

### Output

When generation completes, the CLI prints a table with the **Scenario ID** (the ID you pass to `tessl scenario download`) and either the source commit or PR:

```
# From --commits
✔ Generated 3 scenarios
Scenario ID                           Commit ID
019cb3cf-5f2b-776d-aead-5714d74ef567  f9256db
019cb3d1-10fc-755e-af3f-c079a32d8c49  a13eb72
019cb3d3-abcd-1234-beef-cafe00000000  3cd6e7b

# From --prs
✔ Generated 1 scenario
Scenario ID                           PR
019cb9f3-e169-73fd-b0e4-5943e244b915  PR #1234
```

Each commit or PR produces its own generation with its own Scenario ID. Keep these IDs handy — you'll need them in Step 4 if you passed multiple commits or PRs.

***

## Step 3: (Optional) Review the generation

*Applies only if you used `tessl scenario generate` in Step 2.*

Before downloading, you can inspect what was generated.

```sh
# List recent generations
$ tessl scenario list

# Inspect a specific generation
$ tessl scenario view <id>

# Inspect the most recent generation
$ tessl scenario view --last
```

`tessl scenario list` shows a table of ID, Workspace, Status, Created By, and Created. `tessl scenario view` shows metadata and a table of generated scenarios with titles and checklist item counts.

***

## Step 4: (Optional) Download scenarios to disk

*Applies only if you used `tessl scenario generate` in Step 2. If you wrote your own scenarios by hand, skip to* [*Step 5*](#step-5-create-a-project)*.*

```sh
$ tessl scenario download --last
```

Or with a specific ID:

```sh
$ tessl scenario download <id>
```

{% hint style="info" %}
`--last` downloads scenarios from the single most recent generation. If you passed multiple commits to `tessl scenario generate`, each commit produces its own generation with its own ID — `--last` will only get the most recent one. Use `tessl scenario list` to find the other IDs and download each separately.
{% endhint %}
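
When several generations need downloading, a small script can build one command per ID. The Python sketch below only assembles and prints the commands; `download_commands` is a hypothetical helper, and the IDs are the example values from the generation output shown earlier.

```python
import shlex


def download_commands(generation_ids, output_dir="evals"):
    """Build one `tessl scenario download` command per generation ID."""
    return [
        ["tessl", "scenario", "download", gen_id, "--output", output_dir]
        for gen_id in generation_ids
    ]


# Example IDs copied from the sample generation output; replace with your own.
ids = [
    "019cb3cf-5f2b-776d-aead-5714d74ef567",
    "019cb3d1-10fc-755e-af3f-c079a32d8c49",
]

for command in download_commands(ids):
    print(shlex.join(command))
```

To execute the commands rather than print them, each list can be passed to `subprocess.run(command, check=True)`. With the default `merge` strategy, each download is added alongside the scenarios already in the output directory.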

**Flags:**

| Flag                | Description                                                                               |
| ------------------- | ----------------------------------------------------------------------------------------- |
| `--last`            | Download from the most recent generation                                                  |
| `--output` / `-o`   | Output directory (default: `evals`)                                                       |
| `--strategy` / `-s` | `merge` (default) adds alongside existing scenarios; `replace` clears the directory first |

**What lands on disk:**

```
evals/
  <7-char-hash>-<slug>/
    task.md          ← task brief shown to the agent
    criteria.json    ← weighted checklist rubric
    scenario.json    ← fixture with repo URL, commit ref, and context exclude patterns
```

You can edit `task.md` and `criteria.json` before running — your edits are picked up at run time. See [File formats](#file-formats) below.
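
Before running, it can help to confirm that every scenario folder still has the three files listed above, especially after manual edits. The following Python sketch does that check; `check_suite` and the folder names in the demo are hypothetical, and the demo uses a temporary directory so it runs anywhere.

```python
import tempfile
from pathlib import Path

REQUIRED_FILES = ("task.md", "criteria.json", "scenario.json")


def missing_files(scenario_dir: Path) -> list[str]:
    """Return the required file names that are absent from one scenario folder."""
    return [name for name in REQUIRED_FILES if not (scenario_dir / name).is_file()]


def check_suite(evals_dir: str) -> dict[str, list[str]]:
    """Map each scenario folder under evals_dir to the files it is missing."""
    root = Path(evals_dir)
    return {
        child.name: missing_files(child)
        for child in sorted(root.iterdir())
        if child.is_dir()
    }


# Demo on a throwaway directory; call check_suite("evals") for real use.
with tempfile.TemporaryDirectory() as tmp:
    complete = Path(tmp) / "abc1234-complete-scenario"  # hypothetical folder names
    partial = Path(tmp) / "def5678-partial-scenario"
    complete.mkdir()
    partial.mkdir()
    for name in REQUIRED_FILES:
        (complete / name).write_text("placeholder", encoding="utf-8")
    (partial / "task.md").write_text("placeholder", encoding="utf-8")
    report = check_suite(tmp)

print(report)
```

An empty list for a folder means all three files are present. This checks only that the files exist, not that their contents are valid.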

***

## Step 5: Create a project

Before you run an eval for the first time, make sure a Tessl project is created and linked.

From the directory where you want to run your evals, run the following:

```bash
$ tessl project create <project-name> --workspace <workspace-name-or-id>
```

This updates the `name` field in your `tessl.json` to show that the project now belongs to the workspace you provided:

<pre class="language-diff"><code class="lang-diff"> {
-  "name": "my-project",
+  "name": "engteam/my-project",
 }
</code></pre>

***

## Step 6: Run the eval

CLI-triggered eval runs require this directory to be linked to a Tessl project first. See [Projects overview](/projects/overview.md).

If this repository is not linked yet, the CLI may ask you to create or link a project before the run starts. See [Manage projects from the CLI](/projects/manage-projects-from-the-cli.md).

Once this directory is linked to a Tessl project, eval runs attach to that project instead of being treated as isolated local runs. This lets Tessl recognise the same work across repeated runs on the same repository.

Run from the parent directory of your scenarios folder, passing `--context` with the context files you want to measure:

```sh
$ tessl eval run ./evals/ --context 'src/**/*.knowledge.md'
```

This runs the baseline (each scenario from its stripped starting state) and a with-context variant (the same state with your `--context` files injected back), so you can compare the two. Match the `--context` glob to the files each scenario excludes. With no `--context`, only the baseline runs. The run uses the workspace of the linked Tessl project — there is no workspace flag.

### Running with a specific agent

Override the agent and model with `--agent`:

```sh
$ tessl eval run ./evals/ --agent claude:claude-opus-4-6
```

Run `tessl eval run --list-agents` to see the supported agents and models and which is the default.

An eval run takes a single `--agent`. To compare models, run the command once per model and compare the results:

```sh
$ tessl eval run ./evals/ --agent claude:claude-sonnet-4-6
$ tessl eval run ./evals/ --agent claude:claude-opus-4-6
```

### Output

```
✔ Ran 5 scenarios with 1 agent
  ✔ claude:claude-sonnet-4-6  019c81f5-ac2a-746f-bed1-78a875a0480d

  View in browser: https://tessl.io/eval-runs/019c81f5-ac2a-746f-bed1-78a875a0480d
ℹ Run tessl eval view <id> to see details for a run.
```

Ctrl-C detaches without cancelling — runs continue server-side. The CLI prints each run ID so you can check progress later with `tessl eval view <id>` or `tessl eval list`.

{% hint style="info" %}
**Results can vary between runs.** Because you're evaluating an AI agent, scores are not fully deterministic — the same scenario run twice may produce slightly different results. Treat scores as signals rather than exact measurements, and expect some run-to-run variance when comparing results.
{% endhint %}
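
One way to account for this variance is to repeat each variant a few times and compare averages. The Python sketch below uses hypothetical scores, and the threshold it applies is a rough rule of thumb assumed for illustration, not something Tessl computes or recommends.

```python
from statistics import mean, stdev


def summarize(scores):
    """Return (mean, sample standard deviation) for repeated runs of one variant."""
    return mean(scores), stdev(scores)


# Hypothetical scores from three repeated runs of each variant.
baseline_runs = [60.0, 65.0, 70.0]
with_context_runs = [78.0, 80.0, 82.0]

base_mean, base_sd = summarize(baseline_runs)
ctx_mean, ctx_sd = summarize(with_context_runs)
delta = ctx_mean - base_mean

# Rough heuristic (an assumption, not a Tessl rule): trust the delta only when
# it is larger than the run-to-run spread of both variants combined.
likely_real = delta > base_sd + ctx_sd

print(f"baseline {base_mean:.1f} (sd {base_sd:.1f}), with context {ctx_mean:.1f} (sd {ctx_sd:.1f})")
print(f"delta {delta:+.1f} points, larger than the combined spread: {likely_real}")
```

If the delta is about the same size as the spread, more runs or more scenarios are probably needed before drawing a conclusion.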

***

## Checking status and viewing results

| Command                  | Description                               |
| ------------------------ | ----------------------------------------- |
| `tessl eval list`        | List all eval runs                        |
| `tessl eval list --mine` | Only your runs                            |
| `tessl eval view <id>`   | Detailed results for a specific run       |
| `tessl eval view --last` | Detailed results for your most recent run |
| `tessl eval retry <id>`  | Re-run a failed eval                      |

If you lose a run ID, `tessl eval list` will find it.

***

## File formats

### scenario.json

Generated by `tessl scenario download`. Declares the commit fixture the eval run starts from.

```json
{
  "fixtures": {
    "codebase": {
      "type": "commit",
      "repoUrl": "https://github.com/org/repo.git",
      "ref": "24829180ba1fafb86b...",
      "exclude": ["*.mdc", "*.md", ".tessl-plugin/plugin.json", "tile.json", "tessl.json", ".tessl/"]
    }
  }
}
```

* `fixtures.codebase.ref` — the parent commit hash (the starting state for the agent)
* `fixtures.codebase.exclude` — context patterns stripped from the commit to form the scenario's starting state
* `fixtures.codebase.repoUrl` — full clone URL

For the full schema (including `description`, `include`, `setup`, the `directory` fixture type, and conventional defaults), see [scenario.json](/reference/configuration.md#scenario-json) in the configuration reference.

### task.md

Free-form markdown. This is the **only file the agent sees** — it has no access to `criteria.json`. Typically structured with Problem, Expected Behavior, and Acceptance Criteria sections. You can edit this freely before running.

### criteria.json

Defines how the agent's solution is scored.

```json
{
  "context": "Brief description of what is being evaluated.",
  "type": "weighted_checklist",
  "checklist": [
    {
      "name": "fixes_syntax_error",
      "description": "The solution adds the missing closing parenthesis...",
      "max_score": 1,
      "category": "INTENT"
    }
  ]
}
```

Required fields: `context`, `type` (`"weighted_checklist"`), `checklist` (array with `name`, `description`, `max_score`).
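
If you edit `criteria.json` by hand, a quick check for the required fields can catch mistakes before a run. This Python sketch covers only the fields named above; `criteria_problems` is a hypothetical helper, not a Tessl validator, and the full schema may enforce more.

```python
REQUIRED_TOP_LEVEL = ("context", "type", "checklist")
REQUIRED_ITEM_FIELDS = ("name", "description", "max_score")


def criteria_problems(criteria: dict) -> list[str]:
    """Return readable problems; an empty list means the required fields are present."""
    problems = [f"missing field: {f}" for f in REQUIRED_TOP_LEVEL if f not in criteria]
    if "type" in criteria and criteria["type"] != "weighted_checklist":
        problems.append('type must be "weighted_checklist"')
    checklist = criteria.get("checklist", [])
    if not isinstance(checklist, list):
        return problems + ["checklist must be an array"]
    for index, item in enumerate(checklist):
        for f in REQUIRED_ITEM_FIELDS:
            if f not in item:
                problems.append(f"checklist[{index}] missing field: {f}")
    return problems


# Example with a deliberately incomplete checklist item (no description).
example = {
    "context": "Brief description of what is being evaluated.",
    "type": "weighted_checklist",
    "checklist": [{"name": "fixes_syntax_error", "max_score": 1}],
}
print(criteria_problems(example))
```

To check a file on disk, load it with `json.load` and pass the resulting dictionary to the function.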

**Checklist categories:**

| Category      | Description                                                                                            |
| ------------- | ------------------------------------------------------------------------------------------------------ |
| `INTENT`      | Core feature or behavior the change introduces; verifies the solution addresses what the task requests |
| `DESIGN`      | Architectural or structural choices                                                                    |
| `MUST_NOT`    | Things the solution should avoid or never do                                                           |
| `MINIMALITY`  | Appropriate scope of changes — solution does what's needed without overreaching                        |
| `REUSE`       | Leveraging existing utilities or patterns rather than reimplementing                                   |
| `INTEGRATION` | How the solution connects with existing code                                                           |
| `EDGE_CASE`   | Boundary conditions handled correctly                                                                  |

**Scoring:** `(sum of scores / sum of max_scores) × 100`. The LLM grader can award partial credit.
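
The formula can be worked through in a few lines of Python. In this sketch the checklist items, their weights, and the `awarded` mapping are hypothetical; the grader's real output format is not documented on this page, so the mapping is only an assumed shape for illustration.

```python
def checklist_score(criteria: dict, awarded: dict) -> float:
    """Apply (sum of scores / sum of max_scores) * 100 to a weighted checklist."""
    items = criteria["checklist"]
    total_max = sum(item["max_score"] for item in items)
    if total_max == 0:
        raise ValueError("checklist has no scorable items")
    # Items the grader did not score count as zero.
    total = sum(awarded.get(item["name"], 0) for item in items)
    return total / total_max * 100


# Hypothetical rubric with three weighted items.
criteria = {
    "context": "Brief description of what is being evaluated.",
    "type": "weighted_checklist",
    "checklist": [
        {"name": "fixes_syntax_error", "description": "Hypothetical item", "max_score": 1},
        {"name": "reuses_existing_helper", "description": "Hypothetical item", "max_score": 2},
        {"name": "avoids_unrelated_changes", "description": "Hypothetical item", "max_score": 1},
    ],
}

# Hypothetical grades, including partial credit on the second item.
awarded = {"fixes_syntax_error": 1, "reuses_existing_helper": 1.5}

score = checklist_score(criteria, awarded)
print(f"{score:.1f}")
```

Here the awarded points sum to 2.5 out of a possible 4, giving a score of 62.5. Raising an item's `max_score` increases its share of the final score.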

***

## Writing your own scenarios

You can hand-author scenarios without using `tessl scenario generate`:

```
evals/
  my-custom-scenario/
    task.md
    criteria.json
    scenario.json
```

`fixtures.codebase.ref` should be the parent of the ground truth commit. `fixtures.codebase.exclude` defines what gets stripped from the commit to form the scenario's starting state.
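
As a starting point, the layout above can be scaffolded with a short script. This Python sketch writes the three files using only the fields described in [File formats](#file-formats); `write_scenario`, the checklist item, and the placeholder ref are hypothetical, and the demo writes to a temporary directory.

```python
import json
import tempfile
from pathlib import Path


def write_scenario(root, name, repo_url, parent_ref, task, checklist, exclude):
    """Create the three scenario files under root/name and return the folder."""
    scenario_dir = Path(root) / name
    scenario_dir.mkdir(parents=True, exist_ok=True)

    scenario = {
        "fixtures": {
            "codebase": {
                "type": "commit",
                "repoUrl": repo_url,
                "ref": parent_ref,  # parent of the ground truth commit
                "exclude": list(exclude),
            }
        }
    }
    criteria = {
        "context": "Brief description of what is being evaluated.",
        "type": "weighted_checklist",
        "checklist": checklist,
    }

    (scenario_dir / "task.md").write_text(task, encoding="utf-8")
    (scenario_dir / "criteria.json").write_text(json.dumps(criteria, indent=2), encoding="utf-8")
    (scenario_dir / "scenario.json").write_text(json.dumps(scenario, indent=2), encoding="utf-8")
    return scenario_dir


# Demo in a throwaway directory; use root="evals" in a real repository.
with tempfile.TemporaryDirectory() as tmp:
    folder = write_scenario(
        root=tmp,
        name="my-custom-scenario",
        repo_url="https://github.com/org/repo.git",
        parent_ref="<parent-commit-hash>",  # placeholder, fill in your own
        task="# Problem\n\nDescribe the task here.\n",
        checklist=[{
            "name": "example_criterion",  # hypothetical checklist item
            "description": "Hypothetical criterion describing the expected change.",
            "max_score": 1,
            "category": "INTENT",
        }],
        exclude=["*.mdc", "*.md"],
    )
    created = sorted(p.name for p in folder.iterdir())
    saved = json.loads((folder / "scenario.json").read_text(encoding="utf-8"))

print(created)
```

The generated files are only a skeleton. The task brief and rubric still need to describe a real change for the scores to mean anything.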

***

## Quick reference

```sh
# Option A: generate scenarios from commits
$ tessl repo select-commits org/repo --keyword feat --workspace <workspace>
$ tessl scenario generate org/repo --commits abc123,def456 --workspace <workspace>
$ tessl scenario download --last

# Option A (alt): generate scenarios from PRs
$ tessl scenario generate org/repo --prs 42,107 --workspace <workspace>
$ tessl scenario download --last

# Option B: write your own scenarios by hand
# Create evals/<name>/{task.md,criteria.json,scenario.json} — see "Writing your own scenarios"

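# Create and link a Tessl project (before your first run)
$ tessl project create <project-name> --workspace <workspace>

# Then run (same either way)
$ tessl eval run ./evals/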

# Run with a specific model
$ tessl eval run ./evals/ --agent claude:claude-opus-4-6

# Compare Claude models (one run per model)
$ tessl eval run ./evals/ --agent claude:claude-sonnet-4-6
$ tessl eval run ./evals/ --agent claude:claude-opus-4-6
$ tessl eval run ./evals/ --agent claude:claude-haiku-4-5

# Check status
$ tessl eval list --mine
$ tessl eval view --last
```

