# Evaluate skill quality using scenarios

{% hint style="info" %}
This page covers scenario-based evals for **skills and tiles**. If you want to run evals against your codebase using real commit diffs, see [Evaluate your codebase](/evaluate/evaluating-your-codebase.md) instead.
{% endhint %}

Tessl lets you run end-to-end task evaluations for your skills directly from the CLI. You generate a set of scenarios, run an agent against them, and see how well it performs — with and without your skill injected. This workflow is designed for fast, repeatable iteration as you develop and refine a skill, without building your own eval harness.

## TL;DR

Generate scenarios against your skill, then test how the agent performs with and without your skill present. For example:

* You have a skill that describes how to communicate with a system.
* Using Tessl, generate a set of scenarios against that skill (or write your own) that exercise communicating with that system.
* Check whether the agent can complete each scenario's task with and without the skill to determine how effective the skill is.

## Evals vs Reviews

The **Review Skills** feature checks a skill against best practices, whereas **Evaluations** generate *scenarios* and validate the quality of the skill by testing whether agents perform better on those scenarios when the skill is present.

Use **Lint**, **Review**, and **scenario-based evals** together to build an effective tile.

### What you can do with scenario-based evals

With self-serve evaluations, you can:

* Generate evaluation scenarios automatically using Tessl's [scenario generation skill](https://tessl.io/registry/tessl-labs/tessl-skill-eval-scenarios)
* Generate scenarios directly from the CLI with [`tessl scenario generate`](/reference/cli-commands.md#tessl-scenario-generate)
* Define custom evaluation scenarios manually
* Run end-to-end evaluations from the CLI
* View and track evaluation results in the Tessl web UI

### Prerequisites

* You are logged in to Tessl.
* Tessl is [installed](/introduction-to-tessl/installation.md) (latest version).
* You have access to a [workspace](/reference/workspaces.md) with at least the [Publisher](/administrators/roles.md) role.
* A skill packaged in a tile is available locally.

***

### Step 1: Generate evaluation scenarios

There are three ways to create scenarios:

#### Option A: CLI (quickest — requires an existing tile)

```sh
tessl scenario generate <path/to/tile> --count=5
```

For tile-based generation, the workspace is read from `tile.json`. Don't pass `--workspace`. Make sure `tile.json` points at a workspace you have [publisher](/administrators/roles.md) rights on before running (see Step 3).

{% hint style="warning" %}
If you created the tile via `tessl skill import` without specifying a workspace, `tile.json` will default to `"name": "local/<skill>"` and `tessl scenario generate` will fail with `404 Workspace not found`. Update the workspace in `tile.json` (see Step 3) before running generate.
{% endhint %}

Generation runs server-side. Check progress with:

```sh
tessl scenario list --mine
```

Then download to disk once complete:

```sh
tessl scenario download --last
```

{% hint style="warning" %}
`tessl scenario download` places the `evals/` directory relative to your **current working directory**, but `tessl eval run <path/to/tile>` looks for `evals/` inside the **tile's directory**. If you ran the download from your project root, move the folder before running evals:

```sh
mv ./evals/ <path/to/tile-dir>/evals/
```

{% endhint %}

You can also generate scenarios from repository commits instead of a tile — see [`tessl scenario generate`](/reference/cli-commands.md#tessl-scenario-generate) for full options.

#### Option B: Agent-assisted (recommended if starting from a standalone skill file)

This approach uses a Tessl-provided skill to handle converting a standalone skill to a tile and generating scenarios in one guided flow.

First, install the scenario creator skill in your project:

```sh
tessl install tessl-labs/tessl-skill-eval-scenarios
```

Then prompt your agent (e.g. Claude):

```
"Create eval scenarios for <my_skill>"
```

Where `<my_skill>` is the name or path to your skill. This will:

1. Verify Tessl installation
2. Convert your skill to a tile (if it is not already in one)
3. Generate an initial set of scenarios

#### Option C: Write scenarios by hand

Create the directory structure manually. The only requirement is that each scenario directory contains the three files shown below (additional files are ignored):

```
evals/
├── instructions.json
├── scenario-1/
│     ├── task.md
│     ├── criteria.json
│     └── capability.txt
├── scenario-2/
├── summary_infeasible.json
└── summary.json

<your-skill-name>/
├── SKILL.md
└── tile.json
```

If a scenario needs starting sample data or fixtures, inline them in `task.md`.
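The layout above can be scaffolded from the shell. This is a minimal sketch, assuming a hypothetical skill name (`my-skill`) and a single scenario; the file contents are placeholders you would replace by hand:

```shell
# Scaffold the evals/ layout for one hand-written scenario.
# "my-skill" and all file contents below are placeholders.
mkdir -p evals/scenario-1 my-skill
touch evals/instructions.json evals/summary.json evals/summary_infeasible.json
printf '%s\n' '# Task' 'Describe the task for the agent here.' > evals/scenario-1/task.md
printf '%s\n' '{}' > evals/scenario-1/criteria.json
printf '%s\n' 'Name the capability this scenario tests.' > evals/scenario-1/capability.txt
touch my-skill/SKILL.md my-skill/tile.json
```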

***

### Step 2: (Optional) Review and edit scenarios

Each scenario contains:

* `capability.txt` — which capability of the skill this scenario tests
* `task.md` — the task brief shown to the agent
* `criteria.json` — the scoring rubric

Tessl auto-generates these, but for best results review them yourself. You're the ultimate authority on what your skill is intended to do and what success looks like.

{% hint style="info" %}
Automatic scenario generation creates criteria that reflect the instructions in the skill. Review these to check they reflect the outcomes you actually want the skill to achieve.
{% endhint %}

***

### Step 3: Check tile.json

Open `tile.json` in your skill folder and make sure the `name` field points at a workspace you have [publisher](/administrators/roles.md) rights on and includes your chosen tile name.

```json
// Before
{ "name": "placeholder/repo-flow-mapper", ... }

// After
{ "name": "mycompany/repo-flow-mapper", ... }
```

Note that if the workspace named in `tile.json` is one you cannot access, or one where you lack the required permissions, you will not be able to run the evals.
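If you prefer editing from the command line, the `name` field can be rewritten with `jq` (an assumption: `jq` must be installed; the workspace and tile names here are the same example values as above):

```shell
# Example tile.json with a placeholder workspace name.
printf '%s\n' '{ "name": "placeholder/repo-flow-mapper" }' > tile.json

# Rewrite the workspace portion of the name (example values).
jq '.name = "mycompany/repo-flow-mapper"' tile.json > tile.json.tmp \
  && mv tile.json.tmp tile.json
```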

***

### Step 4: Run the evaluation

Pass the path to your `tile.json`. The CLI looks for an `evals/` directory **inside the same directory as the tile**:

```sh
tessl eval run <path/to/tile>
```

For example, if your tile lives at `tiles/my-skill/tile.json`, your scenarios must be at `tiles/my-skill/evals/`. If you used Option A above and downloaded scenarios to your project root, move them first:

```sh
mv ./evals/ tiles/my-skill/evals/
tessl eval run tiles/my-skill/tile.json
```

You can attach a label to a run to help identify it later in `tessl eval list` or the Tessl web UI:

```sh
tessl eval run <path/to/tile> --label "testing prompt changes"
```

By default, evals run using Claude Sonnet. You can specify a different Claude model using `--agent`:

```sh
tessl eval run <path/to/tile> --agent=claude:claude-sonnet-4-6
tessl eval run <path/to/tile> --agent=claude:claude-opus-4-6
tessl eval run <path/to/tile> --agent=claude:claude-haiku-4-5
```

Supported Claude models:

| Model               | Notes                |
| ------------------- | -------------------- |
| `claude-sonnet-4-6` | Default              |
| `claude-opus-4-6`   | Most capable         |
| `claude-sonnet-4-5` |                      |
| `claude-opus-4-5`   |                      |
| `claude-haiku-4-5`  | Fastest, lowest cost |

This is useful if you want to evaluate your skill against the specific model you use in production, or to compare how your skill performs across different Claude models. Note that tile evals support one `--agent` per run — to compare models, run the command once per model and compare the results.
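A per-model comparison can be scripted as a loop. The sketch below only prints the commands (a dry run); drop the `echo` to execute them, and note that `path/to/tile` is a placeholder:

```shell
# Print one eval command per model (dry run — remove 'echo' to execute).
for model in claude-haiku-4-5 claude-sonnet-4-6 claude-opus-4-6; do
  echo tessl eval run path/to/tile --agent="claude:$model"
done
```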

{% hint style="info" %}
**Benchmarking across models made easier**

To run evals across Haiku, Sonnet, and Opus in one guided flow — with automatic side-by-side comparison and gap diagnosis — use the [`review-model-performance`](https://tessl.io/registry/tessl-labs/review-model-performance) skill:

```sh
tessl install tessl-labs/review-model-performance
```

Then ask your agent: `"Run model comparison evals"`
{% endhint %}

The terminal output includes a URL where you can monitor progress and view results in the Tessl web UI.

<figure><img src="/files/tyh0FpqZ0VMbrI6OTU8u" alt=""><figcaption></figcaption></figure>

The run ID needed in the next step appears in the URL (e.g. `019c4791-9eec-7458-b28a-6c94405a3d38`).

***

### Step 5: Review your results

Eval runs can take time. Use any of these to check status:

| Command                       | Description                            |
| ----------------------------- | -------------------------------------- |
| Visit the URL from Step 4     | Direct link to results                 |
| `tessl eval view <id>`        | View a specific eval run               |
| `tessl eval view --last`      | View the most recent eval run          |
| `tessl eval list`             | List all eval runs with IDs and status |
| `tessl eval view <id> --json` | Structured details on the eval run     |
| `tessl eval retry <id>`       | Retry a failed eval run                |

If you lose the ID of an eval run, run `tessl eval list` to find it again.

**Example output from `tessl eval list`:**

<figure><img src="/files/Jm7WLranclqD17ZbebBC" alt=""><figcaption></figcaption></figure>

It's not uncommon for an attempt to fail to solve an eval scenario. You can either adjust the scenario or retry it using `tessl eval retry <id>`.

***

### Step 6: Publish your Tile (Optional)

Since your skill was converted to a tile, you can now manage it at the tile level using the CLI.

To publish your tile to the Tessl registry (a new eval runs only if you have never run one before, or if the tile's content has changed since the last eval run):

`tessl tile publish`

To publish without running a new eval:

`tessl tile publish --skip-evals`

**Note:** Tiles created through this flow are published as `private` by default. To make your tile public, set `"private": false` in `tile.json`.
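For example, the relevant fragment of a public tile's `tile.json` (other fields omitted, names reused from the example in Step 3):

```json
{
  "name": "mycompany/repo-flow-mapper",
  "private": false
}
```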

For more tile management options, run `tessl tile --help`.


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://docs.tessl.io/evaluate/evaluate-skill-quality-using-scenarios.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
