Evaluate skill quality using scenarios


This page covers scenario-based evals for skills and tiles. If you want to run evals against your codebase using real commit diffs, see Evaluate your codebase instead.

Tessl lets you run end-to-end task evaluations for your skills directly from the CLI. You generate a set of scenarios, run an agent against them, and see how well it performs — with and without your skill injected. This workflow is designed for fast, repeatable iteration as you develop and refine a skill, without building your own eval harness.

TL;DR

Generate scenarios against your skill, then test how the agent does with and without your skill present. For example:

  • You have a skill that describes how to communicate with a system.

  • Generate a set of scenarios against that skill using Tessl, or write your own, covering how to communicate with that system.

  • See whether the agent can complete each scenario's task with and without the skill, to determine how effective the skill is.

Evals vs Reviews

The Review Skills feature checks your skills against best practice, whereas Evaluations generate scenarios and then validate the quality of the skill by testing whether agents perform better on those scenarios with the skill present. Use Lint, Review, and scenario-based evals together to make an effective tile.

What you can do with scenario based evals

With self-serve evaluations, you can:

  • Generate evaluation scenarios for a skill, or write your own

  • Run an agent against those scenarios with and without the skill injected

  • Label runs, choose which Claude model to evaluate against, and compare results

Prerequisites

  • Logged into Tessl.

  • Tessl installed (latest version).

  • Access to a workspace (you must be at least a Publisher).

  • A skill packaged in a tile available locally.


Step 1: Generate evaluation scenarios

There are three ways to create scenarios:

Option A: CLI (quickest — requires an existing tile)

Generation runs server-side. Check progress from the CLI, then download the scenarios to disk once generation completes.


You can also generate scenarios from repository commits instead of a tile — see tessl scenario generate for full options.

Option B: Guided flow with the scenario creator skill

This approach uses a Tessl-provided skill to convert a standalone skill to a tile and generate scenarios in one guided flow.

First, install the scenario creator skill in your project.

Then prompt your agent (e.g. Claude) with something like "Create evaluation scenarios for <my_skill>", where <my_skill> is the name or path to your skill. This will:

  1. Verify Tessl installation

  2. Convert your skill to a tile (if it is not already in one)

  3. Generate an initial set of scenarios

Option C: Write scenarios by hand

Create the directory structure manually. The only constraint is maintaining the structure of the three files per scenario described in Step 2 (additional files will be ignored).

If you need starting sample data or fixtures, those can be inlined into task.md.
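A minimal scaffold for one hand-written scenario might look like this (the evals/ location matches Step 4; the scenario name and all file contents below are illustrative placeholders, not a documented schema):

```shell
# One scenario folder containing the three required files
# ("lookup-user" and everything written below are made-up examples)
mkdir -p evals/lookup-user

# capability.txt: which capability of the skill this scenario tests
echo "Querying user records from the system" > evals/lookup-user/capability.txt

# task.md: the task brief shown to the agent (sample data can be inlined here)
cat > evals/lookup-user/task.md <<'EOF'
Fetch the record for user 42 and report the email address on file.
EOF

# criteria.json: the scoring rubric (left as an empty placeholder;
# copy the real shape from a generated scenario)
echo '{}' > evals/lookup-user/criteria.json
```

Each additional scenario gets its own folder under evals/ with the same three files.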


Step 2: (Optional) Review and edit scenarios

Each scenario contains:

  • capability.txt — which capability of the skill this scenario tests

  • task.md — the task brief shown to the agent

  • criteria.json — the scoring rubric

Tessl auto-generates these, but for best results review them yourself. You're the ultimate authority on what your skill is intended to do and what success looks like.


Automatic scenario generation creates criteria that reflect the instructions in the skill. Review these to check they reflect the outcomes you actually want the skill to achieve.


Step 3: Check tile.json

Open tile.json in your skill folder and ensure the workspace name is set to a workspace you have publisher rights on, and that you have chosen a tile name.

Note that if the tile.json names a workspace you do not have access to, or for which you lack the correct permissions, you will not be able to run the evals.
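As a rough sketch, the relevant fields might look something like this (the field names here are assumptions; check the tile.json generated for your own tile rather than copying this):

```json
{
  "name": "my-skill",
  "workspace": "my-team",
  "private": true
}
```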


Step 4: Run the evaluation

Pass the path to your tile.json. The CLI looks for an evals/ directory inside the same directory as the tile.

For example, if your tile lives at tiles/my-skill/tile.json, your scenarios must be at tiles/my-skill/evals/. If you used Option A above and downloaded scenarios to your project root, move them first.
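A sketch of that move, assuming the illustrative paths above (the if-guard just makes it safe to re-run):

```shell
# Move scenarios downloaded at the project root so they sit next to tile.json
mkdir -p tiles/my-skill
if [ -d evals ]; then
  mv evals tiles/my-skill/evals
fi
```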

You can attach a label to a run to help identify it later in tessl eval list or the Tessl web UI.

By default, evals run using Claude Sonnet. You can specify a different Claude model using --agent.

Supported Claude models:

| Model | Notes |
| --- | --- |
| claude-sonnet-4-6 | Default |
| claude-opus-4-6 | Most capable |
| claude-sonnet-4-5 | |
| claude-opus-4-5 | |
| claude-haiku-4-5 | Fastest, lowest cost |

This is useful if you want to evaluate your skill against the specific model you use in production, or to compare how your skill performs across different Claude models. Note that tile evals support one --agent per run — to compare models, run the command once per model and compare the results.
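Since each run accepts a single --agent, a comparison is just one invocation per model. A sketch of the loop (the echo stands in for your actual eval command, which is not shown here):

```shell
# One eval run per model; replace echo with your real tessl eval invocation
models="claude-haiku-4-5 claude-sonnet-4-6 claude-opus-4-6"
for model in $models; do
  echo "would run eval with --agent $model"
done
```

Labeling each run with the model name makes the results easy to tell apart later.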


Benchmarking across models made easier

To run evals across Haiku, Sonnet, and Opus in one guided flow — with automatic side-by-side comparison and gap diagnosis — use the review-model-performance skill.

Then ask your agent: "Run model comparison evals"

You'll receive a URL in the terminal output to monitor progress and view results in the Tessl web UI.

The id needed for the next step can be found in the URL (e.g. 019c4791-9eec-7458-b28a-6c94405a3d38).
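For instance, if the URL ends with the run id (the URL shape here is an assumption), plain shell can pull it out:

```shell
# Hypothetical results URL; the run id is the last path segment
url="https://app.tessl.io/eval-runs/019c4791-9eec-7458-b28a-6c94405a3d38"
id="${url##*/}"   # strip everything up to the final "/"
echo "$id"        # prints 019c4791-9eec-7458-b28a-6c94405a3d38
```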


Step 5: Review your results

Eval runs can take time. Use any of these to check status:

| Command | Description |
| --- | --- |
| Visit the URL from Step 4 | Direct link to results |
| tessl eval view <id> | View a specific eval run |
| tessl eval view --last | List last eval run with IDs and status |
| tessl eval list | List all eval runs with IDs and status |
| tessl eval view <id> --json | Structured details on the eval run |
| tessl eval retry <id> | Retry a failed eval run |

If you lose the id for your eval run, run tessl eval list to find it again.


It's not uncommon for an attempt to fail to solve an eval scenario. You can either adjust the scenario or retry it using tessl eval retry <id>.


Step 6: Publish your Tile (Optional)

Since your skill was converted to a tile, you can now manage it at the tile level using the CLI.

To publish your tile to the Tessl registry (a new eval will only be run if you have not run an eval previously, or if the content of your tile has changed since the last eval run):

tessl tile publish

To publish without running a new eval:

tessl tile publish --skip-evals

Note: Tiles created through this flow are published as private by default. To make your tile public, set "private": false in tile.json.

For more tile management options, run:

  • tessl tile --help
