Evaluate skill quality using scenarios
This page covers scenario-based evals for skills and tiles. If you want to run evals against your codebase using real commit diffs, see Evaluate your codebase instead.
Tessl lets you run end-to-end task evaluations for your skills directly from the CLI. You generate a set of scenarios, run an agent against them, and see how well it performs — with and without your skill injected. This workflow is designed for fast, repeatable iteration as you develop and refine a skill, without building your own eval harness.
TL;DR
Generate scenarios against your skill, then test how the agent performs with and without your skill present. For example:
You have a skill that describes how to communicate with a system.
Using Tessl, generate a set of scenarios against that skill for communicating with that system, or write your own.
Check whether the agent can complete the task in each scenario with and without the skill, to gauge how effective the skill is.
Evals vs Reviews
The Review Skills feature checks a skill against best practices, whereas Evaluations generate scenarios and then validate the quality of the skill by testing whether agents perform better on those scenarios with the skill injected. Use Lint, Review, and scenario-based evals together to make an effective tile.
What you can do with scenario based evals
With self-serve evaluations, you can:
Generate evaluation scenarios automatically using Tessl's scenario generation skill
Generate scenarios directly from the CLI with tessl scenario generate
Define custom evaluation scenarios manually
Run end-to-end evaluations from the CLI
View and track evaluation results in the Tessl web UI
Prerequisites
Logged into Tessl.
Tessl installed (latest version).
A skill packaged in a tile available locally.
Step 1: Generate evaluation scenarios
There are three ways to create scenarios:
Option A: CLI (quickest — requires an existing tile)
Generation runs server-side. Check progress with:
Then download to disk once complete:
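A minimal sketch of the Option A flow, using the two subcommands this page names (the tile path argument is an assumption; check `tessl scenario generate --help` for the exact options):

```shell
# Kick off server-side scenario generation for the tile
# (the positional tile path here is an assumption)
tessl scenario generate tiles/my-skill

# Once generation has completed, download the scenarios to disk
tessl scenario download
```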
tessl scenario download places the evals/ directory relative to your current working directory, but tessl eval run <path/to/tile> looks for evals/ inside the tile's directory. If you ran the download from your project root, move the folder before running evals:
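For example, assuming a hypothetical tile at tiles/my-skill/ and a download run from the project root:

```shell
# Move the downloaded scenarios into the tile's directory,
# where `tessl eval run` expects to find them
mv evals tiles/my-skill/evals
```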
You can also generate scenarios from repository commits instead of a tile — see tessl scenario generate for full options.
Option B: Agent-assisted (recommended if starting from a standalone skill file)
This approach uses a Tessl-provided skill to handle converting a standalone skill to a tile and generating scenarios in one guided flow.
First, install the scenario creator skill in your project:
Then prompt your agent (e.g. Claude):
Where <my_skill> is the name or path to your skill. This will:
Verify Tessl installation
Convert your skill to a tile (if it is not already in one)
Generate an initial set of scenarios
Option C: Write scenarios by hand
Create the directory structure manually. The only constraint is maintaining the structure of the three files per scenario (additional files will be ignored):
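A minimal skeleton for one hand-written scenario, sketched as shell commands. The scenario name handles-timeouts and the contents of each file, including the criteria.json structure, are purely illustrative:

```shell
# Each scenario is a directory under evals/ containing exactly these three files
mkdir -p evals/handles-timeouts

# The capability of the skill this scenario exercises
echo "Recover gracefully from request timeouts" > evals/handles-timeouts/capability.txt

# The task brief shown to the agent (sample data can be inlined here)
cat > evals/handles-timeouts/task.md <<'EOF'
# Task
Call the service and handle a request that times out.
EOF

# The scoring rubric (the JSON structure shown here is an assumption)
cat > evals/handles-timeouts/criteria.json <<'EOF'
{
  "criteria": [
    { "description": "The agent retries the request with backoff" }
  ]
}
EOF
```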
If you need starting sample data or fixtures, those can be inlined into task.md.
Step 2: (Optional) Review and edit scenarios
Each scenario contains:
capability.txt: which capability of the skill this scenario tests
task.md: the task brief shown to the agent
criteria.json: the scoring rubric
Tessl auto-generates these, but for best results review them yourself. You're the ultimate authority on what your skill is intended to do and what success looks like.
Automatic scenario generation creates criteria that reflect the instructions in the skill. Review these to check they reflect the outcomes you actually want the skill to achieve.
Step 3: Check tile.json
Open tile.json in your skill folder, set the workspace name to a workspace you have publisher rights on, and choose a tile name.
Note that if the workspace name in tile.json points to a workspace you do not have access to, or lack the correct permissions on, you will not be able to run the evals.
Step 4: Run the evaluation
Pass the path to your tile.json. The CLI looks for an evals/ directory inside the same directory as the tile:
For example, if your tile lives at tiles/my-skill/tile.json, your scenarios must be at tiles/my-skill/evals/. If you used Option A above and downloaded scenarios to your project root, move them first:
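Putting the above together (paths are illustrative):

```shell
# Only needed if you downloaded scenarios to the project root
mv evals tiles/my-skill/evals

# Run the evaluation; scenarios are expected at tiles/my-skill/evals/
tessl eval run tiles/my-skill/tile.json
```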
You can attach a label to a run to help identify it later in tessl eval list or the Tessl web UI:
By default, evals run using Claude Sonnet. You can specify a different Claude model using --agent:
Supported Claude models:
claude-sonnet-4-6 (default)
claude-opus-4-6 (most capable)
claude-sonnet-4-5
claude-opus-4-5
claude-haiku-4-5 (fastest, lowest cost)
This is useful if you want to evaluate your skill against the specific model you use in production, or to compare how your skill performs across different Claude models. Note that tile evals support one --agent per run — to compare models, run the command once per model and compare the results.
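Since each run accepts a single --agent, one simple way to compare models is a loop, sketched below (the tile path is illustrative):

```shell
# One eval run per model; compare the results afterwards
# with `tessl eval list` or in the Tessl web UI
for model in claude-haiku-4-5 claude-sonnet-4-6 claude-opus-4-6; do
  tessl eval run tiles/my-skill/tile.json --agent "$model"
done
```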
Benchmarking across models made easier
To run evals across Haiku, Sonnet, and Opus in one guided flow — with automatic side-by-side comparison and gap diagnosis — use the review-model-performance skill:
Then ask your agent: "Run model comparison evals"
The CLI prints a URL in the terminal output that you can use to monitor progress and view results in the Tessl web UI.

The id for the next step can be found in the URL (e.g. 019c4791-9eec-7458-b28a-6c94405a3d38).
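For example, using that id with the view command from the next step:

```shell
# Inspect a specific eval run by the id taken from the URL
tessl eval view 019c4791-9eec-7458-b28a-6c94405a3d38
```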
Step 5: Review your results
Eval runs can take time. Use any of these to check status:
Visit the URL from Step 4 (direct link to results)
tessl eval view <id>: view a specific eval run
tessl eval view --last: view the most recent eval run
tessl eval list: list all eval runs with IDs and status
tessl eval view <id> --json: structured details on the eval run
tessl eval retry <id>: retry a failed eval run
If you lose the id for your eval run, run tessl eval list to find it again.
Example output from tessl eval list:

It's not uncommon for an attempt to fail to solve an eval scenario. You can either adjust the scenario or retry it using tessl eval retry <id>.
Step 6: Publish your Tile (Optional)
Since your skill was converted to a tile, you can now manage it at the tile level using the CLI.
To publish your tile to the Tessl registry (a new eval is run only if you have not run one previously, or if the content of your tile has changed since the last eval run):
tessl tile publish
To publish without running a new eval:
tessl tile publish --skip-evals
Note: Tiles created through this flow are published as private by default. To make your tile public, set "private": false in tile.json.
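A minimal sketch of the relevant field in tile.json, with all other fields omitted:

```json
{
  "private": false
}
```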
For more tile management options, run:
tessl tile --help