Evaluate skill quality using scenarios


This page covers scenario-based evals for skills and tiles. If you want to run evals against your codebase using real commit diffs, see Evaluate your codebase instead.

Tessl lets you run end-to-end task evaluations for your skills directly from the CLI. You generate a set of scenarios, run an agent against them, and see how well it performs — with and without your skill injected. This workflow is designed for fast, repeatable iteration as you develop and refine a skill, without building your own eval harness.

TL;DR

Generate scenarios against your skill, then test how the agent does with and without your skill present. For example:

  • You have a skill that describes how to communicate with a system.

  • Generate a set of scenarios against that skill using Tessl, or write your own, covering how to communicate with that system.

  • See whether the agent can complete each scenario's task with and without the skill, to determine how effective the skill is.

Evals vs Reviews

The Review Skills feature checks your skills against best practice, whereas Evaluations generate scenarios and then validate the quality of the skill by testing whether agents perform better on those scenarios with the skill present. Use Lint, Review, and scenario-based evals together to make an effective tile.

What you can do with scenario based evals

With self-serve evaluations, you can:

  • Generate evaluation scenarios for a skill, or write your own

  • Run an agent against those scenarios with and without the skill injected

  • Label runs, choose which Claude model to evaluate against, and compare results

Prerequisites

  • Logged into Tessl.

  • Tessl installed (latest version).

  • Access to a workspace (you must be at least a Publisher).

  • A skill packaged in a tile available locally.


Step 1: Generate evaluation scenarios

There are three ways to create scenarios:

Option A: CLI (quickest — requires an existing tile)

Generation runs server-side. Check progress from the CLI, then download the scenarios to disk once generation completes.


You can also generate scenarios from repository commits instead of a tile — see tessl scenario generate for full options.

Option B: Guided flow with the scenario creator skill

This approach uses a Tessl-provided skill to convert a standalone skill to a tile and generate scenarios in one guided flow.

First, install the scenario creator skill in your project.

Then prompt your agent (e.g. Claude) with something like "Create evaluation scenarios for <my_skill>", where <my_skill> is the name or path to your skill. This will:

  1. Verify Tessl installation

  2. Convert your skill to a tile (if it is not already in one)

  3. Generate an initial set of scenarios

Option C: Write scenarios by hand

Create the directory structure manually. The only constraint is maintaining the structure of the three files per scenario described in Step 2 (additional files will be ignored).

If you need starting sample data or fixtures, those can be inlined into task.md.
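A minimal scaffold for one hand-written scenario might look like this (the evals/ location matches Step 4; the scenario name and all file contents below are illustrative placeholders, not a documented schema):

```shell
# One scenario folder containing the three required files
# ("lookup-user" and everything written below are made-up examples)
mkdir -p evals/lookup-user

# capability.txt: which capability of the skill this scenario tests
echo "Querying user records from the system" > evals/lookup-user/capability.txt

# task.md: the task brief shown to the agent (sample data can be inlined here)
cat > evals/lookup-user/task.md <<'EOF'
Fetch the record for user 42 and report the email address on file.
EOF

# criteria.json: the scoring rubric (left as an empty placeholder;
# copy the real shape from a generated scenario)
echo '{}' > evals/lookup-user/criteria.json
```

Each additional scenario gets its own folder under evals/ with the same three files.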


Step 2: (Optional) Review and edit scenarios

Each scenario contains:

  • capability.txt — which capability of the skill this scenario tests

  • task.md — the task brief shown to the agent

  • criteria.json — the scoring rubric

Tessl auto-generates these, but for best results review them yourself. You're the ultimate authority on what your skill is intended to do and what success looks like.


Automatic scenario generation creates criteria that reflect the instructions in the skill. Review these to check they reflect the outcomes you actually want the skill to achieve.


Step 3: Check tile.json

Open tile.json in your skill folder and ensure the workspace name is set to a workspace you have publisher rights on, and that you have chosen a tile name.

Note that if the tile.json names a workspace you do not have access to, or for which you lack the correct permissions, you will not be able to run the evals.
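As a rough sketch, the relevant fields might look something like this (the field names here are assumptions; check the tile.json generated for your own tile rather than copying this):

```json
{
  "name": "my-skill",
  "workspace": "my-team",
  "private": true
}
```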


Step 4: Run the evaluation

Pass the path to your tile.json. The CLI looks for an evals/ directory inside the same directory as the tile.

For example, if your tile lives at tiles/my-skill/tile.json, your scenarios must be at tiles/my-skill/evals/. If you used Option A above and downloaded scenarios to your project root, move them first.
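A sketch of that move, assuming the illustrative paths above (the if-guard just makes it safe to re-run):

```shell
# Move scenarios downloaded at the project root so they sit next to tile.json
mkdir -p tiles/my-skill
if [ -d evals ]; then
  mv evals tiles/my-skill/evals
fi
```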

You can attach a label to a run to help identify it later in tessl eval list or the Tessl web UI.

By default, evals run using Claude Sonnet. You can specify a different Claude model using --agent.

Supported Claude models:

| Model | Notes |
| --- | --- |
| claude-sonnet-4-6 | Default |
| claude-opus-4-6 | Most capable |
| claude-sonnet-4-5 | |
| claude-opus-4-5 | |
| claude-haiku-4-5 | Fastest, lowest cost |

This is useful if you want to evaluate your skill against the specific model you use in production, or to compare how your skill performs across different Claude models. Note that tile evals support one --agent per run — to compare models, run the command once per model and compare the results.
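Since each run accepts a single --agent, a comparison is just one invocation per model. A sketch of the loop (the echo stands in for your actual eval command, which is not shown here):

```shell
# One eval run per model; replace echo with your real tessl eval invocation
models="claude-haiku-4-5 claude-sonnet-4-6 claude-opus-4-6"
for model in $models; do
  echo "would run eval with --agent $model"
done
```

Labeling each run with the model name makes the results easy to tell apart later.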


Benchmarking across models made easier

To run evals across Haiku, Sonnet, and Opus in one guided flow — with automatic side-by-side comparison and gap diagnosis — use the review-model-performance skill.

Then ask your agent: "Run model comparison evals"

You'll receive a URL in the terminal output to monitor progress and view results in the Tessl web UI.

The id needed for the next step can be found in the URL (e.g. 019c4791-9eec-7458-b28a-6c94405a3d38).
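For instance, if the URL ends with the run id (the URL shape here is an assumption), plain shell can pull it out:

```shell
# Hypothetical results URL; the run id is the last path segment
url="https://app.tessl.io/eval-runs/019c4791-9eec-7458-b28a-6c94405a3d38"
id="${url##*/}"   # strip everything up to the final "/"
echo "$id"        # prints 019c4791-9eec-7458-b28a-6c94405a3d38
```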


Step 5: Review your results

Eval runs can take time. Use any of these to check status:

| Command | Description |
| --- | --- |
| Visit the URL from Step 4 | Direct link to results |
| tessl eval view <id> | View a specific eval run |
| tessl eval view --last | List last eval run with IDs and status |
| tessl eval list | List all eval runs with IDs and status |
| tessl eval view <id> --json | Structured details on the eval run |
| tessl eval retry <id> | Retry a failed eval run |

If you lose the id for your eval run, run tessl eval list to find it again.


It's not uncommon for an attempt to fail to solve an eval scenario. You can either adjust the scenario or retry it using tessl eval retry <id>.


Step 6: Publish your Tile (Optional)

Since your skill was converted to a tile, you can now manage it at the tile level using the CLI.

To publish your tile to the Tessl registry (a new eval will only be run if you have not run an eval previously, or if the content of your tile has changed since the last eval run):

tessl tile publish

To publish without running a new eval:

tessl tile publish --skip-evals

Note: Tiles created through this flow are published as private by default. To make your tile public, set "private": false in tile.json.

For more tile management options, run:

  • tessl tile --help
