Evaluate your codebase agent readiness
This feature is in public beta. Workflows and output formats may change in upcoming releases.
Codebase eval runs require a Tessl project. A Tessl project is the stable home for Tessl in a Git repository. It identifies that repository and acts as the anchor for the eval runs and other repository-connected data attached to it. See Projects overview and Projects for evals.
This page outlines a toolkit for measuring your codebase's agent readiness — how well your context files (skills, rules, documentation) enable an AI agent to complete real tasks on your codebase. It covers scenario definition, running agents with different setups, testing variations, and comparing results.
TL;DR
Tessl Evals is a toolkit for measuring your codebase's agent readiness — specifically, the impact that your agent configuration, model choice, and context files (skills, CLAUDE.md, etc.) are having on your agent's ability to work on your codebase.
For example, you might want to know whether the skills and MD files your agent uses are helping or hurting its ability to complete real tasks. Or you might want to test how a different agent or model would perform on your codebase, and what it would cost.
The key idea is that instead of synthetic tests, you base evaluations on real work that's already been done. Here's how it fits together:
You identify a realistic task. By looking at recent commits, you find a meaningful change that was made to your codebase, for example a commit that added a fraud detection report feature. This becomes the basis for your scenario.
You generate a scenario from it. Tessl analyses the commit diff and produces a task description (what an agent would be asked to do) and a scoring rubric (how to judge whether the agent did it correctly) — essentially reconstructing the intent of the original change as an agent task.
You define the experiment. This is where you choose what to test. You might run the scenario with your context files stripped out to get a baseline — then run it again with them injected, so you can see the delta. Or you might run the same task against multiple agents or models to compare performance and cost.
You read the results. The scores tell you, concretely, the extent to which your agentic setup is helping or hindering your agent's ability to complete the kinds of tasks that are actually being done on your codebase.
The Tessl toolkit includes other tools for improving your skills. Refer to the Overview for more information on reviews, the different kinds of evals, and the skills available.
How it works
Each scenario is solved twice by default:
Baseline — the agent works on the repo with your context files stripped out
With context — the agent works on the repo with your context files injected
Comparing the two scores gives you a concrete agent readiness signal — whether your skills, rules, and documentation are making the agent more effective on real tasks.
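As a rough illustration of the comparison described above, the sketch below computes the per-scenario delta between a baseline score and a with-context score. The scenario names and score values are hypothetical placeholders, not output from Tessl, and score_delta is an illustrative helper rather than part of any Tessl API.

```python
def score_delta(baseline: float, with_context: float) -> float:
    """Return how many points the context files added (negative means they hurt)."""
    return round(with_context - baseline, 2)


# Hypothetical scores (0-100) for two scenarios; real values come from your eval runs.
results = {
    "fraud-detection-report": {"baseline": 55.0, "with_context": 80.0},
    "refactor-billing-module": {"baseline": 70.0, "with_context": 65.0},
}

for name, scores in results.items():
    delta = score_delta(scores["baseline"], scores["with_context"])
    verdict = "helping" if delta > 0 else "hurting" if delta < 0 else "neutral"
    print(f"{name}: {delta:+.1f} points ({verdict})")
```

A positive delta suggests the context files are helping on that kind of task; a negative delta suggests they may be getting in the agent's way.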
Prerequisites
Tessl installed (latest version)
Logged into Tessl
Your GitHub or GitLab account connected in workspace settings
A Tessl project linked to the repository you want to evaluate
Shortcut: use the eval skills
If you'd prefer to have an AI agent walk you through the process interactively, three Tessl skills automate the full pipeline:
tessl-labs/eval-setup — guides you through commit selection, scenario generation, multi-agent configuration, and your first eval run
tessl-labs/eval-improve — analyzes your results, classifies failing criteria, applies targeted fixes to your context files, and re-runs to verify improvement
tessl-labs/scenarios-review — audits your scenario suite for rubric quality, task clarity, coverage gaps, and difficulty spread, and prints a concise quality report
Once installed, invoke them with /eval-setup, /eval-improve, or /scenarios-review in your AI coding agent.
Step 1: (Optional) Browse commits and pick what to evaluate
A scenario defines a task for an agent, the starting state of the codebase, and a scoring rubric to evaluate the agent's output.
Scenarios can be written by hand — skip to Step 5 if you already have them. If you already have specific commit hashes or PR numbers in mind, skip straight to Step 2. Otherwise, use tessl repo select-commits to browse recent commits and choose which ones to turn into scenarios. This command is not listed in tessl --help but works when invoked directly.
Useful flags:
| Flag | Example | Description |
| --- | --- | --- |
| --keyword | --keyword=feat | Filter by commit message keyword |
| --author | --author="Alice" | Filter by author name |
| --since / --until | --since=2026-01-01 | Date range (YYYY-MM-DD) |
| --count / -n | --count=20 | Number of commits to show (1–100) |
| --workspace / -w | --workspace=engteam | Required outside interactive mode |
Output: a table of Hash | Date | Author | Message. Copy the hashes you want to pass to the next step.
Prerequisite: your GitHub or GitLab account must be connected in workspace settings. If it isn't, the error message includes a direct link to the settings page.
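If you script this step, a small wrapper can assemble the invocation from the flags listed above. The sketch below only builds and prints the command line; it does not run it. select_commits_command is a hypothetical helper, and the 1–100 range check mirrors the limit documented for --count.

```python
import shlex


def select_commits_command(workspace, keyword=None, author=None, since=None, count=None):
    """Assemble a `tessl repo select-commits` argument list from the documented flags."""
    args = ["tessl", "repo", "select-commits", f"--workspace={workspace}"]
    if keyword:
        args.append(f"--keyword={keyword}")
    if author:
        args.append(f"--author={author}")
    if since:
        args.append(f"--since={since}")  # expected format: YYYY-MM-DD
    if count is not None:
        if not 1 <= count <= 100:
            raise ValueError("count must be between 1 and 100")
        args.append(f"--count={count}")
    return args


cmd = select_commits_command("engteam", keyword="feat", since="2026-01-01", count=20)
print(shlex.join(cmd))
```

Passing the list to subprocess.run would execute it; printing it first makes it easy to check the flags before anything runs.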
Step 2: (Optional) Generate scenarios from commits or PRs
Tessl analyzes the commit diffs and generates a set of task scenarios. You can provide either commit hashes or pull request numbers:
Flags:
| Flag | Example | Description |
| --- | --- | --- |
| --commits | --commits=abc123,def456 | Comma-separated commit hashes |
| --prs | --prs 42,107 | Comma-separated PR numbers |
| --context | --context="*.mdc,*.md" | Glob patterns identifying your context files |
| --workspace / -w | --workspace=engteam | Required outside interactive mode |
| --json | | Output generation IDs as JSON without polling |
The --context flag
--context tells Tessl which files in your repo are context files — skills, rules, documentation, etc. These patterns are stored in each generated scenario.json as fixture.exclude and serve two purposes:
They are stripped from the repo for the baseline run so the agent works without context
They are injected back for the with-context run so you can measure the delta
When omitted, Tessl defaults to: *.mdc, *.md, tile.json, .tessl-plugin/plugin.json, tessl.json, .tessl/
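To preview which files the default patterns would likely treat as context, you can approximate the matching locally. This is a sketch under stated assumptions: the matching rules below (file-name match for bare globs, full-path match for patterns containing a slash, prefix match for patterns ending in a slash) are a guess at the semantics, not Tessl's confirmed behaviour, and is_context_file is a hypothetical helper.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

# Defaults listed in the documentation above.
DEFAULT_CONTEXT_PATTERNS = [
    "*.mdc", "*.md", "tile.json", ".tessl-plugin/plugin.json", "tessl.json", ".tessl/",
]


def is_context_file(path, patterns=DEFAULT_CONTEXT_PATTERNS):
    """Assumed rules (not confirmed by Tessl):
    - a pattern ending in "/" matches everything under that directory
    - a pattern containing "/" matches the full repo-relative path
    - any other pattern matches the file name at any depth
    """
    name = PurePosixPath(path).name
    for pattern in patterns:
        if pattern.endswith("/"):
            if path.startswith(pattern):
                return True
        elif "/" in pattern:
            if fnmatchcase(path, pattern):
                return True
        elif fnmatchcase(name, pattern):
            return True
    return False


# Hypothetical repo-relative paths.
for path in ["CLAUDE.md", "docs/guide.md", "src/app.py", ".tessl/cache/index.bin"]:
    print(f"{path}: {'context' if is_context_file(path) else 'code'}")
```

If the preview flags files you consider part of the codebase (for example user-facing Markdown docs), pass a narrower --context value when generating scenarios.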
Generation runs server-side. The CLI polls until complete. If you press Ctrl-C, the job keeps running - check on it later with tessl scenario list.
Output
When generation completes, the CLI prints a table with the Scenario ID (the ID you pass to tessl scenario download) and either the source commit or PR:
Each commit or PR produces its own generation with its own Scenario ID. Keep these IDs handy — you'll need them in Step 4 if you passed multiple commits or PRs.
Step 3: (Optional) Review the generation
Applies only if you used tessl scenario generate in Step 2.
Before downloading, you can inspect what was generated.
tessl scenario list shows a table of ID, Workspace, Status, Created By, and Created. tessl scenario view shows metadata and a table of generated scenarios with titles and checklist item counts.
Step 4: (Optional) Download scenarios to disk
Applies only if you used tessl scenario generate in Step 2. If you wrote your own scenarios by hand, skip to Step 5.
Or with a specific ID:
--last downloads scenarios from the single most recent generation. If you passed multiple commits to tessl scenario generate, each commit produces its own generation with its own ID — --last will only get the most recent one. Use tessl scenario list to find the other IDs and download each separately.
Flags:
| Flag | Description |
| --- | --- |
| --last | Download from the most recent generation |
| --output / -o | Output directory (default: evals) |
| --strategy / -s | merge (default) adds alongside existing scenarios; replace clears the directory first |
What lands on disk:
You can edit task.md and criteria.json before running — your edits are picked up at run time. See File formats below.
Step 5: Create a project
Before you run an eval for the first time, you'll need to ensure a Tessl project is created and linked.
From the directory where you want to run your evals, run the following:
This updates your tessl.json to record that the project belongs to the workspace you provided:
Step 6: Run the eval
CLI-triggered eval runs require this directory to be linked to a Tessl project first. See Projects overview.
If this repository is not linked yet, the CLI may ask you to create or link a project before the run starts. See Manage projects from the CLI.
When this directory is linked to a Tessl project, eval runs stay connected to that project over time. This lets Tessl recognise the same work across repeated runs on the same repository.
The project eval type refers to codebase eval runs attached to a Tessl project rather than treated as isolated local runs.
Run from the parent directory of your scenarios folder:
The CLI auto-detects that this is a codebase eval from the scenario.json fixtures and applies smart defaults:
| Setting | Default | Override |
| --- | --- | --- |
| Agent | claude:claude-sonnet-4-6 | --agent=<agent:model> |
| Context pattern | fixture.exclude from scenario.json | --context-pattern="<globs>" |
| Context ref | HEAD (latest commit on default branch) | --context-ref=<HEAD\|SHA> |
| Workspace | (none; required) | --workspace=<name> |
--workspace is required. If any scenarios in the target directory are missing a fixture.exclude field (e.g. scenarios downloaded before the --context flag was introduced), the command will fail with "No context patterns available — scenario.json is missing fixture.exclude". Regenerate those scenarios with --context or move new scenarios into a subdirectory and point the command at that instead.
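A quick local check can surface scenarios without exclude patterns before you start a run. The sketch below is an assumption-laden helper, not a Tessl tool: it looks for the patterns at fixtures.codebase.exclude (the path listed under File formats) and also at fixture.exclude (the name used in the error message), since the exact JSON shape may differ between versions.

```python
import json
from pathlib import Path


def scenarios_missing_exclude(scenarios_dir):
    """Return the scenario.json paths under scenarios_dir that declare no exclude patterns."""
    root = Path(scenarios_dir)
    if not root.is_dir():
        return []
    missing = []
    for path in sorted(root.rglob("scenario.json")):
        data = json.loads(path.read_text(encoding="utf-8"))
        # Assumption: either of these two locations may hold the patterns.
        nested = data.get("fixtures", {}).get("codebase", {}).get("exclude")
        flat = data.get("fixture", {}).get("exclude")
        if not (nested or flat):
            missing.append(path)
    return missing


if __name__ == "__main__":
    # "evals" is the default download directory mentioned in Step 4.
    for path in scenarios_missing_exclude("evals"):
        print(f"missing exclude patterns: {path}")
```

Any path it prints is a candidate for regeneration with --context or for moving out of the directory you point the run at.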
Because the context pattern defaults from the fixture, baseline vs with-context runs happen automatically — no extra flags needed.
Running with a specific agent
By default, evals run using claude:claude-sonnet-4-6. You can override this with --agent:
Each --agent creates a separate eval run. To compare models, pass multiple --agent flags:
Supported Claude models:
| Model | Notes |
| --- | --- |
| claude-sonnet-4-6 | Default |
| claude-opus-4-6 | Most capable |
| claude-sonnet-4-5 | |
| claude-opus-4-5 | |
| claude-haiku-4-5 | Fastest, lowest cost |
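When comparing several models, it can help to validate the names before building the repeated --agent flags. The sketch below assumes the claude:<model> form shown for the default agent; agent_flags is a hypothetical helper that only builds flag strings and does not invoke the CLI.

```python
# Model names copied from the supported Claude models listed above.
SUPPORTED_CLAUDE_MODELS = {
    "claude-sonnet-4-6",
    "claude-opus-4-6",
    "claude-sonnet-4-5",
    "claude-opus-4-5",
    "claude-haiku-4-5",
}


def agent_flags(models):
    """Return one --agent flag per model; each flag results in a separate eval run."""
    flags = []
    for model in models:
        if model not in SUPPORTED_CLAUDE_MODELS:
            raise ValueError(f"unsupported Claude model: {model}")
        flags.append(f"--agent=claude:{model}")
    return flags


print(" ".join(agent_flags(["claude-sonnet-4-6", "claude-haiku-4-5"])))
```

Append the resulting flags to your eval run command to get one run per model.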
Testing updated context against historical scenarios
To test your latest context files against scenarios that were generated from older commits:
--context-ref=HEAD sources context files from the latest commit on the default branch instead of the commit in fixture.ref. This lets you measure how context improvements affect performance on historical tasks.
Output
Ctrl-C detaches without cancelling — runs continue server-side. The CLI prints each run ID so you can check progress later with tessl eval view <id> or tessl eval list.
Results can vary between runs. Because you're evaluating an AI agent, scores are not fully deterministic — the same scenario run twice may produce slightly different results. Treat scores as signals rather than exact measurements, and expect some run-to-run variance when comparing results.
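One way to account for this variance is to repeat each configuration a few times and compare the averages against the spread. The sketch below uses hypothetical scores, and the "delta larger than the combined standard deviations" rule is a simple heuristic chosen for illustration, not a statistical test and not something Tessl prescribes.

```python
from statistics import mean, stdev


def summarize(scores):
    """Return (mean, sample standard deviation) for a list of run scores."""
    return mean(scores), stdev(scores)


# Hypothetical scores from three repeated runs of the same scenario.
baseline_runs = [52.0, 58.0, 55.0]
context_runs = [78.0, 82.0, 80.0]

base_mean, base_sd = summarize(baseline_runs)
ctx_mean, ctx_sd = summarize(context_runs)
delta = ctx_mean - base_mean

# Heuristic: treat the delta as meaningful only if it exceeds the combined spread.
meaningful = abs(delta) > (base_sd + ctx_sd)
print(f"delta={delta:+.1f}, spread={base_sd + ctx_sd:.1f}, meaningful={meaningful}")
```

If the delta is within the spread, run the scenario a few more times before drawing conclusions about your context files.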
Checking status and viewing results
| Command | Description |
| --- | --- |
| tessl eval list | List all eval runs |
| tessl eval list --mine | Only your runs |
| tessl eval list --type project | Only codebase eval runs |
| tessl eval view <id> | Detailed results for a specific run |
| tessl eval view --last | Detailed results for your most recent run |
| tessl eval retry <id> | Re-run a failed eval |
If you lose a run ID, tessl eval list will find it.
--type project refers to the Tessl project entity, not a generic local folder.
File formats
scenario.json
Generated by tessl scenario download. Declares the commit fixture the eval run starts from.
fixtures.codebase.ref: the parent commit hash (the starting state for the agent)
fixtures.codebase.exclude: context patterns stripped for baseline; also used as the default --context-pattern at run time
fixtures.codebase.repoUrl: full clone URL
For the full schema (including description, include, setup, the directory fixture type, and conventional defaults), see scenario.json in the configuration reference.
task.md
Free-form markdown. This is the only file the agent sees — it has no access to criteria.json. Typically structured with Problem, Expected Behavior, and Acceptance Criteria sections. You can edit this freely before running.
criteria.json
Defines how the agent's solution is scored.
Required fields: context, type ("weighted_checklist"), checklist (array with name, description, max_score).
Checklist categories:
| Category | What it checks |
| --- | --- |
| INTENT | Core feature or behavior the change introduces; verifies the solution addresses what the task requests |
| DESIGN | Architectural or structural choices |
| MUST_NOT | Things the solution should avoid or never do |
| MINIMALITY | Appropriate scope of changes: the solution does what's needed without overreaching |
| REUSE | Leveraging existing utilities or patterns rather than reimplementing |
| INTEGRATION | How the solution connects with existing code |
| EDGE_CASE | Boundary conditions handled correctly |
Scoring: (sum of scores / sum of max_scores) × 100. The LLM grader can award partial credit.
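The scoring formula above can be worked through in a few lines. In this sketch the checklist items and awarded scores are hypothetical, and the "score" key is an assumed name for the points the grader awarded; only name and max_score come from the documented criteria.json fields.

```python
def weighted_checklist_score(items):
    """Compute (sum of scores / sum of max_scores) x 100 for graded checklist items."""
    total = sum(item["score"] for item in items)
    maximum = sum(item["max_score"] for item in items)
    if maximum == 0:
        raise ValueError("checklist has no scorable items")
    return total / maximum * 100


# Hypothetical graded checklist; "score" is an assumed field for the awarded points.
graded = [
    {"name": "Adds the report endpoint", "max_score": 10, "score": 10},
    {"name": "Reuses the existing query helper", "max_score": 5, "score": 2.5},  # partial credit
    {"name": "Does not modify unrelated modules", "max_score": 5, "score": 0},
]

print(weighted_checklist_score(graded))  # 12.5 / 20 * 100 = 62.5
```

Because items are weighted by max_score, missing a 10-point criterion costs twice as much as missing a 5-point one.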
Writing your own scenarios
You can hand-author scenarios without using tessl scenario generate:
fixture.ref should be the parent of the ground truth commit. fixture.exclude defines what gets stripped for baseline and serves as the default --context-pattern.
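As a starting point for hand-authoring, the sketch below writes the three scenario files into a folder. Several things here are assumptions rather than confirmed Tessl conventions: the one-folder-per-scenario layout, the fixtures.codebase nesting (taken from the field paths listed under File formats), and the write_scenario helper itself. The repository URL, commit hash, and checklist content are placeholders. If your downloaded scenarios use a different shape, mirror those instead.

```python
import json
import tempfile
from pathlib import Path


def write_scenario(root, name, repo_url, parent_ref, exclude, task_markdown,
                   criteria_context, checklist):
    """Write scenario.json, task.md, and criteria.json for one hand-authored scenario."""
    scenario_dir = Path(root) / name  # assumed layout: one folder per scenario
    scenario_dir.mkdir(parents=True, exist_ok=True)

    scenario = {
        "fixtures": {
            "codebase": {
                "repoUrl": repo_url,
                "ref": parent_ref,   # parent of the ground truth commit
                "exclude": exclude,  # stripped for baseline; default --context-pattern
            }
        }
    }
    criteria = {
        "context": criteria_context,
        "type": "weighted_checklist",
        "checklist": checklist,  # each item needs name, description, max_score
    }

    (scenario_dir / "scenario.json").write_text(json.dumps(scenario, indent=2), encoding="utf-8")
    (scenario_dir / "task.md").write_text(task_markdown, encoding="utf-8")
    (scenario_dir / "criteria.json").write_text(json.dumps(criteria, indent=2), encoding="utf-8")
    return scenario_dir


if __name__ == "__main__":
    # Demo writes to a temporary directory; all values are placeholders.
    with tempfile.TemporaryDirectory() as tmp:
        created = write_scenario(
            root=tmp,
            name="fraud-detection-report",
            repo_url="https://github.com/example-org/example-repo.git",
            parent_ref="abc123",
            exclude=["*.mdc", "*.md"],
            task_markdown="# Problem\n\nAdd a fraud detection report.\n",
            criteria_context="Adds a fraud detection report feature.",
            checklist=[{"name": "Report exists", "description": "A report is generated.", "max_score": 10}],
        )
        print(sorted(p.name for p in created.iterdir()))
```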
Quick reference