Evaluate your codebase
This feature is in public beta. Workflows and output formats may change in upcoming releases.
Tessl lets you run end-to-end evals directly against the codebase your skills are meant to help with — no tile required. An agent runs against a set of task scenarios twice: once without your context files (baseline) and once with them injected. The delta shows you how much your skills, rules, and documentation are actually helping the agent.
You can write scenarios by hand, or let Tessl generate them from real commits in your repo. The commit-based workflow is the quickest way to get started with realistic tasks.
How it works
Each eval scenario is solved twice by default:
Baseline — the agent works on the repo with your context files stripped out
With context — the agent works on the repo with your context files injected
Comparing the two scores tells you whether your skills, rules, and documentation are making the agent more effective on real tasks.
Prerequisites
Tessl installed (latest version)
Logged into Tessl
Your GitHub or GitLab account connected in workspace settings
Step 1: (Optional) Browse commits and pick what to evaluate
Scenarios can be written by hand — skip to Step 4 if you already have them. If you want to base scenarios on real tasks from your codebase, use tessl repo select-commits to browse recent commits and choose which ones to turn into scenarios. This command is not listed in tessl --help but works when invoked directly.
Useful flags:

| Flag | Example | Description |
| --- | --- | --- |
| `--keyword` | `--keyword=feat` | Filter by commit message keyword |
| `--author` | `--author="Alice"` | Filter by author name |
| `--since` / `--until` | `--since=2026-01-01` | Date range (YYYY-MM-DD) |
| `--count` / `-n` | `--count=20` | Number of commits to show (1–100) |
| `--workspace` / `-w` | `--workspace=engteam` | Required outside interactive mode |
Output: a table of Hash | Date | Author | Message. Copy the hashes you want to pass to the next step.
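Combining the filter flags, an invocation might look like this (the keyword, date, and workspace name are placeholder values):

```sh
tessl repo select-commits --keyword=feat --since=2026-01-01 --count=20 --workspace=engteam
```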
Prerequisite: your GitHub or GitLab account must be connected in workspace settings. If it isn't, the error message includes a direct link to the settings page.
Step 2: (Optional) Generate scenarios from commits
Pass the hashes from Step 1 to tessl scenario generate. Tessl analyzes the commit diffs and generates a set of task scenarios for each commit.
Flags:

| Flag | Example | Description |
| --- | --- | --- |
| `--commits` | `--commits=abc123,def456` | Comma-separated commit hashes (required) |
| `--context` | `--context="*.mdc,*.md"` | Glob patterns identifying your context files |
| `--workspace` / `-w` | `--workspace=engteam` | Required outside interactive mode |
| `--json` | | Output generation IDs as JSON without polling |
The --context flag
--context tells Tessl which files in your repo are context files — skills, rules, documentation, etc. These patterns are stored in each generated scenario.json as fixture.exclude and serve two purposes:
They are stripped from the repo for the baseline run so the agent works without context
They are injected back for the with-context run so you can measure the delta
When omitted, Tessl defaults to: *.mdc, *.md, tile.json, tessl.json, .tessl/
Generation runs server-side, and the CLI polls until it completes. If you press Ctrl-C, the job keeps running; check on it later with tessl scenario list.
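Putting the flags together, a generate invocation might look like this (the hashes, globs, and workspace name are placeholders):

```sh
tessl scenario generate --commits=abc123,def456 --context="*.mdc,*.md" --workspace=engteam
```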
Step 3: (Optional) Review the generation run
Applies only if you used tessl scenario generate in Step 2.
Before downloading, you can inspect what was generated.
tessl scenario list shows a table of ID, Workspace, Status, Created By, and Created. tessl scenario view shows metadata and a table of generated scenarios with titles and checklist item counts.
Step 4: Download scenarios to disk
Use tessl scenario download, either with --last or with a specific generation ID.
--last downloads scenarios from the single most recent generation run. If you passed multiple commits to tessl scenario generate, each commit produces its own generation run with its own ID — --last will only get the most recent one. Use tessl scenario list to find the other run IDs and download each separately.
Flags:

| Flag | Description |
| --- | --- |
| `--last` | Download from the most recent generation run |
| `--output` / `-o` | Output directory (default: `evals`) |
| `--strategy` / `-s` | `merge` (default) adds alongside existing scenarios; `replace` clears the directory first |
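For example, to pull the most recent generation run into the default directory and clear out anything already there (a sketch using only the flags documented above):

```sh
tessl scenario download --last --output=evals --strategy=replace
```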
What lands on disk: each downloaded scenario includes a scenario.json, a task.md, and a criteria.json. You can edit task.md and criteria.json before running — your edits are picked up at run time. See File formats below.
Step 5: Run the eval
Run tessl eval run from the parent directory of your scenarios folder, passing that folder to the command.
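A minimal run might look like this (the scenarios directory is assumed to be the default evals, passed positionally, and the workspace name is a placeholder):

```sh
tessl eval run evals --workspace=engteam
```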
The CLI auto-detects that this is a codebase eval from the scenario.json fixtures and applies smart defaults:

| Setting | Default | Override |
| --- | --- | --- |
| Agent | `claude:claude-sonnet-4-6` | `--agent=<agent:model>` |
| Context pattern | `fixture.exclude` from scenario.json | `--context-pattern="<globs>"` |
| Context ref | `infer` (same commit as fixture) | `--context-ref=<infer\|HEAD\|SHA>` |
| Workspace | (none — required) | `--workspace=<name>` |
--workspace is required. If any scenarios in the target directory are missing a fixture.exclude field (e.g. scenarios downloaded before the --context flag was introduced), the command will fail with "No context patterns available — scenario.json is missing fixture.exclude". Regenerate those scenarios with --context or move new scenarios into a subdirectory and point the command at that instead.
Because the context pattern defaults from the fixture, baseline vs with-context runs happen automatically — no extra flags needed.
Running with multiple agents
Each --agent creates a separate eval run. Supported agents and model examples:
| Agent | Model examples |
| --- | --- |
| `claude` | `claude:claude-sonnet-4-6`, `claude:claude-haiku-4-5` |
| `cursor` | `cursor:auto`, `cursor:composer-1.5` |
| `codex` | `codex:o3` |
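For example, to evaluate the same scenarios with two agents as separate runs (the directory and workspace names are placeholders, and passing the directory positionally is an assumption):

```sh
tessl eval run evals --workspace=engteam \
  --agent=claude:claude-sonnet-4-6 \
  --agent=cursor:auto
```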
Testing updated context against historical scenarios
To test your latest context files against scenarios that were generated from older commits:
--context-ref=HEAD sources context files from the latest commit on the default branch instead of the commit in fixture.ref. This lets you measure how context improvements affect performance on historical tasks.
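A sketch of such a run (the directory and workspace names are placeholders):

```sh
tessl eval run evals --workspace=engteam --context-ref=HEAD
```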
Output
Ctrl-C detaches without cancelling — runs continue server-side. The CLI prints each run ID so you can check progress later with tessl eval view <id> or tessl eval list.
Step 6: Compare results
Pass the same scenarios directory you used with tessl eval run. The CLI fingerprints your local scenarios and fetches matching results from the server.
With context comparison (the default when fixture.exclude is present):
Score colours: 🟢 ≥ 80% 🟡 ≥ 50% 🔴 < 50%
Use --breakdown to see per-scenario detail:
Checking status and viewing results
| Command | Description |
| --- | --- |
| `tessl eval list` | List all eval runs |
| `tessl eval list --mine` | Only your runs |
| `tessl eval list --type project` | Only codebase eval runs |
| `tessl eval view <id>` | Detailed results for a specific run |
| `tessl eval view --last` | Detailed results for your most recent run |
| `tessl eval retry <id>` | Re-run a failed eval |
If you lose a run ID, tessl eval list will find it.
File formats
scenario.json
Generated by tessl scenario download. Defines the fixture for the eval run.
- `fixture.ref` — the parent commit hash (the starting state for the agent)
- `fixture.exclude` — context patterns stripped for baseline; also used as the default `--context-pattern` at run time
- `fixture.repoUrl` — full clone URL
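A minimal sketch of the file, assuming the three documented fields sit under a top-level fixture key (the repository URL and hash are invented placeholders):

```json
{
  "fixture": {
    "repoUrl": "https://github.com/acme/example-repo.git",
    "ref": "abc123",
    "exclude": ["*.mdc", "*.md", "tile.json", "tessl.json", ".tessl/"]
  }
}
```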
task.md
Free-form markdown. This is the only file the agent sees — it has no access to criteria.json. Typically structured with Problem, Expected Behavior, and Acceptance Criteria sections. You can edit this freely before running.
criteria.json
Defines how the agent's solution is scored.
Required fields: context, type ("weighted_checklist"), checklist (array with name, description, max_score).
Checklist categories: INTENT · DESIGN · MUST_NOT · MINIMALITY · REUSE
Scoring: (sum of scores / sum of max_scores) × 100. The LLM grader can award partial credit.
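A minimal sketch, assuming each checklist item pairs a category name with a free-form description (the task and descriptions are invented examples):

```json
{
  "context": "Add retry logic to the HTTP client",
  "type": "weighted_checklist",
  "checklist": [
    { "name": "INTENT", "description": "Failed requests are retried with backoff", "max_score": 5 },
    { "name": "MUST_NOT", "description": "Unrelated modules are left untouched", "max_score": 3 }
  ]
}
```

With these weights, a grader awarding 5 and 2 points would score (5 + 2) / (5 + 3) × 100 = 87.5%.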
Writing your own scenarios
You can hand-author scenarios without using tessl scenario generate:
fixture.ref should be the parent of the ground truth commit. fixture.exclude defines what gets stripped for baseline and serves as the default --context-pattern.
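A hand-authored scenario might be laid out like this (the directory-per-scenario layout is an assumption based on the downloaded format described above):

```
evals/
  my-scenario/
    scenario.json   # fixture.ref, fixture.exclude, fixture.repoUrl
    task.md         # the task shown to the agent
    criteria.json   # weighted checklist used for grading
```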
Quick reference