
ts-bench v2: Benchmarking Coding Agents on Real TypeScript Monorepo Tasks

ts-bench is a benchmark CLI for comparing AI coding agents on TypeScript workloads.

Version 1 used small, self-contained Exercism TypeScript exercises. That was useful at first, but modern coding models and agents have largely outgrown that level of difficulty. Many top systems can now solve those tasks perfectly, which makes the benchmark less useful for comparison.

ts-bench v2 moves to a harder target: real-world SWE-Lancer tasks from Expensify, a large React Native / TypeScript monorepo.

The goal is not to produce a definitive universal leaderboard. The goal is more practical:

Measure the actual agent/model combinations people use in daily development — Claude Code, Codex CLI, Copilot CLI, Gemini CLI, opencode, and others — on realistic TypeScript tasks.

Agent Harness vs. Evaluation Harness

In this project, I use the word “harness” in two different senses.

An agent harness is the runtime that sends prompts to an LLM, calls tools, edits files, and runs commands. Examples include Claude Code, Codex CLI, Copilot CLI, Nano Code, and opencode.

An evaluation harness is the system that defines tasks, prepares the environment, runs the agent, executes tests, and aggregates results.

ts-bench is an evaluation harness. It lets you swap different agent harnesses and compare them under the same task setup.

This distinction matters because model performance and agent-harness performance are not the same thing. The same model can behave differently depending on the tool loop, prompt, file context, authentication method, and execution environment around it.
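One way to picture the split, as a rough sketch rather than ts-bench's actual code: the evaluation harness owns the tasks and the checks, and each agent harness is a swappable object behind a common interface. Every name here (`Task`, `AgentHarness`, `evaluate`, and the two toy agents) is hypothetical and exists only for illustration.

```typescript
// Hypothetical types for illustration; not taken from the ts-bench source.
interface Task {
  id: string;
  // Returns true when the edited workspace passes the task's check.
  check: (workspace: string) => boolean;
}

interface AgentHarness {
  name: string;
  // Takes the broken workspace text and returns the edited workspace text.
  solve: (taskId: string, workspace: string) => string;
}

// The evaluation harness: same tasks, same checks, any agent.
function evaluate(agent: AgentHarness, tasks: Task[], broken: string): number {
  let solved = 0;
  for (const task of tasks) {
    const edited = agent.solve(task.id, broken);
    if (task.check(edited)) solved += 1;
  }
  return solved;
}

// Two toy agents standing in for real CLIs.
const noopAgent: AgentHarness = { name: "noop", solve: (_id, ws) => ws };
const fixingAgent: AgentHarness = {
  name: "fixer",
  solve: (_id, ws) => ws.replace("BUG", "FIX"),
};

const toyTasks: Task[] = [{ id: "toy-1", check: (ws) => !ws.includes("BUG") }];

console.log(evaluate(noopAgent, toyTasks, "line with BUG")); // 0
console.log(evaluate(fixingAgent, toyTasks, "line with BUG")); // 1
```

The point of the shape is that `evaluate` never needs to know which agent it is running, so any difference in the score comes from the agent side.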


Why v2?

The original v1 benchmark was based on Exercism TypeScript exercises. These are small programming tasks with clear tests. They are easy to run and useful for early comparisons, but they eventually became too easy.

This is not unique to ts-bench. Other benchmarks based on small programming exercises have also faced saturation. Once many systems reach perfect or near-perfect scores, the benchmark can no longer separate practical differences between agents.

For v2, ts-bench uses SWE-Lancer tasks instead.

SWE-Lancer is based on real software-engineering tasks from Expensify. The public offline IC SWE Diamond set contains 198 tasks. These tasks are closer to the kind of work coding agents are actually used for: fixing UI behavior, validation logic, chat interactions, markdown rendering, focus handling, and other application-level issues in a large codebase.

For the first v2 snapshot, I used a 5-task default set with a total listed reward of $20,000.

Task ID    Reward    Difficulty
14958      $8,000    Hard
15815_1    $4,000    Medium
15193      $4,000    Medium
14268      $2,000    Easy-Med
20079      $2,000    Easy-Med

These rewards are not used as the benchmark score. They are used as a rough signal that these are non-trivial real-world tasks.
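As a quick check of the arithmetic, the default set can be written down as data and summed. The task IDs, rewards, and difficulty labels are copied from the table above; the type and variable names are illustrative and not part of ts-bench.

```typescript
// Values copied from the task table; names are illustrative only.
interface DefaultTask {
  id: string;
  rewardUsd: number;
  difficulty: string;
}

const defaultTasks: DefaultTask[] = [
  { id: "14958", rewardUsd: 8000, difficulty: "Hard" },
  { id: "15815_1", rewardUsd: 4000, difficulty: "Medium" },
  { id: "15193", rewardUsd: 4000, difficulty: "Medium" },
  { id: "14268", rewardUsd: 2000, difficulty: "Easy-Med" },
  { id: "20079", rewardUsd: 2000, difficulty: "Easy-Med" },
];

// Total listed reward across the set.
const totalRewardUsd = defaultTasks.reduce((sum, t) => sum + t.rewardUsd, 0);

// Comma-separated ID list, the same shape a task-selection flag would take.
const taskIdList = defaultTasks.map((t) => t.id).join(",");

console.log(totalRewardUsd); // 20000
console.log(taskIdList); // 14958,15815_1,15193,14268,20079
```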

SWE-Lancer task browser

How v2 Works

ts-bench v2 runs SWE-Lancer tasks inside Docker.

At a high level, each run does the following:

1. Start the SWE-Lancer environment.
2. Apply the task’s bug_reintroduce.patch to create the broken starting state.
3. Run the selected coding agent against the task.
4. Run the official test harness.
5. Store the result as JSON and publish it to the results site.
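The steps above can be sketched as a single function that calls each stage in order. This is only an outline under assumptions: the `PipelineSteps` interface, `runTask`, `TaskResult`, and the stub steps are hypothetical names, and the real stages run Docker and the agent CLI rather than pushing strings into a log.

```typescript
// Hypothetical step functions; the real ts-bench implementation may differ.
interface PipelineSteps {
  startEnvironment: (taskId: string) => Promise<void>;
  applyPatch: (taskId: string, patchFile: string) => Promise<void>;
  runAgent: (taskId: string) => Promise<void>;
  runTests: (taskId: string) => Promise<boolean>;
}

interface TaskResult {
  taskId: string;
  passed: boolean;
  durationMs: number;
}

async function runTask(taskId: string, steps: PipelineSteps): Promise<TaskResult> {
  const startedAt = Date.now();
  await steps.startEnvironment(taskId); // step 1
  await steps.applyPatch(taskId, "bug_reintroduce.patch"); // step 2
  await steps.runAgent(taskId); // step 3
  const passed = await steps.runTests(taskId); // step 4
  return { taskId, passed, durationMs: Date.now() - startedAt };
}

// Stub steps that only record the order in which they were called.
function makeStubSteps(log: string[]): PipelineSteps {
  return {
    startEnvironment: async () => { log.push("start"); },
    applyPatch: async (_id, patch) => { log.push(`patch:${patch}`); },
    runAgent: async () => { log.push("agent"); },
    runTests: async () => { log.push("tests"); return true; },
  };
}

const demoLog: string[] = [];
runTask("14958", makeStubSteps(demoLog)).then((result) => {
  // Step 5: the result is plain data, so it serializes directly to JSON.
  console.log(JSON.stringify(result));
  console.log(demoLog.join(" -> "));
});
```

Passing the steps in as an object keeps the ordering logic testable without Docker, which is the only reason for that design choice in this sketch.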

A typical command looks like this:

bun src/index.ts \
  --dataset v2 \
  --docker \
  --tasks 14958,15815_1,15193,14268,20079 \
  --agent codex \
  --model gpt-5.4
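To compare several agent/model combinations on the same tasks, the command can be generated per combination. The sketch below only assembles argument lists from the flags shown in the command above and does not execute anything. `buildArgs` and `Combo` are hypothetical helpers, and the second agent/model pair is borrowed from the results table on the assumption that the CLI accepts those names.

```typescript
// Hypothetical helper; uses only the flags that appear in the command above.
interface Combo {
  agent: string;
  model: string;
}

function buildArgs(taskIds: string[], combo: Combo): string[] {
  return [
    "src/index.ts",
    "--dataset", "v2",
    "--docker",
    "--tasks", taskIds.join(","),
    "--agent", combo.agent,
    "--model", combo.model,
  ];
}

const selectedTasks = ["14958", "15815_1", "15193", "14268", "20079"];

// The second pair is an assumption about accepted agent/model names.
const combos: Combo[] = [
  { agent: "codex", model: "gpt-5.4" },
  { agent: "claude", model: "claude-sonnet-4-6" },
];

for (const combo of combos) {
  // Print the command line instead of running it.
  console.log(["bun", ...buildArgs(selectedTasks, combo)].join(" "));
}
```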

The important design goal is reproducibility. I want users to be able to run the benchmark with their own accounts, subscriptions, API keys, models, and agent configurations — not just read a static leaderboard produced by someone else.


April 2026 Snapshot

The current v2 results are only an early snapshot. Models, agents, prompts, and the evaluation code will continue to change.

Still, the first run already shows why measuring real agent/model combinations is useful.

ts-bench v2 tier list
Agent / Model                      Tier    Solved    Time
copilot / claude-sonnet-4.6        A       4/5       64.8 min
claude / claude-sonnet-4-6         B       3/5       48.1 min
codex / gpt-5.4                    B       3/5       26.2 min
gemini / gemini-3.1-pro-preview    B       3/5       70.6 min
copilot / claude-opus-4.6          C       2/5       67.6 min
gemini / gemini-2.5-flash          C       2/5       18.8 min
claude / claude-haiku-4-5          D       1/5       34.4 min
claude / claude-opus-4-6           D       1/5       31.2 min
gemini / gemini-3-flash-preview    D       1/5       53.4 min
codex / gpt-5.4-mini               F       0/5       28.1 min
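In this snapshot the tier tracks the solved count one-to-one (4 solved is A, down to 0 solved is F). The sketch below encodes that observed mapping and groups the rows by tier. The mapping is inferred from the table only and is not a documented ts-bench rule, and no tier is assumed for 5/5 because no row shows one. `tierFor` and the other names are hypothetical.

```typescript
// Mapping inferred from the snapshot table; 5/5 is not observed, so it is left out.
const observedTiers: Record<number, string> = { 4: "A", 3: "B", 2: "C", 1: "D", 0: "F" };

function tierFor(solved: number): string | undefined {
  return observedTiers[solved];
}

// Rows copied from the snapshot table.
const snapshot = [
  { combo: "copilot / claude-sonnet-4.6", solved: 4 },
  { combo: "claude / claude-sonnet-4-6", solved: 3 },
  { combo: "codex / gpt-5.4", solved: 3 },
  { combo: "gemini / gemini-3.1-pro-preview", solved: 3 },
  { combo: "copilot / claude-opus-4.6", solved: 2 },
  { combo: "gemini / gemini-2.5-flash", solved: 2 },
  { combo: "claude / claude-haiku-4-5", solved: 1 },
  { combo: "claude / claude-opus-4-6", solved: 1 },
  { combo: "gemini / gemini-3-flash-preview", solved: 1 },
  { combo: "codex / gpt-5.4-mini", solved: 0 },
];

// Group agent/model combinations by tier.
const byTier: Record<string, string[]> = {};
for (const row of snapshot) {
  const tier = tierFor(row.solved) ?? "unknown";
  byTier[tier] = [...(byTier[tier] ?? []), row.combo];
}

for (const tier of Object.keys(byTier)) {
  console.log(`${tier}: ${byTier[tier].join(", ")}`);
}
```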

A few observations stand out:

These are not final conclusions. They are reasons to keep measuring.

A per-agent result page showing task-level pass/fail details

Why Tiers Instead of a Single Ranking?

A small number of tasks is too noisy for a precise ranking. One task can move a model up or down.
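To put a rough number on that noise, a Wilson score interval can be computed for a pass rate measured on five tasks. This is a standard statistical formula, not something ts-bench itself reports, and it treats the tasks as independent draws from a larger pool, which is a simplifying assumption. The function name is illustrative.

```typescript
// Wilson score interval for a pass rate of `solved` out of `total`.
// z = 1.96 corresponds to roughly 95% confidence.
function wilsonInterval(solved: number, total: number, z = 1.96): [number, number] {
  const p = solved / total;
  const z2 = z * z;
  const denom = 1 + z2 / total;
  const center = (p + z2 / (2 * total)) / denom;
  const half =
    (z * Math.sqrt((p * (1 - p)) / total + z2 / (4 * total * total))) / denom;
  return [Math.max(0, center - half), Math.min(1, center + half)];
}

for (const solved of [2, 3, 4]) {
  const [lo, hi] = wilsonInterval(solved, 5);
  console.log(`${solved}/5: ${(lo * 100).toFixed(0)}% to ${(hi * 100).toFixed(0)}%`);
}
```

Under these assumptions, 3/5 gives an interval of roughly 23% to 88% and 4/5 roughly 38% to 96%. The two overlap almost entirely, which is why a one-task difference should not be read as a precise rank.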

For practical use, tiers are more useful than exact rank order. I mostly want to know whether an agent/model combination is:

The long-term goal is to increase the sample size and report more stable tier distributions.


Why This Costs Money

v2 is much more expensive to run than v1.

The benchmark now involves:

Even the small 5-task default set takes tens of minutes per agent/model combination. Expanding the benchmark to more tasks, more agents, and more models makes the cost grow quickly.
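A back-of-the-envelope estimate shows how quickly the run time adds up. The minutes below are the Time column from the April 2026 snapshot, read here as total wall-clock time for the 5-task set, which is an assumption. The projection is a naive linear one with sequential runs and ignores API or subscription costs entirely. `projectHours` is a hypothetical helper.

```typescript
// Time column from the April 2026 snapshot, in minutes per agent/model combination.
const minutesPerCombo = [64.8, 48.1, 26.2, 70.6, 67.6, 18.8, 34.4, 31.2, 53.4, 28.1];

const totalMinutes = minutesPerCombo.reduce((sum, m) => sum + m, 0);
const avgMinutes = totalMinutes / minutesPerCombo.length;

// Naive projection: assumes time grows linearly with the number of tasks
// and that runs are executed one after another.
function projectHours(
  avgMinutesPerRun: number,
  tasksPerRun: number,
  targetTasks: number,
  comboCount: number,
): number {
  const minutesPerTask = avgMinutesPerRun / tasksPerRun;
  return (minutesPerTask * targetTasks * comboCount) / 60;
}

console.log(`snapshot total: ${totalMinutes.toFixed(1)} min`);
console.log(`average per combination: ${avgMinutes.toFixed(1)} min`);
console.log(`198 tasks x 10 combinations: ${projectHours(avgMinutes, 5, 198, 10).toFixed(1)} h`);
```

The ten combinations in the snapshot add up to roughly 443 minutes, a little over seven hours, for five tasks each.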

Some of the next engineering tasks are also about reducing this cost:


Support the Project

ts-bench v2 is an ongoing project. I want to keep updating the leaderboard, add more agents and models, and increase the number of tasks so the results become more useful.

If you find this work useful, please consider sponsoring it.

👉 Sponsor the project: https://github.com/sponsors/laiso

Related links: