# Benchmarks
Running CORAL on standard coding benchmarks.
CORAL can be used to run agents on standardized benchmarks. This guide covers setup and execution.
## Available benchmark examples
The `examples/` directory includes task configurations for several benchmarks:
| Task | Description |
|---|---|
| `circle_packing` | Geometric optimization: pack circles in a unit square |
| `erdos` | Math conjecture exploration |
| `kernel_builder` | VLIW SIMD kernel optimization |
| `kernel_engineering` | GPU kernel optimization |
| `mnist` | ML classification |
| `spaceship_titanic` | Kaggle competition (classification) |
| `stanford_covid_vaccine` | mRNA degradation prediction |
| `swebench-verified` | Meta-solver optimization over SWE-bench Verified (Harbor) |
| `terminal-bench` | Meta-solver optimization over terminal-bench (Harbor) |
## Running a benchmark
```bash
# Example: circle packing optimization
coral start -c examples/circle_packing/task.yaml

# With more agents
coral start -c examples/circle_packing/task.yaml --agents 4

# With web dashboard
coral start -c examples/circle_packing/task.yaml --ui
```

## Writing benchmark graders
Benchmark graders follow the same `TaskGrader` pattern. For example, the circle packing grader (sketched below):

- Runs the agent's `initial_program.py`
- Verifies constraints (circles within bounds, no overlaps)
- Computes `sum_radii / best_known_result` as the score
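A minimal sketch of that shape, assuming the agent's program prints its circles as JSON `[x, y, r]` triples (the class name, method signature, output format, and reference value here are illustrative, not CORAL's actual API):

```python
import json
import math
import subprocess

BEST_KNOWN_RESULT = 2.635  # hypothetical reference value; the real task pins its own

class CirclePackingGrader:  # stands in for CORAL's TaskGrader base class
    def grade(self, worktree: str) -> float:
        # Run the agent's program; assume it prints [[x, y, r], ...] as JSON.
        proc = subprocess.run(
            ["python", "initial_program.py"],
            cwd=worktree, capture_output=True, text=True, timeout=300,
        )
        circles = json.loads(proc.stdout)

        # Constraint 1: every circle lies fully inside the unit square.
        for x, y, r in circles:
            if not (r <= x <= 1 - r and r <= y <= 1 - r):
                return 0.0

        # Constraint 2: no two circles overlap.
        for i, (x1, y1, r1) in enumerate(circles):
            for x2, y2, r2 in circles[i + 1:]:
                if math.hypot(x1 - x2, y1 - y2) < r1 + r2 - 1e-9:
                    return 0.0

        # Score: sum of radii, normalized by the best known packing.
        return sum(r for _, _, r in circles) / BEST_KNOWN_RESULT
```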
See *Writing a Custom Grader* for the full guide.
## Harbor-based benchmarks (SWE-bench, terminal-bench)
SWE-bench Verified and terminal-bench are evaluated via Harbor, which runs each instance inside its own Docker container. CORAL does not ship them as Python extras; they are regular example tasks whose graders shell out to `uvx harbor run`.
**Docker-in-Docker is not supported.** Harbor spawns Docker containers, so CORAL itself must run on the host machine (not inside a container) when using these benchmarks.
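The shell-out itself is straightforward. Here is a sketch of what a grader's Harbor invocation can look like, using only the flags shown in the smoke test at the end of this page and assuming `-l` limits the number of instances; how the pass rate is read back from Harbor's results is omitted, since it depends on Harbor's output format:

```python
import subprocess

def run_harbor(worktree: str, n_instances: int) -> None:
    """Invoke Harbor against the agent's current solve.py.
    Flags mirror the smoke-test command below; the real grader
    also collects per-instance results to compute a pass rate."""
    subprocess.run(
        ["uvx", "harbor", "run",
         "-d", "terminal-bench@2.0",
         "--agent-import-path", "solve:SolverAgent",
         "-l", str(n_instances)],
        cwd=worktree,
        check=True,
    )
```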
### Meta-solver pattern
Both tasks follow a meta-solver optimization pattern: agents iteratively improve a `solve.py` (a Terminus2-based Harbor agent) to maximize the pass rate across the benchmark's instances. The seed directory already contains a working baseline.
### Tiered evaluation
Every `coral eval` runs a single tier, selected based on the agent's best previous score, which avoids redundantly re-running cheap tiers:
| Tier | Instances | Advances when best score ≥ |
|---|---|---|
| 1 | 5 | 0.3 |
| 2 | 30 | 0.7 |
| 3 | all | — |
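The selection logic implied by the table is a simple threshold check on the best previous score. As a sketch (the actual implementation lives inside the grader and may differ):

```python
def select_tier(best_prev_score: float,
                tier1_threshold: float = 0.3,
                tier2_threshold: float = 0.7) -> int:
    """Pick the single tier to run, using the default thresholds above."""
    if best_prev_score >= tier2_threshold:
        return 3  # all instances
    if best_prev_score >= tier1_threshold:
        return 2  # 30 instances by default
    return 1      # 5 instances by default
```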
Tier sizes and thresholds are configurable via `grader.args.tier1_size`, `tier1_threshold`, `tier2_size`, and `tier2_threshold` in the task's `task.yaml`.
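For example, a `task.yaml` excerpt reproducing the defaults (only the `grader.args` keys named above are taken from this guide; the surrounding structure is illustrative):

```yaml
grader:
  args:
    tier1_size: 5         # instances evaluated at tier 1
    tier1_threshold: 0.3  # best score needed to advance to tier 2
    tier2_size: 30
    tier2_threshold: 0.7  # best score needed to advance to tier 3
```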
### Running
```bash
# SWE-bench Verified (~500 instances at tier 3)
coral start -c examples/swebench-verified/task.yaml

# terminal-bench
coral start -c examples/terminal-bench/task.yaml
```

Each run writes detailed agent trajectories, terminal recordings, and verifier output to `harbor_logs/` in the worktree, which is useful for understanding why specific instances failed.
### Local smoke test
To test the baseline solver outside of CORAL:
```bash
uvx harbor run -d terminal-bench@2.0 \
  --agent-import-path "solve:SolverAgent" \
  -l 5
```