Benchmarks

Running CORAL on standard coding benchmarks.

CORAL can run agents on standardized coding benchmarks. This guide covers setup and execution.

Available benchmark examples

The examples/ directory includes task configurations for several benchmarks:

Task                      Description
circle_packing            Geometric optimization — pack circles in a unit square
erdos                     Math conjecture exploration
kernel_builder            VLIW SIMD kernel optimization
kernel_engineering        GPU kernel optimization
mnist                     ML classification
spaceship_titanic         Kaggle competition (classification)
stanford_covid_vaccine    mRNA degradation prediction
swebench-verified         Meta-solver optimization over SWE-bench Verified (Harbor)
terminal-bench            Meta-solver optimization over terminal-bench (Harbor)

Running a benchmark

# Example: circle packing optimization
coral start -c examples/circle_packing/task.yaml

# With more agents
coral start -c examples/circle_packing/task.yaml --agents 4

# With web dashboard
coral start -c examples/circle_packing/task.yaml --ui

Writing benchmark graders

Benchmark graders follow the same TaskGrader pattern. For example, the circle packing grader:

  1. Runs the agent's initial_program.py
  2. Verifies constraints (circles within bounds, no overlaps)
  3. Computes sum_radii / best_known_result as the score
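A minimal sketch of those steps in Python (the real TaskGrader interface is not shown here; the function shape and the (x, y, radius) output format are assumptions):

import math

def grade_circle_packing(circles: list[tuple[float, float, float]],
                         best_known_result: float) -> float:
    # Step 2a: every circle must lie fully inside the unit square.
    for x, y, r in circles:
        if x - r < 0 or x + r > 1 or y - r < 0 or y + r > 1:
            return 0.0
    # Step 2b: no two circles may overlap (tangency is allowed).
    for i, (x1, y1, r1) in enumerate(circles):
        for x2, y2, r2 in circles[i + 1:]:
            if math.hypot(x1 - x2, y1 - y2) < r1 + r2:
                return 0.0
    # Step 3: normalize the sum of radii against the best known packing.
    return sum(r for _, _, r in circles) / best_known_result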

See Writing a Custom Grader for the full guide.

Harbor-based benchmarks (SWE-bench, terminal-bench)

SWE-bench Verified and terminal-bench are evaluated via Harbor, which runs each instance inside its own Docker container. CORAL does not ship them as Python extras — they are regular example tasks that shell out to uvx harbor run from their grader.

Docker-in-Docker is not supported. Harbor spawns Docker containers, so CORAL itself must run on the host machine (not inside a container) when using these benchmarks.
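As a sketch, such a grader might shell out to Harbor with subprocess. The flags below mirror the local smoke test at the end of this guide; the function shape and result handling are assumptions:

import subprocess

def run_harbor(n_instances: int) -> subprocess.CompletedProcess:
    # Run the first n_instances of terminal-bench against the current solve.py.
    return subprocess.run(
        ["uvx", "harbor", "run",
         "-d", "terminal-bench@2.0",
         "--agent-import-path", "solve:SolverAgent",
         "-l", str(n_instances)],
        capture_output=True,
        text=True,
        check=False,  # failed instances should lower the score, not raise
    )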

Meta-solver pattern

Both tasks follow a meta-solver optimization pattern: agents iteratively improve a solve.py (a Terminus2-based Harbor agent) to maximize the pass rate across the benchmark's instances. The seed directory already contains a working baseline.

Tiered evaluation

Each coral eval runs exactly one tier, chosen from the agent's best previous score, so an agent that has already cleared a cheap tier is not re-evaluated on it:

Tier    Instances    Advances when best score ≥
1       5            0.3
2       30           0.7
3       all          n/a (final tier)

Tier sizes and thresholds are configurable via grader.args (tier1_size, tier1_threshold, tier2_size, tier2_threshold) in the task's task.yaml.
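For example, an override in task.yaml might look like this (the nesting is inferred from the dotted key paths above; the values are illustrative):

grader:
  args:
    tier1_size: 10        # evaluate 10 instances at tier 1 instead of 5
    tier1_threshold: 0.4
    tier2_size: 50
    tier2_threshold: 0.8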

Running

# SWE-bench Verified (~500 instances at tier 3)
coral start -c examples/swebench-verified/task.yaml

# terminal-bench
coral start -c examples/terminal-bench/task.yaml

Each run writes detailed agent trajectories, terminal recordings, and verifier output to harbor_logs/ in the worktree — useful for understanding why specific instances failed.

Local smoke test

To test the baseline solver outside of CORAL:

uvx harbor run -d terminal-bench@2.0 \
  --agent-import-path "solve:SolverAgent" \
  -l 5