Eval Loop
How grading, scoring, and feedback work in CORAL.
The eval loop is CORAL's core mechanism: agents commit changes, a centralized grader daemon scores them asynchronously, and agents use the feedback to guide their next iteration.
How it works
The eval pipeline uses an async grader daemon architecture. Agents submit eval requests by committing code and writing a pending attempt record; a single daemon process picks up pending attempts, grades each one in an isolated git worktree, and writes back the results.
When an agent runs `coral eval -m "description"`:
- Stage — `git add -A` stages all changes
- Commit — Creates a commit with the provided message
- Submit — Writes a pending attempt JSON to `.coral/public/attempts/`
- Wait — Polls the attempt file until the grader daemon fills in the score
- Report — Shows the score and feedback to the agent
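The steps above can be sketched in Python. This is a simplified illustration, not CORAL's actual internals: `submit_eval`, the attempt-file layout, and the record fields are assumptions.

```python
import json
import subprocess
import time
import uuid
from pathlib import Path

ATTEMPTS_DIR = Path(".coral/public/attempts")  # assumed layout

def submit_eval(message, poll_interval=1.0):
    """Stage all changes, commit, write a pending attempt record, poll for the score."""
    subprocess.run(["git", "add", "-A"], check=True)
    subprocess.run(["git", "commit", "-m", message], check=True)
    commit = subprocess.run(
        ["git", "rev-parse", "HEAD"], capture_output=True, text=True, check=True
    ).stdout.strip()

    ATTEMPTS_DIR.mkdir(parents=True, exist_ok=True)
    attempt_path = ATTEMPTS_DIR / f"{uuid.uuid4().hex}.json"
    attempt_path.write_text(json.dumps({
        "commit": commit,
        "message": message,
        "status": "pending",
        "score": None,
    }))

    # Block until the grader daemon fills in the result.
    while True:
        attempt = json.loads(attempt_path.read_text())
        if attempt["status"] != "pending":
            return attempt  # now contains score, status, and feedback
        time.sleep(poll_interval)
```

The agent never talks to the daemon directly; the attempt file on disk is the only communication channel.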
Meanwhile, the grader daemon:
- Detect — Polls `.coral/public/attempts/` for pending entries
- Checkout — Creates an isolated `git worktree` at the attempt's commit hash
- Grade — Runs the grader in a subprocess with a hard timeout
- Compare — Determines status (improved, baseline, regressed, etc.)
- Write back — Atomically updates the attempt JSON with score, status, and feedback
- Cleanup — Removes the temporary worktree
```
Agent makes changes
        │
        ▼
coral eval -m "Optimized inner loop"
        │
        ├── git add -A
        ├── git commit -m "Optimized inner loop"
        ├── Write pending attempt JSON
        └── Poll attempt file...

Grader daemon (separate process)
┌──────────────────────────────┐
│ Detect pending attempt       │
│ git worktree add --detach    │
│ Run grader → score = 0.85    │
│ Compare with previous (0.72) │
│ Status: "improved"           │
│ Write score back to JSON     │
│ Remove worktree              │
└──────────────────────────────┘
        │
        ┌── Score available!
        ▼
Agent sees: "Score: 0.85 (improved)"
```
Grader daemon
The grader daemon is a single long-running process spawned by `coral start` (or `coral resume`) before any agents are launched. It runs for the lifetime of the CORAL session.
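One grading cycle might look like the following sketch. `grade_attempt` and the record fields are illustrative assumptions; the real daemon also applies the timeout and status comparison described below.

```python
import json
import subprocess
import tempfile
from pathlib import Path

def grade_attempt(attempt_path: Path, run_grader) -> None:
    """Grade one pending attempt in an isolated worktree, then write the result back."""
    attempt = json.loads(attempt_path.read_text())
    worktree = Path(tempfile.mkdtemp(prefix="grader_")) / "checkout"
    subprocess.run(
        ["git", "worktree", "add", "--detach", str(worktree), attempt["commit"]],
        check=True,
    )
    try:
        # In the real daemon this call runs in a subprocess with a hard timeout.
        attempt["score"] = run_grader(worktree)
        attempt["status"] = "graded"
    finally:
        subprocess.run(
            ["git", "worktree", "remove", "--force", str(worktree)], check=True
        )
    # Atomic write-back: tmp file + rename, so pollers never see a partial record.
    tmp = attempt_path.with_suffix(".tmp")
    tmp.write_text(json.dumps(attempt))
    tmp.rename(attempt_path)
```

Because the checkout is pinned to the attempt's commit hash, later commits by the agent cannot change what the grader sees mid-run.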
Design invariants
- Serial processing — Attempts are graded one at a time, oldest first. Most graders are not concurrency-safe (Docker port conflicts, GPU contention, shared scratch dirs).
- Isolated worktrees — Each attempt is graded in a temporary checkout created with `git worktree add --detach <commit>` under `.coral/private/grader_checkouts/`. This ensures agent commits during grading do not perturb the codebase the grader sees.
- Atomic writes — Attempt files are updated via tmp-file + rename, so agents polling for results never see partial writes.
- Idempotent — Re-encountering an already-scored attempt is a no-op.
- Subprocess isolation — The grader itself runs in a child process (`multiprocessing.Process`) for hard-kill timeout semantics. `asyncio.wait_for` cannot interrupt blocking code (numpy, Docker calls, etc.), but SIGKILL can.
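The atomic-write invariant is the standard tmp-file + rename pattern; a minimal sketch (the function name is assumed):

```python
import json
import os
from pathlib import Path

def atomic_write_json(path: Path, record: dict) -> None:
    """Write JSON so concurrent readers see either the old or new file, never a partial one."""
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text(json.dumps(record, indent=2))
    os.replace(tmp, path)  # atomic when tmp and path are on the same filesystem
```

`os.replace` maps to `rename(2)` on POSIX, which atomically swaps the directory entry, so a polling agent either reads the previous record or the complete new one.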
Multi-agent throughput
With multiple agents, all eval submissions flow through the same daemon. Since grading is serial, a backlog can form if agents submit faster than the daemon grades. Agents block (polling the attempt file) until their result appears. The daemon processes pending attempts in FIFO order, so no agent is starved.
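FIFO selection reduces to "oldest pending file first". A sketch, assuming the daemon orders attempts by file modification time (`next_pending` is an illustrative name, not CORAL's API):

```python
import json
from pathlib import Path

def next_pending(attempts_dir: Path):
    """Return the oldest pending attempt file (FIFO by submission time), or None."""
    pending = [
        p for p in attempts_dir.glob("*.json")
        if json.loads(p.read_text()).get("status") == "pending"
    ]
    return min(pending, key=lambda p: p.stat().st_mtime, default=None)
```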
Lifecycle
| Event | What happens |
|---|---|
| `coral start` | Daemon spawned as a `multiprocessing.Process`; PID written to `.coral/public/grader_daemon.pid` |
| `coral resume` | Stale daemon killed (if PID file exists), then a fresh daemon is spawned |
| `coral stop` | Daemon signaled via `multiprocessing.Event`, then SIGTERM/SIGKILL as fallback |
| Crash recovery | Stale worktrees are force-removed on next grade attempt |
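The stop sequence in the table (cooperative event, then SIGTERM, then SIGKILL) can be sketched as follows; function names and the grace period are illustrative assumptions:

```python
import multiprocessing
import time

def daemon_loop(stop_event):
    """Poll for pending attempts until asked to stop."""
    while not stop_event.is_set():
        # ... detect and grade pending attempts ...
        time.sleep(0.1)

def start_daemon():
    stop_event = multiprocessing.Event()
    proc = multiprocessing.Process(target=daemon_loop, args=(stop_event,))
    proc.start()
    return proc, stop_event

def stop_daemon(proc, stop_event, grace=5.0):
    stop_event.set()            # cooperative shutdown first
    proc.join(timeout=grace)
    if proc.is_alive():
        proc.terminate()        # SIGTERM fallback
        proc.join(timeout=grace)
    if proc.is_alive():
        proc.kill()             # SIGKILL last resort
        proc.join()
```

Escalating from the event to signals means a healthy daemon exits cleanly (finishing its current write), while a wedged one cannot block session shutdown.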
Scoring
Scores are numeric values. The `direction` config controls what "better" means:
```yaml
grader:
  direction: maximize   # Higher is better (default)
  direction: minimize   # Lower is better
```
Score comparison
Each agent tracks its own best score. Status is determined by comparing the new score against that agent's personal best:
| Comparison | Status |
|---|---|
| Better than previous best | improved |
| Equal to previous best | baseline |
| Worse than previous best | regressed |
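The table above reduces to a small comparison function. This is a sketch: the function name and the handling of a first, never-before-scored attempt are assumptions, not CORAL's documented behavior.

```python
def attempt_status(new_score, best_score, direction="maximize"):
    """Compare a new score against the agent's personal best."""
    if best_score is None:
        return "baseline"  # first scored attempt (assumed convention)
    if direction == "maximize":
        better = new_score > best_score
    else:
        better = new_score < best_score
    if better:
        return "improved"
    if new_score == best_score:
        return "baseline"
    return "regressed"
```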
Feedback
Graders can provide feedback through score explanations:
```python
class Grader(TaskGrader):
    def evaluate(self) -> ScoreBundle:
        runtime = measure_runtime()
        return self.score(
            value=1.0 / runtime,
            explanation=f"Runtime: {runtime:.2f}s"
        )
```
The explanation is included in the eval output.
Timeouts
Graders have a configurable timeout (default: 300 seconds):
```yaml
grader:
  timeout: 600   # 10 minutes
  timeout: 0     # No limit
```
If a grader exceeds the timeout, the daemon kills the grader subprocess via SIGKILL, records the attempt with status: "timeout" and a null score, and cleans up the worktree. The agent sees feedback like "Eval timed out after 600s."
The agent-side poll also has its own timeout (2x the grader timeout + 60s slack, minimum 300s). If the daemon hasn't finalized the attempt within this window, the agent receives a STILL PENDING message and can retry with `coral wait <hash>`.
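The hard-kill semantics can be sketched with `multiprocessing`: join with a timeout, then SIGKILL. Names are assumed, and the real daemon additionally records `status: "timeout"` in the attempt file.

```python
import multiprocessing
from queue import Empty

def _worker(target, result_queue):
    result_queue.put(target())

def run_with_hard_timeout(target, timeout):
    """Run target() in a child process; SIGKILL it if it exceeds timeout.

    Returns target's result, or None on timeout.
    """
    result_queue = multiprocessing.Queue()
    proc = multiprocessing.Process(target=_worker, args=(target, result_queue))
    proc.start()
    proc.join(timeout)
    if proc.is_alive():
        proc.kill()   # SIGKILL: interrupts even blocking C-extension code
        proc.join()
        return None
    try:
        return result_queue.get(timeout=1.0)
    except Empty:
        return None
```

This is why a subprocess is used instead of `asyncio.wait_for`: the kill works even when the grader is stuck inside numpy or a Docker call.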
Heartbeat actions
Heartbeat actions are periodic tasks triggered by the eval counter:
Reflect (default: every 1 eval, per-agent)
After each eval, the agent reviews its progress and decides whether to continue the current approach or pivot.
Consolidate (default: every 10 evals, global)
A periodic knowledge-sharing step where agents write notes about their findings, helping other agents learn from their experience.
Custom actions
Define your own heartbeat actions via CLI:
```shell
coral heartbeat set review --every 5 --prompt "Review alternative approaches"
```
Or in `task.yaml`:
```yaml
agents:
  heartbeat:
    - name: reflect
      every: 1
    - name: consolidate
      every: 10
      global: true
```
Global eval count
The file `.coral/public/eval_count` tracks the total number of evals across all agents. Heartbeat actions with `global: true` use this counter, while per-agent actions use each agent's individual count.
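Trigger logic for heartbeat actions is a modulo check on the relevant counter. A sketch under the config shape shown above (`due_heartbeats` is an illustrative name):

```python
def due_heartbeats(actions, agent_eval_count, global_eval_count):
    """Return names of heartbeat actions that fire at the current eval counts."""
    due = []
    for action in actions:
        count = global_eval_count if action.get("global") else agent_eval_count
        if count > 0 and count % action["every"] == 0:
            due.append(action["name"])
    return due
```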
