Writing a Custom Grader
Build evaluation logic for your CORAL tasks.
Graders evaluate agent submissions and return scores. CORAL loads a grader from a Python entrypoint string and runs it inside an isolated venv that the framework manages, so your grader's dependencies stay separate from CORAL's own and from the agent's worktree.
For open-ended tasks where you'd rather have an LLM judge score against a rubric than write a programmatic grader, see Rubric Judges. Two reusable judge grader packages ship under
examples/— static and dynamic rubric — and the rest of this page applies unchanged.
Quick start
The fastest way to scaffold a new task — coral init my-task — drops a
task.yaml plus a packaged grader stub you can edit:
from coral.grader import TaskGrader
from coral.types import ScoreBundle
class Grader(TaskGrader):
def evaluate(self) -> float | ScoreBundle:
result = self.run_program("solution.py")
return float(result.stdout.strip())The grader ships as a small Python package so CORAL can install it via
uv pip install into the grader venv. The repo's examples/erdos/grader/
is a minimal reference.
Layout
my-task/
├── task.yaml
├── seed/
│ └── solution.py
└── grader/
├── pyproject.toml
└── src/my_task_grader/
├── __init__.py # from .grader import Grader
└── grader.py # class Grader(TaskGrader): ...pyproject.toml:
[project]
name = "my-task-grader"
version = "0.1.0"
requires-python = ">=3.11"
dependencies = ["coral", "numpy"]
[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"
[tool.hatch.build.targets.wheel]
packages = ["src/my_task_grader"]Wire it up in task.yaml
grader:
entrypoint: "my_task_grader.grader:Grader"
setup:
- "uv pip install -e ./grader"
timeout: 300
direction: maximize
args:
program_file: "solution.py"When coral start (or coral validate) runs, it:
- Creates
.coral/private/grader_venv/viauv venv. - Editable-installs
coralinto that venv so your package'scoraldependency resolves locally. - Runs every shell command under
grader.setupwithVIRTUAL_ENV/PATHpointed at the grader venv. - Spawns a worker subprocess in that venv whenever an attempt needs grading and ships the result back as JSON.
The venv lives inside .coral/private/, which is denied to agents by the
worktree permission rules — your grader source stays hidden.
TaskGrader helpers
TaskGrader provides several helper methods.
run_program(filename, *args, timeout=300)
Run a file from the agent's codebase:
result = self.run_program("solution.py", "--input", "data.csv")
print(result.stdout) # captured stdout
print(result.stderr) # captured stderr
print(result.returncode) # exit codeHidden data files
Test data, answer keys, and helper modules ship inside the grader package —
the convention is a taskdata/ directory next to grader.py:
from pathlib import Path
_TASKDATA = Path(__file__).parent / "taskdata"
answers = _TASKDATA / "answers" / "test_labels.npz"This works for editable and wheel installs alike, and hatchling bundles
non-Python files under packages automatically. For files that can't live
in the package (very large datasets, files shared across tasks), list them
under grader.private in task.yaml — they are copied to
.coral/private/<name> and readable via self.private_dir.
score(value, explanation="")
Return a scored result with optional feedback:
return self.score(0.85, "Runtime: 1.2s (target: < 1.0s)")fail(explanation="")
Return a failed evaluation (null score):
if result.returncode != 0:
return self.fail(f"Program crashed: {result.stderr[:200]}")Available context
Inside evaluate(), you have access to:
| Attribute | Description |
|---|---|
self.codebase_path | Absolute path to the agent's worktree |
self.private_dir | Absolute path to .coral/private/ |
self.args | Dict of extra arguments from grader.args in config |
Handling errors
class Grader(TaskGrader):
def evaluate(self) -> float | ScoreBundle:
result = self.run_program("solution.py")
if result.returncode != 0:
return self.fail(f"Crashed: {result.stderr[:300]}")
try:
output = float(result.stdout.strip())
except ValueError:
return self.fail(f"Invalid output: {result.stdout[:100]}")
if output < 0:
return self.fail(f"Score must be non-negative, got {output}")
return self.score(output, f"Score: {output:.4f}")Score direction
By default, higher scores are better. For tasks where lower is better:
grader:
direction: minimizePrivate files
Files listed in grader.private are copied to .coral/private/ and hidden
from agents. Use this for test data, expected outputs, or reference
implementations that don't fit inside your grader package:
grader:
private:
- "answers/test_cases.json"
- "answers/reference_output.txt"Access them in your grader via self.private_dir:
expected = (Path(self.private_dir) / "answers" / "test_cases.json").read_text()Grader timeout
grader:
timeout: 600 # 10 minutesIf the grader exceeds this timeout, the worker subprocess is killed and the attempt is recorded as a failure with a "timed out" explanation.
Validating your grader
coral validate my-taskThis builds the grader venv, runs grader.setup, then evaluates the seed
code against your grader and prints the result. Fix any issues before
running coral start.
Migrating from grader.type / grader.module
These fields have been removed. Tasks that still set them will fail to load with a clear migration error. Replace the legacy block:
# old (no longer accepted)
grader:
type: function
module: my_module
args: {func_name: grade}with the entrypoint form:
grader:
entrypoint: "my_pkg.grader:Grader"
setup:
- "uv pip install -e ./grader"If you were using FunctionGrader to wrap a plain callable, rewrite it as a
small TaskGrader subclass and ship it inside your grader package.
