Writing a Custom Grader
Build evaluation logic for your CORAL tasks.
Graders evaluate agent submissions and return scores. CORAL loads a grader from a Python entrypoint string and runs it inside an isolated venv that the framework manages, so your grader's dependencies stay separate from CORAL's own and from the agent's worktree.
For open-ended tasks where you'd rather have an LLM judge score against a rubric than write a programmatic grader, see Rubric Judges. Two reusable judge grader packages ship under `examples/` — static and dynamic rubric — and the rest of this page applies unchanged.
Quick start (deprecated path)
The fastest way to scaffold a new task — `coral init my-task` — drops a `task.yaml` plus an `eval/grader.py` file you can edit:
```python
from coral.grader import TaskGrader
from coral.types import ScoreBundle


class Grader(TaskGrader):
    def evaluate(self) -> float | ScoreBundle:
        result = self.run_program("solution.py")
        return float(result.stdout.strip())
```

This auto-discovery path still works but emits a `DeprecationWarning` on load. Once your grader is non-trivial, migrate it to a package (next section). The same `TaskGrader` API applies in both setups.
Recommended: package your grader
Wrap your grader in a small Python package so CORAL can install it via `uv pip install` into the grader venv. The repo's `examples/erdos/grader/` is a minimal reference.
Layout
```
my-task/
├── task.yaml
├── seed/
│   └── solution.py
└── grader/
    ├── pyproject.toml
    └── src/my_task_grader/
        ├── __init__.py   # from .grader import Grader
        └── grader.py     # class Grader(TaskGrader): ...
```

pyproject.toml:
```toml
[project]
name = "my-task-grader"
version = "0.1.0"
requires-python = ">=3.11"
dependencies = ["coral", "numpy"]

[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[tool.hatch.build.targets.wheel]
packages = ["src/my_task_grader"]
```

Wire it up in task.yaml
```yaml
grader:
  entrypoint: "my_task_grader.grader:Grader"
  setup:
    - "uv pip install -e ./grader"
  timeout: 300
  direction: maximize
  args:
    program_file: "solution.py"
```

When `coral start` (or `coral validate`) runs, it:
- Creates `.coral/private/grader_venv/` via `uv venv`.
- Editable-installs `coral` into that venv so your package's `coral` dependency resolves locally.
- Runs every shell command under `grader.setup` with `VIRTUAL_ENV`/`PATH` pointed at the grader venv.
- Spawns a worker subprocess in that venv whenever an attempt needs grading and ships the result back as JSON.
The venv lives inside `.coral/private/`, which is denied to agents by the worktree permission rules — your grader source stays hidden.
TaskGrader helpers
`TaskGrader` provides several helper methods. They behave the same whether your grader runs from a package or from `eval/grader.py`.
`run_program(filename, *args, timeout=300)`
Run a file from the agent's codebase:
```python
result = self.run_program("solution.py", "--input", "data.csv")
print(result.stdout)      # captured stdout
print(result.stderr)      # captured stderr
print(result.returncode)  # exit code
```

`read_eval(relative_path)` / `read_eval_path(relative_path)`
These read from the legacy private `eval/` directory. When your grader ships as a package, prefer `importlib.resources` to load bundled data:
```python
import importlib.resources

data_dir = str(importlib.resources.files("my_task_grader.data"))
```

`score(value, explanation="")`
Return a scored result with optional feedback:
```python
return self.score(0.85, "Runtime: 1.2s (target: < 1.0s)")
```

`fail(explanation="")`
Return a failed evaluation (null score):
```python
if result.returncode != 0:
    return self.fail(f"Program crashed: {result.stderr[:200]}")
```

Available context
Inside evaluate(), you have access to:
| Attribute | Description |
|---|---|
| `self.codebase_path` | Absolute path to the agent's worktree |
| `self.private_dir` | Absolute path to `.coral/private/` |
| `self.args` | Dict of extra arguments from `grader.args` in config |
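As an illustration of how these attributes combine, here is a small standalone helper. The name `resolve_program` and its fallback behavior are this page's invention, not part of the CORAL API: `self.args` supplies the file name and `self.codebase_path` anchors it in the worktree.

```python
from pathlib import Path


def resolve_program(codebase_path: str, args: dict, default: str = "solution.py") -> Path:
    """Resolve the program file named in grader.args against the agent's worktree."""
    name = args.get("program_file", default)
    path = Path(codebase_path) / name
    if not path.is_file():
        raise FileNotFoundError(f"{name} not found in agent worktree")
    return path
```

Inside `evaluate()` you could call `resolve_program(self.codebase_path, self.args)` before handing the name to `run_program`.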
Handling errors
```python
class Grader(TaskGrader):
    def evaluate(self) -> float | ScoreBundle:
        result = self.run_program("solution.py")
        if result.returncode != 0:
            return self.fail(f"Crashed: {result.stderr[:300]}")
        try:
            output = float(result.stdout.strip())
        except ValueError:
            return self.fail(f"Invalid output: {result.stdout[:100]}")
        if output < 0:
            return self.fail(f"Score must be non-negative, got {output}")
        return self.score(output, f"Score: {output:.4f}")
```

Score direction
By default, higher scores are better. For tasks where lower is better:
```yaml
grader:
  direction: minimize
```

Private files
Files listed in `grader.private` are copied to `.coral/private/` and hidden from agents. Use this for test data, expected outputs, or reference implementations that don't fit inside your grader package:
```yaml
grader:
  private:
    - "answers/test_cases.json"
    - "answers/reference_output.txt"
```

Access them in your grader via `self.private_dir`:
```python
from pathlib import Path

expected = (Path(self.private_dir) / "answers" / "test_cases.json").read_text()
```

Grader timeout
```yaml
grader:
  timeout: 600  # 10 minutes
```

If the grader exceeds this timeout, the worker subprocess is killed and the attempt is recorded as a failure with a "timed out" explanation.
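If your grader launches child processes itself (outside `run_program`), you can bound them the same way with the standard library. This standalone sketch is not CORAL-specific:

```python
import subprocess
import sys

# Bound a child process with a hard timeout, mirroring how run_program's
# timeout bounds agent code.
try:
    result = subprocess.run(
        [sys.executable, "-c", "print('ok')"],
        capture_output=True, text=True, timeout=5,
    )
    output = result.stdout.strip()
except subprocess.TimeoutExpired:
    output = "timed out"
print(output)  # → ok
```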
Validating your grader
```shell
coral validate my-task
```

This builds the grader venv, runs `grader.setup`, then evaluates the seed code against your grader and prints the result. Fix any issues before running `coral start`.
Migrating from `grader.type` / `grader.module`
These fields have been removed. Tasks that still set them will fail to load with a clear migration error. Replace the legacy block:
```yaml
# old (no longer accepted)
grader:
  type: function
  module: my_module
  args: {func_name: grade}
```

with the entrypoint form:
```yaml
grader:
  entrypoint: "my_pkg.grader:Grader"
  setup:
    - "uv pip install -e ./grader"
```

If you were using `FunctionGrader` to wrap a plain callable, rewrite it as a small `TaskGrader` subclass and ship it inside your grader package.
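A sketch of that rewrite. The stand-in base class below exists only so the snippet runs outside a CORAL install; in your grader package you would subclass `coral.grader.TaskGrader` and return `self.score(...)` rather than a bare float:

```python
# Stand-in base so this sketch runs without CORAL; in a real grader
# package, subclass coral.grader.TaskGrader instead.
class TaskGraderStandIn:
    pass


# The plain callable FunctionGrader used to wrap (hypothetical example).
def grade(stdout: str) -> float:
    return float(stdout.strip())


class Grader(TaskGraderStandIn):
    def evaluate(self) -> float:
        # Real code: result = self.run_program("solution.py"); stdout = result.stdout
        stdout = "0.85\n"  # placeholder output for the standalone sketch
        return grade(stdout)
```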
