Writing a Custom Grader

Graders evaluate agent submissions and return scores. CORAL loads a grader from a Python entrypoint string and runs it inside an isolated venv that the framework manages, so your grader's dependencies stay separate from CORAL's own and from the agent's worktree.

For open-ended tasks where you'd rather have an LLM judge score against a rubric than write a programmatic grader, see Rubric Judges. Two reusable judge grader packages ship under examples/ (static rubric and dynamic rubric), and the rest of this page applies to them unchanged.

Quick start (deprecated path)

The fastest way to scaffold a new task — coral init my-task — drops a task.yaml plus an eval/grader.py file you can edit:

from coral.grader import TaskGrader
from coral.types import ScoreBundle

class Grader(TaskGrader):
    def evaluate(self) -> float | ScoreBundle:
        result = self.run_program("solution.py")
        return float(result.stdout.strip())

This auto-discovery path still works but emits a DeprecationWarning on load. Once your grader is non-trivial, migrate it to a package (next section). The same TaskGrader API applies in both setups.

Recommended: package your grader

Wrap your grader in a small Python package so CORAL can install it via uv pip install into the grader venv. The repo's examples/erdos/grader/ is a minimal reference.

Layout

my-task/
├── task.yaml
├── seed/
│   └── solution.py
└── grader/
    ├── pyproject.toml
    └── src/my_task_grader/
        ├── __init__.py        # from .grader import Grader
        └── grader.py          # class Grader(TaskGrader): ...

pyproject.toml:

[project]
name = "my-task-grader"
version = "0.1.0"
requires-python = ">=3.11"
dependencies = ["coral", "numpy"]

[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[tool.hatch.build.targets.wheel]
packages = ["src/my_task_grader"]

Wire it up in `task.yaml`

grader:
  entrypoint: "my_task_grader.grader:Grader"
  setup:
    - "uv pip install -e ./grader"
  timeout: 300
  direction: maximize
  args:
    program_file: "solution.py"

When coral start (or coral validate) runs, it:

1. Creates .coral/private/grader_venv/ via uv venv.
2. Editable-installs coral into that venv so your package's coral dependency resolves locally.
3. Runs every shell command under grader.setup with VIRTUAL_ENV / PATH pointed at the grader venv.
4. Spawns a worker subprocess in that venv whenever an attempt needs grading and ships the result back as JSON.

The venv lives inside .coral/private/, which is denied to agents by the worktree permission rules — your grader source stays hidden.
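To make "VIRTUAL_ENV / PATH pointed at the grader venv" concrete, here is a minimal sketch of how a setup command can be run against a specific venv. It is not CORAL's implementation; the function names `venv_environment` and `run_setup_command` are hypothetical and only illustrate the general technique.

```python
import os
import subprocess
from pathlib import Path


def venv_environment(venv_dir, base_env=None):
    """Return a copy of the environment with VIRTUAL_ENV and PATH aimed at venv_dir."""
    env = dict(os.environ if base_env is None else base_env)
    # Virtual environments keep executables in Scripts/ on Windows and bin/ elsewhere.
    bin_dir = Path(venv_dir) / ("Scripts" if os.name == "nt" else "bin")
    env["VIRTUAL_ENV"] = str(venv_dir)
    # Putting the venv first means its executables win lookups by bare name.
    env["PATH"] = str(bin_dir) + os.pathsep + env.get("PATH", "")
    # PYTHONHOME can override the venv's interpreter layout, so drop it.
    env.pop("PYTHONHOME", None)
    return env


def run_setup_command(command, venv_dir, cwd):
    """Run one setup command through the shell with the venv environment applied."""
    return subprocess.run(
        command,
        shell=True,
        cwd=cwd,
        env=venv_environment(venv_dir),
        capture_output=True,
        text=True,
    )
```

Because the venv's executable directory comes first on PATH, a setup line such as uv pip install -e ./grader would typically install into the grader venv rather than into whatever environment launched the command.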

TaskGrader helpers

TaskGrader provides several helper methods. They behave the same whether your grader runs from a package or from eval/grader.py.

`run_program(filename, *args, timeout=300)`

Run a file from the agent's codebase:

result = self.run_program("solution.py", "--input", "data.csv")
print(result.stdout)    # captured stdout
print(result.stderr)    # captured stderr
print(result.returncode)  # exit code

`read_eval(relative_path)` / `read_eval_path(relative_path)`

These read from the legacy private eval/ directory. When your grader ships as a package, prefer importlib.resources to load bundled data:

import importlib.resources

data_dir = str(importlib.resources.files("my_task_grader.data"))
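If the bundled data is a single file rather than a whole directory, reading it through the object returned by files() avoids building path strings by hand. The sketch below assumes a JSON file shipped inside the package; the helper name `load_bundled_json` and the file name cases.json are hypothetical.

```python
import importlib.resources
import json


def load_bundled_json(package, resource):
    """Parse a JSON file that ships inside an importable package."""
    # files() returns a Traversable; reading through it does not assume the
    # package is unpacked on a regular filesystem.
    data_file = importlib.resources.files(package).joinpath(resource)
    return json.loads(data_file.read_text(encoding="utf-8"))
```

Inside evaluate() this could be called as load_bundled_json("my_task_grader.data", "cases.json"), provided my_task_grader.data is importable and the file is included when the package is built.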

`score(value, explanation="")`

Return a scored result with optional feedback:

return self.score(0.85, "Runtime: 1.2s (target: < 1.0s)")

`fail(explanation="")`

Return a failed evaluation (null score):

if result.returncode != 0:
    return self.fail(f"Program crashed: {result.stderr[:200]}")

Available context

Inside evaluate(), you have access to:

Attribute	Description
`self.codebase_path`	Absolute path to the agent's worktree
`self.private_dir`	Absolute path to `.coral/private/`
`self.args`	Dict of extra arguments from `grader.args` in config
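As an illustration of combining these attributes, the sketch below resolves which file to grade from the program_file argument shown in the task.yaml example, falling back to solution.py when it is absent. The helper name `resolve_program` is hypothetical; it takes the values of self.args and self.codebase_path as plain parameters so it can be exercised outside CORAL.

```python
from pathlib import Path


def resolve_program(args, codebase_path, default="solution.py"):
    """Pick the file to grade from grader.args, falling back to a default name."""
    # args mirrors self.args: a plain dict built from grader.args in task.yaml.
    name = args.get("program_file", default)
    # codebase_path mirrors self.codebase_path: the agent's worktree.
    return Path(codebase_path) / name
```

In a grader, evaluate() might call resolve_program(self.args, self.codebase_path) and check that the returned path exists before running it.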

Handling errors

class Grader(TaskGrader):
    def evaluate(self) -> float | ScoreBundle:
        result = self.run_program("solution.py")

        if result.returncode != 0:
            return self.fail(f"Crashed: {result.stderr[:300]}")

        try:
            output = float(result.stdout.strip())
        except ValueError:
            return self.fail(f"Invalid output: {result.stdout[:100]}")

        if output < 0:
            return self.fail(f"Score must be non-negative, got {output}")

        return self.score(output, f"Score: {output:.4f}")
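One gap worth noting in checks like the ones above: float() accepts strings such as "nan" and "inf", and nan < 0 evaluates to False, so a non-finite value would pass a plain negativity check. The sketch below pulls the same checks into a pure function with an added finiteness guard; the name `interpret_output` is hypothetical, and keeping the logic free of self makes it easy to unit test.

```python
import math


def interpret_output(returncode, stdout, stderr):
    """Map a finished program run to (score, explanation); score is None on failure."""
    if returncode != 0:
        return None, f"Crashed: {stderr[:300]}"
    try:
        value = float(stdout.strip())
    except ValueError:
        return None, f"Invalid output: {stdout[:100]}"
    # float() parses "nan" and "inf" without error, so reject them explicitly.
    if not math.isfinite(value):
        return None, f"Score must be finite, got {value}"
    if value < 0:
        return None, f"Score must be non-negative, got {value}"
    return value, f"Score: {value:.4f}"
```

Inside evaluate(), the pair can then be forwarded: call self.fail(explanation) when the score is None, and self.score(value, explanation) otherwise.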

Score direction

By default, higher scores are better. For tasks where lower is better:

grader:
  direction: minimize
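The effect of direction can be pictured as a change in how two scores are compared. The function below is only an illustration of that semantics, not CORAL's ranking code; the name `is_improvement` is hypothetical.

```python
def is_improvement(candidate, best, direction="maximize"):
    """Return True when candidate beats best under the given direction."""
    if direction not in ("maximize", "minimize"):
        raise ValueError(f"unknown direction: {direction!r}")
    if best is None:
        # Nothing to compare against yet, so any score is an improvement.
        return True
    if direction == "maximize":
        return candidate > best
    return candidate < best
```

With direction: minimize, a grader that measures runtime or error can presumably return the raw measurement as its score instead of negating it.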

Private files

Files listed in grader.private are copied to .coral/private/ and hidden from agents. Use this for test data, expected outputs, or reference implementations that don't fit inside your grader package:

grader:
  private:
    - "answers/test_cases.json"
    - "answers/reference_output.txt"

Access them in your grader via self.private_dir:

from pathlib import Path
expected = (Path(self.private_dir) / "answers" / "test_cases.json").read_text()
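A slightly fuller sketch of using such a private file follows. It assumes, purely for illustration, that test_cases.json maps case names to expected output strings; the real file format is up to the task author, and the helper name `fraction_correct` is hypothetical.

```python
import json
from pathlib import Path


def fraction_correct(private_dir, outputs):
    """Compare program outputs with expected answers stored under private_dir."""
    cases_path = Path(private_dir) / "answers" / "test_cases.json"
    # Assumed layout: {"case name": "expected output", ...}
    expected = json.loads(cases_path.read_text(encoding="utf-8"))
    if not expected:
        raise ValueError(f"no test cases found in {cases_path}")
    # A missing output counts as wrong; surrounding whitespace is ignored.
    correct = sum(
        1
        for name, want in expected.items()
        if outputs.get(name, "").strip() == want.strip()
    )
    return correct / len(expected)
```

A grader could collect one output per case with run_program, pass the resulting dict together with self.private_dir, and hand the fraction to self.score.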

Grader timeout

grader:
  timeout: 600   # 10 minutes

The value is in seconds. If the grader exceeds this timeout, the worker subprocess is killed and the attempt is recorded as a failure with a "timed out" explanation.
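The kill-on-timeout behaviour can be pictured with the standard library alone. This is a minimal sketch of the general pattern (run a worker, parse JSON from its stdout, convert a timeout into a failed result), not CORAL's worker code; the function name `run_with_timeout` and the result keys are assumptions.

```python
import json
import subprocess
import sys


def run_with_timeout(command, timeout):
    """Run a command and return a JSON-serializable result, failing on timeout."""
    try:
        # subprocess.run kills the child process if the timeout expires,
        # then raises TimeoutExpired.
        proc = subprocess.run(command, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return {"score": None, "explanation": f"timed out after {timeout}s"}
    if proc.returncode != 0:
        return {"score": None, "explanation": f"exited with code {proc.returncode}"}
    # The worker is expected to print a single JSON object on stdout.
    return json.loads(proc.stdout)


if __name__ == "__main__":
    slow = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(run_with_timeout(slow, timeout=0.5))
```

Since run_program takes its own timeout argument, keeping that value below grader.timeout should let the grader return a specific self.fail explanation before the outer limit is reached.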

Validating your grader

coral validate my-task

This builds the grader venv, runs grader.setup, then evaluates the seed code against your grader and prints the result. Fix any issues before running coral start.

Migrating from `grader.type` / `grader.module`

These fields have been removed. Tasks that still set them will fail to load with a clear migration error. Replace the legacy block:

# old (no longer accepted)
grader:
  type: function
  module: my_module
  args: {func_name: grade}

with the entrypoint form:

grader:
  entrypoint: "my_pkg.grader:Grader"
  setup:
    - "uv pip install -e ./grader"

If you were using FunctionGrader to wrap a plain callable, rewrite it as a small TaskGrader subclass and ship it inside your grader package.
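A rough sketch of that rewrite is shown below. The signature of the legacy callable is an assumption (here `grade` takes the worktree path and returns a float), and the `TaskGrader` class in the snippet is a stand-in that mimics only the score() and fail() helpers described on this page, so the example runs without CORAL installed.

```python
from pathlib import Path


class TaskGrader:
    """Stand-in for coral.grader.TaskGrader so this sketch runs on its own.

    In a real grader package, delete this class and use
    `from coral.grader import TaskGrader` instead.
    """

    codebase_path = "."

    def score(self, value, explanation=""):
        return {"score": value, "explanation": explanation}

    def fail(self, explanation=""):
        return {"score": None, "explanation": explanation}


def grade(codebase_path):
    """Hypothetical legacy callable: scores a worktree by its number of .py files."""
    return float(len(list(Path(codebase_path).glob("*.py"))))


class Grader(TaskGrader):
    def evaluate(self):
        # Delegate to the old callable, turning exceptions into failed attempts.
        try:
            value = grade(self.codebase_path)
        except Exception as exc:
            return self.fail(f"grade() raised {exc!r}")
        return self.score(value, f"grade() returned {value}")
```

The subclass stays thin on purpose: the existing callable keeps its logic, and the wrapper only adapts its return value and errors to the helpers documented above.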
