CORAL
Guides

Writing a Custom Grader

Build evaluation logic for your CORAL tasks.

Graders evaluate agent submissions and return scores. CORAL loads a grader from a Python entrypoint string and runs it inside an isolated venv that the framework manages, so your grader's dependencies stay separate from CORAL's own and from the agent's worktree.
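An entrypoint string has the form "package.module:ClassName". Conceptually it resolves like this (a sketch of the idea only, not CORAL's actual loader, which runs inside the managed venv):

```python
import importlib

def load_entrypoint(spec: str):
    """Resolve a "pkg.module:ClassName" string to the named object."""
    module_name, _, attr = spec.partition(":")
    module = importlib.import_module(module_name)
    return getattr(module, attr)

# e.g. load_entrypoint("my_task_grader.grader:Grader") -> the Grader class
```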

For open-ended tasks where you'd rather have an LLM judge score against a rubric than write a programmatic grader, see Rubric Judges. Two reusable judge grader packages (one for static rubrics, one for dynamic) ship under examples/, and the rest of this page applies to them unchanged.

Quick start (deprecated path)

The fastest way to scaffold a new task — coral init my-task — drops a task.yaml plus an eval/grader.py file you can edit:

from coral.grader import TaskGrader
from coral.types import ScoreBundle

class Grader(TaskGrader):
    def evaluate(self) -> float | ScoreBundle:
        result = self.run_program("solution.py")
        return float(result.stdout.strip())

This auto-discovery path still works but emits a DeprecationWarning on load. Once your grader is non-trivial, migrate it to a package (next section). The same TaskGrader API applies in both setups.

Packaging your grader

Wrap your grader in a small Python package so CORAL can install it into the grader venv via uv pip install. The repo's examples/erdos/grader/ is a minimal reference.

Layout

my-task/
├── task.yaml
├── seed/
│   └── solution.py
└── grader/
    ├── pyproject.toml
    └── src/my_task_grader/
        ├── __init__.py        # from .grader import Grader
        └── grader.py          # class Grader(TaskGrader): ...

pyproject.toml:

[project]
name = "my-task-grader"
version = "0.1.0"
requires-python = ">=3.11"
dependencies = ["coral", "numpy"]

[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"

[tool.hatch.build.targets.wheel]
packages = ["src/my_task_grader"]

Wire it up in task.yaml

grader:
  entrypoint: "my_task_grader.grader:Grader"
  setup:
    - "uv pip install -e ./grader"
  timeout: 300
  direction: maximize
  args:
    program_file: "solution.py"

When coral start (or coral validate) runs, it:

  1. Creates .coral/private/grader_venv/ via uv venv.
  2. Editable-installs coral into that venv so your package's coral dependency resolves locally.
  3. Runs every shell command under grader.setup with VIRTUAL_ENV / PATH pointed at the grader venv.
  4. Spawns a worker subprocess in that venv whenever an attempt needs grading and ships the result back as JSON.

The venv lives inside .coral/private/, which is denied to agents by the worktree permission rules — your grader source stays hidden.

TaskGrader helpers

TaskGrader provides several helper methods. They behave the same whether your grader runs from a package or from eval/grader.py.

run_program(filename, *args, timeout=300)

Run a file from the agent's codebase:

result = self.run_program("solution.py", "--input", "data.csv")
print(result.stdout)    # captured stdout
print(result.stderr)    # captured stderr
print(result.returncode)  # exit code
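The result object exposes stdout, stderr, and returncode, much like subprocess.CompletedProcess. If you want to prototype equivalent logic outside CORAL, you can approximate it with the standard library (an approximation only; CORAL runs the file from the agent's worktree inside its own sandbox):

```python
import subprocess
import sys

# Stand-in for run_program("solution.py"): run a program, capture its
# output as text, and enforce a timeout.
result = subprocess.run(
    [sys.executable, "-c", "print(6 * 7)"],  # stand-in for solution.py
    capture_output=True,
    text=True,
    timeout=300,
)
print(result.stdout.strip(), result.returncode)
```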

read_eval(relative_path) / read_eval_path(relative_path)

These read from the legacy private eval/ directory. When your grader ships as a package, prefer importlib.resources to load bundled data:

import importlib.resources

data_dir = str(importlib.resources.files("my_task_grader.data"))
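For example, a small helper that reads a text resource bundled in any installed package (the my_task_grader/data layout below is hypothetical):

```python
import importlib.resources

def load_bundled_text(package: str, resource: str) -> str:
    # resource is a path relative to the installed package's root.
    return importlib.resources.files(package).joinpath(resource).read_text()

# In a packaged grader, something like:
# expected = load_bundled_text("my_task_grader", "data/expected.json")
```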

score(value, explanation="")

Return a scored result with optional feedback:

return self.score(0.85, "Runtime: 1.2s (target: < 1.0s)")

fail(explanation="")

Return a failed evaluation (null score):

if result.returncode != 0:
    return self.fail(f"Program crashed: {result.stderr[:200]}")

Available context

Inside evaluate(), you have access to:

Attribute           Description
self.codebase_path  Absolute path to the agent's worktree
self.private_dir    Absolute path to .coral/private/
self.args           Dict of extra arguments from grader.args in config

Handling errors

class Grader(TaskGrader):
    def evaluate(self) -> float | ScoreBundle:
        result = self.run_program("solution.py")

        if result.returncode != 0:
            return self.fail(f"Crashed: {result.stderr[:300]}")

        try:
            output = float(result.stdout.strip())
        except ValueError:
            return self.fail(f"Invalid output: {result.stdout[:100]}")

        if output < 0:
            return self.fail(f"Score must be non-negative, got {output}")

        return self.score(output, f"Score: {output:.4f}")

Score direction

By default, higher scores are better. For tasks where lower is better:

grader:
  direction: minimize
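A minimize grader might, for example, score wall-clock runtime, where faster submissions win. A sketch (the timing helper is our own, not part of TaskGrader):

```python
import time

def elapsed_seconds(fn) -> float:
    # Time a single call; in a real grader you would wrap self.run_program(...).
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start

# Inside evaluate(), with direction: minimize in task.yaml:
# runtime = elapsed_seconds(lambda: self.run_program("solution.py"))
# return self.score(runtime, f"Runtime: {runtime:.2f}s")
```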

Private files

Files listed in grader.private are copied to .coral/private/ and hidden from agents. Use this for test data, expected outputs, or reference implementations that don't fit inside your grader package:

grader:
  private:
    - "answers/test_cases.json"
    - "answers/reference_output.txt"

Access them in your grader via self.private_dir:

from pathlib import Path

expected = (Path(self.private_dir) / "answers" / "test_cases.json").read_text()
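When comparing an agent's output against a private reference file, it helps to normalize both sides first so a trailing newline or stray whitespace doesn't fail the attempt. A sketch (outputs_match is our own helper, not a TaskGrader method):

```python
def outputs_match(actual: str, expected: str) -> bool:
    # Compare line by line, ignoring trailing whitespace and a final newline.
    def norm(text: str) -> list[str]:
        return [line.rstrip() for line in text.strip().splitlines()]
    return norm(actual) == norm(expected)

# In evaluate():
# result = self.run_program("solution.py")
# expected = (Path(self.private_dir) / "answers" / "reference_output.txt").read_text()
# if not outputs_match(result.stdout, expected):
#     return self.fail("Output does not match the reference")
# return self.score(1.0)
```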

Grader timeout

grader:
  timeout: 600   # 10 minutes

If the grader exceeds this timeout, the worker subprocess is killed and the attempt is recorded as a failure with a "timed out" explanation.

Validating your grader

coral validate my-task

This builds the grader venv, runs grader.setup, then evaluates the seed code against your grader and prints the result. Fix any issues before running coral start.

Migrating from grader.type / grader.module

These fields have been removed. Tasks that still set them will fail to load with a clear migration error. Replace the legacy block:

# old (no longer accepted)
grader:
  type: function
  module: my_module
  args: {func_name: grade}

with the entrypoint form:

grader:
  entrypoint: "my_pkg.grader:Grader"
  setup:
    - "uv pip install -e ./grader"

If you were using FunctionGrader to wrap a plain callable, rewrite it as a small TaskGrader subclass and ship it inside your grader package.
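For example, a plain callable that parsed a score from stdout might become the following (a sketch; adapt the file name and parsing to your task):

```python
from coral.grader import TaskGrader

def grade(stdout: str) -> float:
    # The old plain callable, unchanged -- now invoked by the class below.
    return float(stdout.strip())

class Grader(TaskGrader):
    def evaluate(self):
        result = self.run_program("solution.py")
        if result.returncode != 0:
            return self.fail(f"Crashed: {result.stderr[:200]}")
        return self.score(grade(result.stdout))
```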