Evaluator

gepa.optimize_anything.Evaluator

Bases: Protocol

Functions

__call__(candidate: str | Candidate, example: object | None = None, **kwargs: Any) -> tuple[float, SideInfo] | float

Score a candidate, returning a score and diagnostic side information (ASI).

This is the function you write. GEPA calls it repeatedly with mutated candidates and collects the returned scores and diagnostics to drive the optimization loop.

Parameters:

candidate : str | Candidate (required)
    The text parameter(s) to evaluate. Type matches seed_candidate:
    plain str if you passed a string, or dict[str, str] if you passed
    a dict.

example : object | None (default: None)
    One item from dataset. Not passed in single-task search mode
    (dataset=None).

opt_state : OptimizationState (optional)
    Declare opt_state: OptimizationState in your signature to receive
    historical best evaluations for warm-starting. This is a reserved
    parameter name — GEPA injects it automatically. See the sketch
    after this table.
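
For instance, a minimal sketch of opting in to opt_state. The reserved
parameter name and the injection behavior are stated above; how you consume
the state is up to you, and OptimizationState's attributes are not shown
here (verify them in gepa/optimize_anything.py). run_code() and
compute_score() are stand-ins borrowed from the examples further down:

def evaluate(candidate, opt_state=None):  # "opt_state" is the reserved name; GEPA injects it
    if opt_state is not None:
        ...  # warm-start from historical best evaluations (attributes omitted here)
    result = run_code(candidate["code"])  # run_code() is a stand-in
    return compute_score(result)          # compute_score() is a stand-in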

Returns:

tuple[float, SideInfo] | float
    Either (score, side_info) or just score:

    • score (float): Higher is better.
    • side_info (SideInfo): Diagnostic dict (ASI) powering LLM
      reflection.

    If you return score-only, use oa.log() or capture_stdio=True to
    provide diagnostic context.
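
For example, a score-only evaluator can lean on captured stdout for
diagnostics, assuming capture_stdio=True is set on the optimizer as
described above. run() and the example fields are stand-ins taken from the
multi-task example below:

def evaluate(candidate, example):
    pred = run(candidate["prompt"], example["input"])    # run() is a stand-in
    print(f"input={example['input']!r} pred={pred!r}")   # captured as ASI when capture_stdio=True
    return 1.0 if pred == example["expected"] else 0.0   # score only; no side_info dict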

Evaluator signature per mode:

Single-Task Search (dataset=None) — called without example:

import gepa.optimize_anything as oa

def evaluate(candidate):
    result = run_code(candidate["code"])
    oa.log(f"Output: {result}")      # ASI via oa.log()
    return compute_score(result)

Multi-Task Search / Generalization (dataset provided) — called per example:

def evaluate(candidate, example):
    pred = run(candidate["prompt"], example["input"])
    score = 1.0 if pred == example["expected"] else 0.0
    return score, {"Input": example["input"], "Output": pred}
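
As a usage sketch, the multi-task evaluator above might be wired into the
optimizer like this. The entry-point name and keyword names are assumptions
inferred from parameters this page mentions (seed_candidate, dataset,
capture_stdio); check the gepa package for the actual signature:

import gepa.optimize_anything as oa

# Hypothetical invocation: the function name and kwargs below are
# assumptions, not confirmed by this page.
result = oa.optimize_anything(
    evaluator=evaluate,                           # the evaluator defined above
    seed_candidate={"prompt": "Answer the question."},
    dataset=[{"input": "2+2", "expected": "4"}],
    capture_stdio=True,                           # auto-capture stdout/stderr as ASI
)
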
Reserved side_info keys:

"log", "stdout", "stderr" are used by GEPA's automatic capture. If you
use them, GEPA stores captured output under "_gepa_log" etc. to avoid
collisions.

Source code in gepa/optimize_anything.py
def __call__(
    self, candidate: str | Candidate, example: object | None = None, **kwargs: Any
) -> tuple[float, SideInfo] | float:
    """Score a candidate, returning a score and diagnostic side information (ASI).

    This is the function you write.  GEPA calls it repeatedly with
    mutated candidates and collects the returned scores and diagnostics
    to drive the optimization loop.

    Args:
        candidate: The text parameter(s) to evaluate.  Type matches
            ``seed_candidate``: plain ``str`` if you passed a string,
            or ``dict[str, str]`` if you passed a dict.
        example: One item from ``dataset``.  **Not passed** in
            single-task search mode (``dataset=None``).
        opt_state: *(optional)* Declare ``opt_state: OptimizationState``
            in your signature to receive historical best evaluations
            for warm-starting.  This is a **reserved parameter name**
            — GEPA injects it automatically.

    Returns:
        Either ``(score, side_info)`` or just ``score``:

        - **score** (float): Higher is better.
        - **side_info** (:data:`SideInfo`): Diagnostic dict (ASI)
          powering LLM reflection.

        If you return score-only, use ``oa.log()`` or
        ``capture_stdio=True`` to provide diagnostic context.

    Evaluator signature per mode:

    **Single-Task Search** (``dataset=None``) — called without ``example``::

        import gepa.optimize_anything as oa

        def evaluate(candidate):
            result = run_code(candidate["code"])
            oa.log(f"Output: {result}")      # ASI via oa.log()
            return compute_score(result)

    **Multi-Task Search / Generalization** (``dataset`` provided) — called per example::

        def evaluate(candidate, example):
            pred = run(candidate["prompt"], example["input"])
            score = 1.0 if pred == example["expected"] else 0.0
            return score, {"Input": example["input"], "Output": pred}

    Reserved side_info keys:
        ``"log"``, ``"stdout"``, ``"stderr"`` are used by GEPA's
        automatic capture.  If you use them, GEPA stores captured
        output under ``"_gepa_log"`` etc. to avoid collisions.
    """
    ...