Evaluator

gepa.optimize_anything.Evaluator

Bases: Protocol

Functions

__call__(candidate: str | Candidate, example: object | None = None, **kwargs: Any) -> tuple[float, SideInfo] | float

Score a candidate, returning a score and diagnostic side information (ASI).

This is the function you write. GEPA calls it repeatedly with mutated candidates and collects the returned scores and diagnostics to drive the optimization loop.

Parameters:

candidate : str | Candidate (required)
    The text parameter(s) to evaluate. Type matches seed_candidate:
    plain str if you passed a string, or dict[str, str] if you passed
    a dict.

example : object | None (default: None)
    One item from dataset. Not passed in single-task search mode
    (dataset=None).

opt_state : OptimizationState (optional)
    Declare opt_state: OptimizationState in your signature to receive
    historical best evaluations for warm-starting. This is a reserved
    parameter name — GEPA injects it automatically. See the sketch
    after this table.
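
For instance, a minimal sketch of opting in to opt_state. The reserved
parameter name and the injection behavior are stated above; how you consume
the state is up to you, and OptimizationState's attributes are not shown
here (verify them in gepa/optimize_anything.py). run_code() and
compute_score() are stand-ins borrowed from the examples further down:

def evaluate(candidate, opt_state=None):  # "opt_state" is the reserved name; GEPA injects it
    if opt_state is not None:
        ...  # warm-start from historical best evaluations (attributes omitted here)
    result = run_code(candidate["code"])  # run_code() is a stand-in
    return compute_score(result)          # compute_score() is a stand-in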

Returns:

tuple[float, SideInfo] | float
    Either (score, side_info) or just score:

    • score (float): Higher is better.
    • side_info (SideInfo): Diagnostic dict (ASI) powering LLM
      reflection.

    If you return score-only, use oa.log() or capture_stdio=True to
    provide diagnostic context.
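
For example, a score-only evaluator can lean on captured stdout for
diagnostics, assuming capture_stdio=True is set on the optimizer as
described above. run() and the example fields are stand-ins taken from the
multi-task example below:

def evaluate(candidate, example):
    pred = run(candidate["prompt"], example["input"])    # run() is a stand-in
    print(f"input={example['input']!r} pred={pred!r}")   # captured as ASI when capture_stdio=True
    return 1.0 if pred == example["expected"] else 0.0   # score only; no side_info dict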

Evaluator signature per mode:

Single-Task Search (dataset=None) — called without example:

import gepa.optimize_anything as oa

def evaluate(candidate):
    result = run_code(candidate["code"])
    oa.log(f"Output: {result}")      # ASI via oa.log()
    return compute_score(result)

Multi-Task Search / Generalization (dataset provided) — called per example:

def evaluate(candidate, example):
    pred = run(candidate["prompt"], example["input"])
    score = 1.0 if pred == example["expected"] else 0.0
    return score, {"Input": example["input"], "Output": pred}
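
As a usage sketch, the multi-task evaluator above might be wired into the
optimizer like this. The entry-point name and keyword names are assumptions
inferred from parameters this page mentions (seed_candidate, dataset,
capture_stdio); check the gepa package for the actual signature:

import gepa.optimize_anything as oa

# Hypothetical invocation: the function name and kwargs below are
# assumptions, not confirmed by this page.
result = oa.optimize_anything(
    evaluator=evaluate,                           # the evaluator defined above
    seed_candidate={"prompt": "Answer the question."},
    dataset=[{"input": "2+2", "expected": "4"}],
    capture_stdio=True,                           # auto-capture stdout/stderr as ASI
)
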
Reserved side_info keys:

"log", "stdout", "stderr" are used by GEPA's automatic capture. If you
use them, GEPA stores captured output under "_gepa_log" etc. to avoid
collisions.

Source code in gepa/optimize_anything.py
def __call__(
    self, candidate: str | Candidate, example: object | None = None, **kwargs: Any
) -> tuple[float, SideInfo] | float:
    """Score a candidate, returning a score and diagnostic side information (ASI).

    This is the function you write.  GEPA calls it repeatedly with
    mutated candidates and collects the returned scores and diagnostics
    to drive the optimization loop.

    Args:
        candidate: The text parameter(s) to evaluate.  Type matches
            ``seed_candidate``: plain ``str`` if you passed a string,
            or ``dict[str, str]`` if you passed a dict.
        example: One item from ``dataset``.  **Not passed** in
            single-task search mode (``dataset=None``).
        opt_state: *(optional)* Declare ``opt_state: OptimizationState``
            in your signature to receive historical best evaluations
            for warm-starting.  This is a **reserved parameter name**
            — GEPA injects it automatically.

    Returns:
        Either ``(score, side_info)`` or just ``score``:

        - **score** (float): Higher is better.
        - **side_info** (:data:`SideInfo`): Diagnostic dict (ASI)
          powering LLM reflection.

        If you return score-only, use ``oa.log()`` or
        ``capture_stdio=True`` to provide diagnostic context.

    Evaluator signature per mode:

    **Single-Task Search** (``dataset=None``) — called without ``example``::

        import gepa.optimize_anything as oa

        def evaluate(candidate):
            result = run_code(candidate["code"])
            oa.log(f"Output: {result}")      # ASI via oa.log()
            return compute_score(result)

    **Multi-Task Search / Generalization** (``dataset`` provided) — called per example::

        def evaluate(candidate, example):
            pred = run(candidate["prompt"], example["input"])
            score = 1.0 if pred == example["expected"] else 0.0
            return score, {"Input": example["input"], "Output": pred}

    Reserved side_info keys:
        ``"log"``, ``"stdout"``, ``"stderr"`` are used by GEPA's
        automatic capture.  If you use them, GEPA stores captured
        output under ``"_gepa_log"`` etc. to avoid collisions.
    """
    ...