ConfidenceAdapter

gepa.adapters.confidence_adapter.confidence_adapter.ConfidenceAdapter(model: str | ChatCompletionCallable, field_path: str, response_format: dict[str, Any] | None = None, response_schema: type | dict[str, Any] | None = None, scoring_strategy: ScoringStrategy | None = None, answer_field: str | None = None, high_confidence_threshold: float = 0.99, low_confidence_threshold: float = 0.9, top_logprobs: int = 5, failure_score: float = 0.0, max_litellm_workers: int = 10, litellm_batch_completion_kwargs: dict[str, Any] | None = None)

Bases: GEPAAdapter[ConfidenceDataInst, ConfidenceTrajectory, ConfidenceRolloutOutput]

GEPA adapter for structured-output classification with logprob confidence.

This adapter is specifically designed for classification tasks where the LLM returns a structured JSON output with an enum-constrained field. The enum constraint is critical: it forces the model to choose from a closed set of categories, and the logprobs then represent the model's true probability distribution over those categories.

The adapter owns the full LLM call lifecycle:

  1. Sends requests with logprobs=True and response_format
  2. Parses the structured JSON output
  3. Extracts the joint logprob (sum of per-token logprobs) for the target field via llm-structured-confidence
  4. Computes a blended score via the pluggable ScoringStrategy
  5. Generates rich reflective feedback with confidence details

Why joint logprob?

The joint_logprob is the sum of all per-token logprobs for the value tokens of the target field. For example, if the model outputs "Bills/Electricity" and the tokens are ["Bills", "/", "Elec", "tricity"] with logprobs [-0.02, -0.01, -0.10, -0.01], the joint logprob is -0.14.

This is the most natural confidence measure because:

  • It captures the total uncertainty across all tokens (not just the average), so longer values with one uncertain token are correctly penalised.
  • exp(joint_logprob) gives the joint probability -- the probability the model assigns to the entire value as a whole.
  • It is numerically stable and works well across different tokenisations.
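The arithmetic above can be checked directly; a minimal sketch using the token strings and logprob values from the example (not from a live API call):

```python
import math

# Per-token logprobs for the value "Bills/Electricity", as in the example above.
tokens = ["Bills", "/", "Elec", "tricity"]
token_logprobs = [-0.02, -0.01, -0.10, -0.01]

# Joint logprob: the sum of per-token logprobs for the whole field value.
joint_logprob = sum(token_logprobs)

# exp(joint_logprob) is the probability the model assigns to the entire value.
joint_probability = math.exp(joint_logprob)

print(f"joint logprob: {joint_logprob:.2f}")          # -0.14
print(f"joint probability: {joint_probability:.3f}")  # 0.869
```

Note how the one uncertain token ("Elec" at -0.10) dominates the joint logprob, which is exactly the penalisation behaviour described above.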

Parameters

  • model: Either a litellm model string (e.g. "openai/gpt-4.1-mini") or a callable that takes messages and returns the full response object (must include logprobs).
  • field_path: Path to the target field in the JSON response, using the syntax of llm-structured-confidence: "category_name" for a top-level field, "classification.name" for a nested field, or "results[].category" for an array of objects.
  • response_format: JSON schema dict passed to litellm as response_format. Should define enum constraints on the target field for meaningful confidence extraction. Required when model is a string.
  • response_schema: Optional Pydantic model or dict schema passed to extract_logprobs(response_schema=…) for enum resolution. When provided, TopAlternative.resolved_value maps token prefixes back to full enum values (e.g. "Pos" -> "Positive").
  • scoring_strategy: How to blend correctness and logprob confidence into a single score. Defaults to LinearBlendScoring.
  • answer_field: JSON field path used to extract the answer from the response text. Defaults to field_path (they are usually the same).
  • high_confidence_threshold: Probability threshold (in (0, 1]) above which a prediction is labelled "high confidence" in reflective feedback. Models using structured output with enum constraints typically produce probabilities above 95%, so this should be set high (e.g. 0.99) to produce useful feedback gradients. Default 0.99.
  • low_confidence_threshold: Probability threshold (in (0, 1)) below which a correct prediction is labelled "unreliable" in reflective feedback. Default 0.90.
  • top_logprobs: Number of top logprobs to request from the LLM (1-20).
  • failure_score: Score assigned when an example fails (parse error, API error, etc.).
  • max_litellm_workers: Concurrency for litellm calls (only used when model is a string).
  • litellm_batch_completion_kwargs: Extra keyword arguments forwarded to every litellm.batch_completion call (e.g. temperature, max_tokens).
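A response_format with an enum-constrained target field might look like the following sketch. The category names and schema name are illustrative, not from the library; the outer shape follows the OpenAI-style structured-output format that litellm accepts:

```python
# Illustrative response_format: the enum constraint on "category_name" is
# what makes the extracted logprobs a true distribution over a closed set
# of categories.
response_format = {
    "type": "json_schema",
    "json_schema": {
        "name": "classification",
        "schema": {
            "type": "object",
            "properties": {
                "category_name": {
                    "type": "string",
                    "enum": ["Bills/Electricity", "Bills/Water", "Groceries"],
                }
            },
            "required": ["category_name"],
            "additionalProperties": False,
        },
        "strict": True,
    },
}

# field_path would then be "category_name" (a top-level field).
field_path = "category_name"
```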

Source code in gepa/adapters/confidence_adapter/confidence_adapter.py
def __init__(
    self,
    model: str | ChatCompletionCallable,
    field_path: str,
    response_format: dict[str, Any] | None = None,
    response_schema: type | dict[str, Any] | None = None,
    scoring_strategy: ScoringStrategy | None = None,
    answer_field: str | None = None,
    high_confidence_threshold: float = 0.99,
    low_confidence_threshold: float = 0.90,
    top_logprobs: int = 5,
    failure_score: float = 0.0,
    max_litellm_workers: int = 10,
    litellm_batch_completion_kwargs: dict[str, Any] | None = None,
) -> None:
    if isinstance(model, str):
        import litellm

        self.litellm = litellm
        if response_format is None:
            raise ValueError(
                "response_format is required when model is a string (LiteLLM path). "
                "Provide a JSON schema with enum constraints for structured classification output."
            )
    self.model = model
    self.field_path = field_path
    self.response_format = response_format
    self.response_schema = response_schema
    self.scoring_strategy: ScoringStrategy = scoring_strategy or LinearBlendScoring(
        low_confidence_threshold=high_confidence_threshold,
    )
    self.answer_field = answer_field or field_path
    self.high_confidence_threshold = high_confidence_threshold
    self.low_confidence_threshold = low_confidence_threshold
    self.top_logprobs = top_logprobs
    self.failure_score = failure_score
    self.max_litellm_workers = max_litellm_workers
    self.litellm_batch_completion_kwargs = litellm_batch_completion_kwargs or {}

Attributes

propose_new_texts: ProposalFn | None = None class-attribute instance-attribute

litellm = litellm instance-attribute

model = model instance-attribute

field_path = field_path instance-attribute

response_format = response_format instance-attribute

response_schema = response_schema instance-attribute

scoring_strategy: ScoringStrategy = scoring_strategy or LinearBlendScoring(low_confidence_threshold=high_confidence_threshold) instance-attribute

answer_field = answer_field or field_path instance-attribute

high_confidence_threshold = high_confidence_threshold instance-attribute

low_confidence_threshold = low_confidence_threshold instance-attribute

top_logprobs = top_logprobs instance-attribute

failure_score = failure_score instance-attribute

max_litellm_workers = max_litellm_workers instance-attribute

litellm_batch_completion_kwargs = litellm_batch_completion_kwargs or {} instance-attribute

Functions

evaluate(batch: list[ConfidenceDataInst], candidate: dict[str, str], capture_traces: bool = False) -> EvaluationBatch[ConfidenceTrajectory, ConfidenceRolloutOutput]

Run candidate on batch, extracting logprob confidence.

Uses litellm.batch_completion for parallel LLM calls, then post-processes each response to extract logprobs and build feedback.

Source code in gepa/adapters/confidence_adapter/confidence_adapter.py
def evaluate(
    self,
    batch: list[ConfidenceDataInst],
    candidate: dict[str, str],
    capture_traces: bool = False,
) -> EvaluationBatch[ConfidenceTrajectory, ConfidenceRolloutOutput]:
    """Run *candidate* on *batch*, extracting logprob confidence.

    Uses ``litellm.batch_completion`` for parallel LLM calls, then
    post-processes each response to extract logprobs and build feedback.
    """
    system_content = next(iter(candidate.values()))

    all_messages: list[list[ChatMessage]] = [
        [
            {"role": "system", "content": system_content},
            {"role": "user", "content": data["input"]},
        ]
        for data in batch
    ]

    if isinstance(self.model, str):
        batch_kwargs: dict[str, Any] = {
            "model": self.model,
            "messages": all_messages,
            "max_workers": self.max_litellm_workers,
            "logprobs": True,
            "top_logprobs": self.top_logprobs,
            **self.litellm_batch_completion_kwargs,
        }
        if self.response_format is not None:
            batch_kwargs["response_format"] = self.response_format
        responses = list(self.litellm.batch_completion(**batch_kwargs))
    else:
        responses: list[Any] = []
        for msgs in all_messages:
            try:
                responses.append(self.model(msgs))
            except Exception as exc:
                responses.append(exc)

    outputs: list[ConfidenceRolloutOutput] = []
    scores: list[float] = []
    objective_scores_list: list[dict[str, float]] = []
    trajectories: list[ConfidenceTrajectory] | None = [] if capture_traces else None

    for data, response in zip(batch, responses, strict=True):
        output, score, obj_scores, trajectory = self._process_response(
            response,
            data,
            capture_traces,
        )
        outputs.append(output)
        scores.append(score)
        objective_scores_list.append(obj_scores)
        if trajectories is not None and trajectory is not None:
            trajectories.append(trajectory)

    return EvaluationBatch(
        outputs=outputs,
        scores=scores,
        trajectories=trajectories,
        objective_scores=objective_scores_list,
    )
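When model is a callable rather than a litellm string, evaluate invokes it once per message list, as the loop above shows. A minimal sketch of the expected callable shape; the fake response object below is illustrative and only mimics the parts of a litellm/OpenAI-style response that a logprob extractor would read:

```python
from types import SimpleNamespace

def fake_chat_completion(messages):
    """Callable-model sketch: takes a list of chat messages and returns a
    full response object that includes logprobs."""
    # Token-level logprobs for the structured JSON output (illustrative values).
    token_logprobs = [
        SimpleNamespace(token=tok, logprob=lp)
        for tok, lp in [
            ('{"', -0.001),
            ("category", -0.001),
            ('_name":"', -0.001),
            ("Groceries", -0.05),
            ('"}', -0.001),
        ]
    ]
    choice = SimpleNamespace(
        message=SimpleNamespace(content='{"category_name": "Groceries"}'),
        logprobs=SimpleNamespace(content=token_logprobs),
    )
    return SimpleNamespace(choices=[choice])

resp = fake_chat_completion(
    [{"role": "user", "content": "weekly shop at the market"}]
)
print(resp.choices[0].message.content)  # {"category_name": "Groceries"}
```

Because the adapter wraps each call in try/except, a callable that raises simply contributes the exception to the responses list, where per-example processing can score it as a failure.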

make_reflective_dataset(candidate: dict[str, str], eval_batch: EvaluationBatch[ConfidenceTrajectory, ConfidenceRolloutOutput], components_to_update: list[str]) -> Mapping[str, Sequence[Mapping[str, Any]]]

Build reflective dataset with confidence-enriched feedback.

The feedback tells the reflection LLM why the task model was uncertain, not just whether it was correct. This enables GEPA to evolve prompts that resolve specific ambiguities between categories.

Each record in the dataset contains:

  • Inputs: the original user input text.
  • Generated Outputs: the model's answer annotated with probability.
  • Feedback: a diagnosis including the probability, the top competing alternatives, and guidance for what the prompt should improve.

Source code in gepa/adapters/confidence_adapter/confidence_adapter.py
def make_reflective_dataset(
    self,
    candidate: dict[str, str],
    eval_batch: EvaluationBatch[ConfidenceTrajectory, ConfidenceRolloutOutput],
    components_to_update: list[str],
) -> Mapping[str, Sequence[Mapping[str, Any]]]:
    """Build reflective dataset with confidence-enriched feedback.

    The feedback tells the reflection LLM *why* the task model was
    uncertain, not just whether it was correct.  This enables GEPA to
    evolve prompts that resolve specific ambiguities between categories.

    Each record in the dataset contains:

    * **Inputs**: the original user input text.
    * **Generated Outputs**: the model's answer annotated with
      probability.
    * **Feedback**: a diagnosis including the probability, the top
      competing alternatives, and guidance for what the prompt should
      improve.
    """
    assert len(components_to_update) == 1
    comp = components_to_update[0]

    trajectories = eval_batch.trajectories
    assert trajectories is not None, "Trajectories are required to build a reflective dataset."

    items: list[ConfidenceReflectiveRecord] = []
    for traj in trajectories:
        generated = traj["parsed_value"] or traj["full_assistant_response"]
        if traj["logprob_score"] is not None:
            probability = math.exp(traj["logprob_score"])
            generated += f" ({probability:.0%} probability)"

        items.append(
            {
                "Inputs": traj["data"]["input"],
                "Generated Outputs": generated,
                "Feedback": traj["feedback"],
            }
        )

    if not items:
        raise Exception("No valid predictions found for any module.")

    return {comp: items}
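The probability annotation on Generated Outputs follows the formatting in the loop above; a quick standalone check with illustrative values (parsed value and logprob are made up for the demonstration):

```python
import math

parsed_value = "Bills/Electricity"
logprob_score = -0.14  # joint logprob recorded in the trajectory

# Same annotation logic as the loop above: append the probability as a percent.
generated = parsed_value
if logprob_score is not None:
    probability = math.exp(logprob_score)
    generated += f" ({probability:.0%} probability)"

print(generated)  # Bills/Electricity (87% probability)
```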