ConfidenceAdapter

gepa.adapters.confidence_adapter.confidence_adapter.ConfidenceAdapter(model: str | ChatCompletionCallable, field_path: str, response_format: dict[str, Any] | None = None, response_schema: type | dict[str, Any] | None = None, scoring_strategy: ScoringStrategy | None = None, answer_field: str | None = None, high_confidence_threshold: float = 0.99, low_confidence_threshold: float = 0.9, top_logprobs: int = 5, failure_score: float = 0.0, max_litellm_workers: int = 10, litellm_batch_completion_kwargs: dict[str, Any] | None = None)

Bases: GEPAAdapter[ConfidenceDataInst, ConfidenceTrajectory, ConfidenceRolloutOutput]

GEPA adapter for structured-output classification with logprob confidence.

This adapter is specifically designed for classification tasks where the LLM returns a structured JSON output with an enum-constrained field. The enum constraint is critical: it forces the model to choose from a closed set of categories, and the logprobs then represent the model's true probability distribution over those categories.

The adapter owns the full LLM call lifecycle:

  1. Sends requests with logprobs=True and response_format
  2. Parses the structured JSON output
  3. Extracts the joint logprob (sum of per-token logprobs) for the target field via llm-structured-confidence
  4. Computes a blended score via the pluggable ScoringStrategy
  5. Generates rich reflective feedback with confidence details

Why joint logprob?

The joint_logprob is the sum of all per-token logprobs for the value tokens of the target field. For example, if the model outputs "Bills/Electricity" and the tokens are ["Bills", "/", "Elec", "tricity"] with logprobs [-0.02, -0.01, -0.10, -0.01], the joint logprob is -0.14.

This is the most natural confidence measure because:

  • It captures the total uncertainty across all tokens (not just the average), so longer values with one uncertain token are correctly penalised.
  • exp(joint_logprob) gives the joint probability -- the probability the model assigns to the entire value as a whole.
  • It is numerically stable and works well across different tokenisations.
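The arithmetic above can be checked directly; a minimal sketch using the token strings and logprob values from the example (not from a live API call):

```python
import math

# Per-token logprobs for the value "Bills/Electricity", as in the example above.
tokens = ["Bills", "/", "Elec", "tricity"]
token_logprobs = [-0.02, -0.01, -0.10, -0.01]

# Joint logprob: the sum of per-token logprobs for the whole field value.
joint_logprob = sum(token_logprobs)

# exp(joint_logprob) is the probability the model assigns to the entire value.
joint_probability = math.exp(joint_logprob)

print(f"joint logprob: {joint_logprob:.2f}")          # -0.14
print(f"joint probability: {joint_probability:.3f}")  # 0.869
```

Note how the one uncertain token ("Elec" at -0.10) dominates the joint logprob, which is exactly the penalisation behaviour described above.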

Parameters

  • model: Either a litellm model string (e.g. "openai/gpt-4.1-mini") or a callable that takes messages and returns the full response object (must include logprobs).
  • field_path: Path to the target field in the JSON response, using the syntax of llm-structured-confidence: "category_name" for a top-level field, "classification.name" for a nested field, or "results[].category" for an array of objects.
  • response_format: JSON schema dict passed to litellm as response_format. Should define enum constraints on the target field for meaningful confidence extraction. Required when model is a string.
  • response_schema: Optional Pydantic model or dict schema passed to extract_logprobs(response_schema=…) for enum resolution. When provided, TopAlternative.resolved_value maps token prefixes back to full enum values (e.g. "Pos" -> "Positive").
  • scoring_strategy: How to blend correctness and logprob confidence into a single score. Defaults to LinearBlendScoring.
  • answer_field: JSON field path used to extract the answer from the response text. Defaults to field_path (they are usually the same).
  • high_confidence_threshold: Probability threshold (in (0, 1]) above which a prediction is labelled "high confidence" in reflective feedback. Models using structured output with enum constraints typically produce probabilities above 95%, so this should be set high (e.g. 0.99) to produce useful feedback gradients. Default 0.99.
  • low_confidence_threshold: Probability threshold (in (0, 1)) below which a correct prediction is labelled "unreliable" in reflective feedback. Default 0.90.
  • top_logprobs: Number of top logprobs to request from the LLM (1-20).
  • failure_score: Score assigned when an example fails (parse error, API error, etc.).
  • max_litellm_workers: Concurrency for litellm calls (only used when model is a string).
  • litellm_batch_completion_kwargs: Extra keyword arguments forwarded to every litellm.batch_completion call (e.g. temperature, max_tokens).
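A response_format with an enum-constrained target field might look like the following sketch. The category names and schema name are illustrative, not from the library; the outer shape follows the OpenAI-style structured-output format that litellm accepts:

```python
# Illustrative response_format: the enum constraint on "category_name" is
# what makes the extracted logprobs a true distribution over a closed set
# of categories.
response_format = {
    "type": "json_schema",
    "json_schema": {
        "name": "classification",
        "schema": {
            "type": "object",
            "properties": {
                "category_name": {
                    "type": "string",
                    "enum": ["Bills/Electricity", "Bills/Water", "Groceries"],
                }
            },
            "required": ["category_name"],
            "additionalProperties": False,
        },
        "strict": True,
    },
}

# field_path would then be "category_name" (a top-level field).
field_path = "category_name"
```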

Source code in gepa/adapters/confidence_adapter/confidence_adapter.py
def __init__(
    self,
    model: str | ChatCompletionCallable,
    field_path: str,
    response_format: dict[str, Any] | None = None,
    response_schema: type | dict[str, Any] | None = None,
    scoring_strategy: ScoringStrategy | None = None,
    answer_field: str | None = None,
    high_confidence_threshold: float = 0.99,
    low_confidence_threshold: float = 0.90,
    top_logprobs: int = 5,
    failure_score: float = 0.0,
    max_litellm_workers: int = 10,
    litellm_batch_completion_kwargs: dict[str, Any] | None = None,
) -> None:
    if isinstance(model, str):
        import litellm

        self.litellm = litellm
        if response_format is None:
            raise ValueError(
                "response_format is required when model is a string (LiteLLM path). "
                "Provide a JSON schema with enum constraints for structured classification output."
            )
    self.model = model
    self.field_path = field_path
    self.response_format = response_format
    self.response_schema = response_schema
    self.scoring_strategy: ScoringStrategy = scoring_strategy or LinearBlendScoring(
        low_confidence_threshold=high_confidence_threshold,
    )
    self.answer_field = answer_field or field_path
    self.high_confidence_threshold = high_confidence_threshold
    self.low_confidence_threshold = low_confidence_threshold
    self.top_logprobs = top_logprobs
    self.failure_score = failure_score
    self.max_litellm_workers = max_litellm_workers
    self.litellm_batch_completion_kwargs = litellm_batch_completion_kwargs or {}

Attributes

propose_new_texts: ProposalFn | None = None class-attribute instance-attribute

litellm = litellm instance-attribute

model = model instance-attribute

field_path = field_path instance-attribute

response_format = response_format instance-attribute

response_schema = response_schema instance-attribute

scoring_strategy: ScoringStrategy = scoring_strategy or LinearBlendScoring(low_confidence_threshold=high_confidence_threshold) instance-attribute

answer_field = answer_field or field_path instance-attribute

high_confidence_threshold = high_confidence_threshold instance-attribute

low_confidence_threshold = low_confidence_threshold instance-attribute

top_logprobs = top_logprobs instance-attribute

failure_score = failure_score instance-attribute

max_litellm_workers = max_litellm_workers instance-attribute

litellm_batch_completion_kwargs = litellm_batch_completion_kwargs or {} instance-attribute

Functions

evaluate(batch: list[ConfidenceDataInst], candidate: dict[str, str], capture_traces: bool = False) -> EvaluationBatch[ConfidenceTrajectory, ConfidenceRolloutOutput]

Run candidate on batch, extracting logprob confidence.

Uses litellm.batch_completion for parallel LLM calls, then post-processes each response to extract logprobs and build feedback.

Source code in gepa/adapters/confidence_adapter/confidence_adapter.py
def evaluate(
    self,
    batch: list[ConfidenceDataInst],
    candidate: dict[str, str],
    capture_traces: bool = False,
) -> EvaluationBatch[ConfidenceTrajectory, ConfidenceRolloutOutput]:
    """Run *candidate* on *batch*, extracting logprob confidence.

    Uses ``litellm.batch_completion`` for parallel LLM calls, then
    post-processes each response to extract logprobs and build feedback.
    """
    system_content = next(iter(candidate.values()))

    all_messages: list[list[ChatMessage]] = [
        [
            {"role": "system", "content": system_content},
            {"role": "user", "content": data["input"]},
        ]
        for data in batch
    ]

    if isinstance(self.model, str):
        batch_kwargs: dict[str, Any] = {
            "model": self.model,
            "messages": all_messages,
            "max_workers": self.max_litellm_workers,
            "logprobs": True,
            "top_logprobs": self.top_logprobs,
            **self.litellm_batch_completion_kwargs,
        }
        if self.response_format is not None:
            batch_kwargs["response_format"] = self.response_format
        responses = list(self.litellm.batch_completion(**batch_kwargs))
    else:
        responses: list[Any] = []
        for msgs in all_messages:
            try:
                responses.append(self.model(msgs))
            except Exception as exc:
                responses.append(exc)

    outputs: list[ConfidenceRolloutOutput] = []
    scores: list[float] = []
    objective_scores_list: list[dict[str, float]] = []
    trajectories: list[ConfidenceTrajectory] | None = [] if capture_traces else None

    for data, response in zip(batch, responses, strict=True):
        output, score, obj_scores, trajectory = self._process_response(
            response,
            data,
            capture_traces,
        )
        outputs.append(output)
        scores.append(score)
        objective_scores_list.append(obj_scores)
        if trajectories is not None and trajectory is not None:
            trajectories.append(trajectory)

    return EvaluationBatch(
        outputs=outputs,
        scores=scores,
        trajectories=trajectories,
        objective_scores=objective_scores_list,
    )
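When model is a callable rather than a litellm string, evaluate invokes it once per message list, as the loop above shows. A minimal sketch of the expected callable shape; the fake response object below is illustrative and only mimics the parts of a litellm/OpenAI-style response that a logprob extractor would read:

```python
from types import SimpleNamespace

def fake_chat_completion(messages):
    """Callable-model sketch: takes a list of chat messages and returns a
    full response object that includes logprobs."""
    # Token-level logprobs for the structured JSON output (illustrative values).
    token_logprobs = [
        SimpleNamespace(token=tok, logprob=lp)
        for tok, lp in [
            ('{"', -0.001),
            ("category", -0.001),
            ('_name":"', -0.001),
            ("Groceries", -0.05),
            ('"}', -0.001),
        ]
    ]
    choice = SimpleNamespace(
        message=SimpleNamespace(content='{"category_name": "Groceries"}'),
        logprobs=SimpleNamespace(content=token_logprobs),
    )
    return SimpleNamespace(choices=[choice])

resp = fake_chat_completion(
    [{"role": "user", "content": "weekly shop at the market"}]
)
print(resp.choices[0].message.content)  # {"category_name": "Groceries"}
```

Because the adapter wraps each call in try/except, a callable that raises simply contributes the exception to the responses list, where per-example processing can score it as a failure.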

make_reflective_dataset(candidate: dict[str, str], eval_batch: EvaluationBatch[ConfidenceTrajectory, ConfidenceRolloutOutput], components_to_update: list[str]) -> Mapping[str, Sequence[Mapping[str, Any]]]

Build reflective dataset with confidence-enriched feedback.

The feedback tells the reflection LLM why the task model was uncertain, not just whether it was correct. This enables GEPA to evolve prompts that resolve specific ambiguities between categories.

Each record in the dataset contains:

  • Inputs: the original user input text.
  • Generated Outputs: the model's answer annotated with probability.
  • Feedback: a diagnosis including the probability, the top competing alternatives, and guidance for what the prompt should improve.

Source code in gepa/adapters/confidence_adapter/confidence_adapter.py
def make_reflective_dataset(
    self,
    candidate: dict[str, str],
    eval_batch: EvaluationBatch[ConfidenceTrajectory, ConfidenceRolloutOutput],
    components_to_update: list[str],
) -> Mapping[str, Sequence[Mapping[str, Any]]]:
    """Build reflective dataset with confidence-enriched feedback.

    The feedback tells the reflection LLM *why* the task model was
    uncertain, not just whether it was correct.  This enables GEPA to
    evolve prompts that resolve specific ambiguities between categories.

    Each record in the dataset contains:

    * **Inputs**: the original user input text.
    * **Generated Outputs**: the model's answer annotated with
      probability.
    * **Feedback**: a diagnosis including the probability, the top
      competing alternatives, and guidance for what the prompt should
      improve.
    """
    assert len(components_to_update) == 1
    comp = components_to_update[0]

    trajectories = eval_batch.trajectories
    assert trajectories is not None, "Trajectories are required to build a reflective dataset."

    items: list[ConfidenceReflectiveRecord] = []
    for traj in trajectories:
        generated = traj["parsed_value"] or traj["full_assistant_response"]
        if traj["logprob_score"] is not None:
            probability = math.exp(traj["logprob_score"])
            generated += f" ({probability:.0%} probability)"

        items.append(
            {
                "Inputs": traj["data"]["input"],
                "Generated Outputs": generated,
                "Feedback": traj["feedback"],
            }
        )

    if not items:
        raise Exception("No valid predictions found for any module.")

    return {comp: items}
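The probability annotation on Generated Outputs follows the formatting in the loop above; a quick standalone check with illustrative values (parsed value and logprob are made up for the demonstration):

```python
import math

parsed_value = "Bills/Electricity"
logprob_score = -0.14  # joint logprob recorded in the trajectory

# Same annotation logic as the loop above: append the probability as a percent.
generated = parsed_value
if logprob_score is not None:
    probability = math.exp(logprob_score)
    generated += f" ({probability:.0%} probability)"

print(generated)  # Bills/Electricity (87% probability)
```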