ConfidenceAdapter¶
gepa.adapters.confidence_adapter.confidence_adapter.ConfidenceAdapter(model: str | ChatCompletionCallable, field_path: str, response_format: dict[str, Any] | None = None, response_schema: type | dict[str, Any] | None = None, scoring_strategy: ScoringStrategy | None = None, answer_field: str | None = None, high_confidence_threshold: float = 0.99, low_confidence_threshold: float = 0.9, top_logprobs: int = 5, failure_score: float = 0.0, max_litellm_workers: int = 10, litellm_batch_completion_kwargs: dict[str, Any] | None = None)
¶
Bases: GEPAAdapter[ConfidenceDataInst, ConfidenceTrajectory, ConfidenceRolloutOutput]
GEPA adapter for structured-output classification with logprob confidence.
This adapter is specifically designed for classification tasks where
the LLM returns a structured JSON output with an enum-constrained
field. The enum constraint is critical: it forces the model to
choose from a closed set of categories, and the logprobs then represent
the model's true probability distribution over those categories.
The adapter owns the full LLM call lifecycle:
- Sends requests with logprobs=True and response_format
- Parses the structured JSON output
- Extracts the joint logprob (sum of per-token logprobs) for the target field via llm-structured-confidence
- Computes a blended score via the pluggable ScoringStrategy
- Generates rich reflective feedback with confidence details
Why joint logprob?¶
The joint_logprob is the sum of all per-token logprobs for the
value tokens of the target field. For example, if the model outputs
"Bills/Electricity" and the tokens are ["Bills", "/", "Elec",
"tricity"] with logprobs [-0.02, -0.01, -0.10, -0.01], the
joint logprob is -0.14.
This is the most natural confidence measure because:
- It captures the total uncertainty across all tokens (not just the average), so longer values with one uncertain token are correctly penalised.
- exp(joint_logprob) gives the joint probability -- the probability the model assigns to the entire value as a whole.
- It is numerically stable and works well across different tokenisations.
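The arithmetic above can be sketched in a few lines; the token strings and logprob values are the illustrative ones from the example:

```python
import math

# Per-token logprobs for the value "Bills/Electricity", tokenised as
# ["Bills", "/", "Elec", "tricity"], as in the example above.
token_logprobs = [-0.02, -0.01, -0.10, -0.01]

# The joint logprob is simply the sum of the per-token logprobs.
joint_logprob = sum(token_logprobs)      # -0.14

# exp(joint_logprob) is the probability the model assigns to the
# whole value -- here roughly 0.869.
joint_probability = math.exp(joint_logprob)
```

Note how a single uncertain token (here "Elec" at -0.10) dominates the joint value, which is exactly the penalisation behaviour described above.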
Parameters¶
model:
Either a litellm model string (e.g. "openai/gpt-4.1-mini") or a
callable that takes messages and returns the full response object
(must include logprobs).
field_path:
Path to the target field in the JSON response, using the syntax of
llm-structured-confidence: "category_name" for a top-level
field, "classification.name" for a nested field, or
"results[].category" for an array of objects.
response_format:
JSON schema dict passed to litellm as response_format. Should
define enum constraints on the target field for meaningful
confidence extraction. Required when model is a string.
response_schema:
Optional Pydantic model or dict schema passed to
extract_logprobs(response_schema=…) for enum resolution.
When provided, TopAlternative.resolved_value maps token
prefixes back to full enum values (e.g. "Pos" -> "Positive").
scoring_strategy:
How to blend correctness and logprob confidence into a single score.
Defaults to LinearBlendScoring.
answer_field:
JSON field path used to extract the answer from the response text.
Defaults to field_path (they are usually the same).
high_confidence_threshold:
Probability threshold (in (0, 1]) above which a prediction is
labelled "high confidence" in reflective feedback. Models using
structured output with enum constraints typically produce
probabilities above 95%, so this should be set high (e.g. 0.99)
to produce useful feedback gradients. Default 0.99.
low_confidence_threshold:
Probability threshold (in (0, 1)) below which a correct
prediction is labelled "unreliable" in reflective feedback.
Default 0.90.
top_logprobs:
Number of top logprobs to request from the LLM (1-20).
failure_score:
Score assigned when an example fails (parse error, API error, etc.).
max_litellm_workers:
Concurrency for litellm calls (only used when model is a string).
litellm_batch_completion_kwargs:
Extra keyword arguments forwarded to every litellm.batch_completion
call (e.g. temperature, max_tokens).
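As a minimal sketch of how the parameters fit together, the following builds an enum-constrained response_format for a hypothetical "category" field; the field name and enum values are illustrative, not part of the library:

```python
# Illustrative response_format for an enum-constrained classification
# field named "category" (the field and enum values are made up).
response_format = {
    "type": "json_schema",
    "json_schema": {
        "name": "classification",
        "schema": {
            "type": "object",
            "properties": {
                "category": {
                    "type": "string",
                    "enum": ["Bills/Electricity", "Groceries", "Travel"],
                },
            },
            "required": ["category"],
            "additionalProperties": False,
        },
        "strict": True,
    },
}

# With a litellm model string, the adapter could then be built as:
#
#   adapter = ConfidenceAdapter(
#       model="openai/gpt-4.1-mini",
#       field_path="category",
#       response_format=response_format,
#       high_confidence_threshold=0.99,
#   )
```

The enum constraint is what makes the logprobs meaningful: the model's probability mass is spread over a closed set of categories rather than over arbitrary strings.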
Source code in gepa/adapters/confidence_adapter/confidence_adapter.py
Attributes¶
propose_new_texts: ProposalFn | None = None
class-attribute
instance-attribute
¶
litellm = litellm
instance-attribute
¶
model = model
instance-attribute
¶
field_path = field_path
instance-attribute
¶
response_format = response_format
instance-attribute
¶
response_schema = response_schema
instance-attribute
¶
scoring_strategy: ScoringStrategy = scoring_strategy or LinearBlendScoring(low_confidence_threshold=high_confidence_threshold)
instance-attribute
¶
answer_field = answer_field or field_path
instance-attribute
¶
high_confidence_threshold = high_confidence_threshold
instance-attribute
¶
low_confidence_threshold = low_confidence_threshold
instance-attribute
¶
top_logprobs = top_logprobs
instance-attribute
¶
failure_score = failure_score
instance-attribute
¶
max_litellm_workers = max_litellm_workers
instance-attribute
¶
litellm_batch_completion_kwargs = litellm_batch_completion_kwargs or {}
instance-attribute
¶
Functions¶
evaluate(batch: list[ConfidenceDataInst], candidate: dict[str, str], capture_traces: bool = False) -> EvaluationBatch[ConfidenceTrajectory, ConfidenceRolloutOutput]
¶
Run candidate on batch, extracting logprob confidence.
Uses litellm.batch_completion for parallel LLM calls, then
post-processes each response to extract logprobs and build feedback.
Source code in gepa/adapters/confidence_adapter/confidence_adapter.py
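The exact blend of correctness and confidence depends on the configured ScoringStrategy. As a rough illustration of the idea only (this is a hypothetical sketch, not the formula used by gepa's LinearBlendScoring), a linear blend might look like:

```python
import math

def blended_score(correct: bool, joint_logprob: float, weight: float = 0.5) -> float:
    """Illustrative linear blend of correctness and confidence.

    NOTE: hypothetical sketch for exposition -- not the actual
    LinearBlendScoring implementation.
    """
    confidence = math.exp(joint_logprob)  # joint probability in (0, 1]
    return weight * float(correct) + (1.0 - weight) * confidence

# A correct, high-confidence prediction scores near 1.0; a correct
# but low-confidence prediction is pulled down.
high = blended_score(True, -0.01)
shaky = blended_score(True, -2.0)
```

The point of any such blend is that two equally correct predictions are no longer tied: the one the model was sure about outranks the one it nearly got wrong, giving GEPA a smoother signal to optimise.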
make_reflective_dataset(candidate: dict[str, str], eval_batch: EvaluationBatch[ConfidenceTrajectory, ConfidenceRolloutOutput], components_to_update: list[str]) -> Mapping[str, Sequence[Mapping[str, Any]]]
¶
Build reflective dataset with confidence-enriched feedback.
The feedback tells the reflection LLM why the task model was uncertain, not just whether it was correct. This enables GEPA to evolve prompts that resolve specific ambiguities between categories.
Each record in the dataset contains:
- Inputs: the original user input text.
- Generated Outputs: the model's answer annotated with probability.
- Feedback: a diagnosis including the probability, the top competing alternatives, and guidance for what the prompt should improve.
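A record of this shape might look like the following; the keys mirror the three bullets above, but the concrete text, category names, and probabilities are invented for illustration:

```python
# Hypothetical reflective-dataset record -- all values are made up.
record = {
    "Inputs": "Monthly payment to PowerGrid Co.",
    "Generated Outputs": "Bills/Electricity (p=0.62)",
    "Feedback": (
        "Correct but unreliable: p=0.62 is below the low-confidence "
        "threshold (0.90). Top competing alternative: 'Groceries' "
        "(p=0.31). Consider clarifying in the prompt how utility "
        "payments differ from retail purchases."
    ),
}
```

Because the feedback names the specific competing category, the reflection LLM can propose prompt edits that disambiguate exactly that pair, rather than generic "be more accurate" advice.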