Optimize Anything with LLMs
Automatically optimize prompts for any AI system
Frontier Performance, up to 90x cheaper | 35x faster than RL
Used by teams at Shopify, Databricks, and Dropbox
Get Started¶
```python
import gepa

# Load your dataset
trainset, valset, _ = gepa.examples.aime.init_dataset()

# Define your initial prompt
seed_prompt = {"system_prompt": "You are a helpful assistant..."}

# Run optimization
result = gepa.optimize(
    seed_candidate=seed_prompt,
    trainset=trainset,
    valset=valset,
    task_lm="openai/gpt-4.1-mini",
    max_metric_calls=150,
    reflection_lm="openai/gpt-5",
)

print(result.best_candidate['system_prompt'])
```
Result: +10 percentage points (46.6% → 56.6%) on AIME 2025 with GPT-4.1 Mini
```python
import dspy

class RAG(dspy.Module):
    def __init__(self):
        super().__init__()
        self.retrieve = dspy.Retrieve(k=3)
        self.generate = dspy.ChainOfThought("context, question -> answer")

    def forward(self, question):
        context = self.retrieve(question).passages
        return self.generate(context=context, question=question)

# Optimize with GEPA
gepa = dspy.GEPA(
    metric=your_metric,
    max_metric_calls=150,
    reflection_lm="openai/gpt-5",
)
optimized_rag = gepa.compile(student=RAG(), trainset=trainset, valset=valset)
```
GEPA is built into DSPy! See DSPy tutorials for more.
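The `your_metric` above is user-supplied. A minimal sketch of the scoring logic it might wrap (a hypothetical helper, assuming gold and predicted answers are strings): GEPA works best when the metric returns textual feedback alongside the score, so the reflection LM sees *why* an example failed, not just a number.

```python
# Hypothetical scoring helper for the RAG program above: exact match
# with a textual explanation attached to the score.
def answer_feedback(gold_answer: str, predicted_answer: str):
    correct = gold_answer.strip().lower() == predicted_answer.strip().lower()
    if correct:
        return 1.0, "Correct answer."
    return 0.0, f"Expected '{gold_answer}' but got '{predicted_answer}'."
```

To use this with `dspy.GEPA`, wrap it in a metric that returns both the score and the feedback string (e.g. via `dspy.Prediction(score=..., feedback=...)`).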
```python
from gepa import optimize
from gepa.core.adapter import EvaluationBatch

class MySystemAdapter:
    def evaluate(self, batch, candidate, capture_traces=False):
        outputs, scores, trajectories = [], [], []
        for example in batch:
            prompt = candidate['my_prompt']
            result = my_system.run(prompt, example)
            score = compute_score(result, example)
            outputs.append(result)
            scores.append(score)
            if capture_traces:
                trajectories.append({
                    'input': example,
                    'output': result.output,
                    'steps': result.intermediate_steps,
                    'errors': result.errors,
                })
        return EvaluationBatch(
            outputs=outputs,
            scores=scores,
            trajectories=trajectories if capture_traces else None,
        )

    def make_reflective_dataset(self, candidate, eval_batch, components_to_update):
        reflective_data = {}
        for component in components_to_update:
            reflective_data[component] = []
            for traj, score in zip(eval_batch.trajectories, eval_batch.scores):
                reflective_data[component].append({
                    'Inputs': traj['input'],
                    'Generated Outputs': traj['output'],
                    'Feedback': f"Score: {score}. Errors: {traj['errors']}",
                })
        return reflective_data

result = optimize(
    seed_candidate={'my_prompt': 'Initial prompt...'},
    trainset=my_trainset,
    valset=my_valset,
    adapter=MySystemAdapter(),
    task_lm="openai/gpt-4.1-mini",
)
```
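In the adapter above, `my_system` and `compute_score` stand in for your own stack. One way `compute_score` might look (a hypothetical token-overlap scorer, assuming each example is a dict with an `expected` field and the result renders to text via `str()`):

```python
# Hypothetical scorer for the adapter above: exact match, with partial
# credit for the fraction of expected tokens present in the output.
# Any function mapping (result, example) to a float works here.
def compute_score(result, example):
    predicted = str(result).strip().lower()
    expected = str(example["expected"]).strip().lower()
    if predicted == expected:
        return 1.0
    expected_tokens = expected.split()
    if not expected_tokens:
        return 0.0
    hits = sum(tok in predicted for tok in expected_tokens)
    return hits / len(expected_tokens)
```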
How It Works¶
Traditional optimizers (RL, evolutionary strategies) collapse rich execution traces into a single scalar reward: they know that a candidate failed, but not why. GEPA takes a different approach. Evaluators return Actionable Side Information (ASI), such as error messages, profiling data, and reasoning logs, and an LLM reads this feedback to diagnose failures and propose targeted fixes. Each mutation inherits the accumulated lessons of all its ancestors in the search tree. GEPA also supports system-aware merge, which combines the strengths of two Pareto-optimal candidates that excel on different tasks.
(Diagram: a candidate from the Pareto front is evaluated on a minibatch, an LLM reflects on the traces, and the prompt is improved.)
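The loop described above can be written out as simplified pseudocode (a rough sketch of the idea, not the library's actual internals; `evaluate` and `reflect_and_rewrite` are hypothetical callables):

```python
import random

def gepa_step(pareto_front, minibatch, evaluate, reflect_and_rewrite):
    """One simplified GEPA iteration: run a Pareto-optimal candidate on a
    minibatch with trace capture, let an LLM diagnose failures from the
    actionable side information, and keep the rewrite if it improves."""
    parent = random.choice(pareto_front)
    parent_eval = evaluate(parent, minibatch, capture_traces=True)
    # The reflection LM reads the traces (errors, logs, scores), not just
    # a scalar reward, and proposes a targeted edit to the prompt text.
    child = reflect_and_rewrite(parent, parent_eval.trajectories, parent_eval.scores)
    child_eval = evaluate(child, minibatch, capture_traces=False)
    if sum(child_eval.scores) > sum(parent_eval.scores):
        pareto_front.append(child)
    return pareto_front
```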
GEPA Case Studies¶
- 90x cost reduction at Databricks
- Self-evolving agents in OpenAI Cookbook
- Core algorithm in Comet ML Opik
- Production incident diagnosis
- Data analysis agents (FireBird)
- Code safety monitoring
- Healthcare multi-agent RAG systems
- 38% OCR error reduction
- Market research AI personas
- AI safety & misalignment detection
- Clinical NLP & medical error detection
- Agent architecture discovery
- Multi-objective optimization
- Adversarial prompt search
GEPA Shines When¶
- **Rollouts are expensive**
    - Scientific simulations, slow compilation
    - Complex agents with long tool calls
    - 100–500 evals vs. 10K+ for RL
- **Data is scarce**
    - New hardware with zero training data
    - Works with as few as 3 examples
    - Rapid prototyping on novel domains
- **Models are API-only**
    - No weights access needed
    - Optimize GPT-5, Claude, Gemini directly
    - No custom infra or fine-tuning pipeline
- **Interpretability matters**
    - Human-readable optimization traces
    - See why each prompt changed
    - Debug complex agent behaviors
GEPA complements RL and fine-tuning
These approaches are not mutually exclusive. Use GEPA for rapid initial optimization (minutes to hours, API-only access), then apply RL or fine-tuning for additional gains, as demonstrated in the BetterTogether / mmGRPO recipes. For scenarios with abundant data and 100,000+ cheap rollouts, gradient-based methods remain effective; GEPA works best when rollouts are expensive, data is scarce, or you need interpretable optimization traces.
Community & Resources¶
Built something with GEPA? We'd love to feature your work.
Citation¶
If you use GEPA in your research, please cite our paper:
```bibtex
@misc{agrawal2025gepareflectivepromptevolution,
  title={GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning},
  author={Lakshya A Agrawal and Shangyin Tan and Dilara Soylu and Noah Ziems and
          Rishi Khare and Krista Opsahl-Ong and Arnav Singhvi and Herumb Shandilya and
          Michael J Ryan and Meng Jiang and Christopher Potts and Koushik Sen and
          Alexandros G. Dimakis and Ion Stoica and Dan Klein and Matei Zaharia and Omar Khattab},
  year={2025},
  eprint={2507.19457},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2507.19457}
}
```