
Optimize Anything with LLMs

Automatically optimize prompts for any AI system

Frontier Performance, up to 90x cheaper | 35x faster than RL

Used by Shopify, Databricks and Dropbox


What People Are Saying

Tobi Lutke
CEO, Shopify
Both DSPy and (especially) GEPA are currently severely under hyped in the AI context engineering world
Ivan Zhou
Research Engineer, Databricks Mosaic
GEPA can push open models beyond frontier performance; gpt-oss-120b + GEPA beats Claude Opus 4.1 while being 90x cheaper
Drew Houston
CEO, Dropbox
Have heard great things about DSPy plus GEPA, which is an even stronger prompt optimizer than miprov2 — repo and (fascinating) examples of generated prompts

Get Started

pip install gepa
import gepa

# Load your dataset
trainset, valset, _ = gepa.examples.aime.init_dataset()

# Define your initial prompt
seed_prompt = {"system_prompt": "You are a helpful assistant..."}

# Run optimization
result = gepa.optimize(
    seed_candidate=seed_prompt,
    trainset=trainset,
    valset=valset,
    task_lm="openai/gpt-4.1-mini",
    max_metric_calls=150,
    reflection_lm="openai/gpt-5",
)

print(result.best_candidate['system_prompt'])

Result: +10 points (46.6% → 56.6%) on AIME 2025 with GPT-4.1 Mini

import dspy

class RAG(dspy.Module):
    def __init__(self):
        super().__init__()
        self.retrieve = dspy.Retrieve(k=3)
        self.generate = dspy.ChainOfThought("context, question -> answer")

    def forward(self, question):
        context = self.retrieve(question).passages
        return self.generate(context=context, question=question)

# Optimize with GEPA (name the optimizer something other than `gepa`,
# which would shadow the library module)
optimizer = dspy.GEPA(
    metric=your_metric,
    max_metric_calls=150,
    reflection_lm=dspy.LM("openai/gpt-5"),
)
optimized_rag = optimizer.compile(student=RAG(), trainset=trainset, valset=valset)

GEPA is built into DSPy! See DSPy tutorials for more.
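GEPA works best when the metric returns textual feedback alongside the score, since the reflection LM reads that feedback to propose targeted edits. A minimal sketch of such a metric, assuming a hypothetical exact-match QA task and using a plain dict as a stand-in for the `dspy.Prediction(score=..., feedback=...)` object DSPy expects:

```python
def answer_match_metric(gold, pred, trace=None, pred_name=None, pred_trace=None):
    """Score a prediction and explain the outcome in natural language.

    The feedback string is what the reflection LM reads to diagnose
    failures; a bare float would only say *that* the candidate failed.
    """
    correct = gold["answer"].strip().lower() == pred["answer"].strip().lower()
    score = 1.0 if correct else 0.0
    feedback = (
        f"Correct: produced '{pred['answer']}'."
        if correct
        else f"Wrong: produced '{pred['answer']}', expected '{gold['answer']}'."
    )
    return {"score": score, "feedback": feedback}
```

In a real DSPy program, `gold` and `pred` would be `dspy.Example` and prediction objects; the shape above is illustrative only.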

from gepa import optimize
from gepa.core.adapter import EvaluationBatch

class MySystemAdapter:
    def evaluate(self, batch, candidate, capture_traces=False):
        outputs, scores, trajectories = [], [], []
        for example in batch:
            prompt = candidate['my_prompt']
            result = my_system.run(prompt, example)
            score = compute_score(result, example)
            outputs.append(result)
            scores.append(score)
            if capture_traces:
                trajectories.append({
                    'input': example,
                    'output': result.output,
                    'steps': result.intermediate_steps,
                    'errors': result.errors
                })
        return EvaluationBatch(
            outputs=outputs,
            scores=scores,
            trajectories=trajectories if capture_traces else None
        )

    def make_reflective_dataset(self, candidate, eval_batch, components_to_update):
        reflective_data = {}
        for component in components_to_update:
            reflective_data[component] = []
            for traj, score in zip(eval_batch.trajectories, eval_batch.scores):
                reflective_data[component].append({
                    'Inputs': traj['input'],
                    'Generated Outputs': traj['output'],
                    'Feedback': f"Score: {score}. Errors: {traj['errors']}"
                })
        return reflective_data

result = optimize(
    seed_candidate={'my_prompt': 'Initial prompt...'},
    trainset=my_trainset,
    valset=my_valset,
    adapter=MySystemAdapter(),  # the adapter runs the task system itself
    reflection_lm="openai/gpt-5",
    max_metric_calls=150,
)
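The adapter contract can be exercised in isolation, without GEPA installed. The sketch below uses hypothetical stand-ins for `EvaluationBatch`, `my_system`, and `compute_score` to show what `evaluate` produces when traces are captured:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class EvaluationBatch:  # stand-in for gepa.core.adapter.EvaluationBatch
    outputs: list
    scores: list
    trajectories: Optional[list]

def run_system(prompt, example):  # hypothetical task system
    return {"output": example["q"].upper(), "steps": ["retrieve", "answer"], "errors": []}

def compute_score(result, example):
    return 1.0 if result["output"] == example["a"] else 0.0

class ToyAdapter:
    def evaluate(self, batch, candidate, capture_traces=False):
        outputs, scores, trajectories = [], [], []
        for ex in batch:
            result = run_system(candidate["my_prompt"], ex)
            outputs.append(result)
            scores.append(compute_score(result, ex))
            if capture_traces:
                # Trajectories carry the side information GEPA reflects on
                trajectories.append({"input": ex, "output": result["output"],
                                     "steps": result["steps"], "errors": result["errors"]})
        return EvaluationBatch(outputs, scores, trajectories if capture_traces else None)

batch = [{"q": "hi", "a": "HI"}, {"q": "no", "a": "yes"}]
eval_batch = ToyAdapter().evaluate(batch, {"my_prompt": "..."}, capture_traces=True)
```

Here `eval_batch.scores` comes out as `[1.0, 0.0]`, and each trajectory records the inputs, outputs, and errors that `make_reflective_dataset` later turns into feedback text.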

Results

How It Works

Traditional optimizers (RL, evolutionary strategies) collapse rich execution traces into a single scalar reward: they learn that a candidate failed, but not why. GEPA takes a different approach. Evaluators return Actionable Side Information (ASI), such as error messages, profiling data, and reasoning logs, and an LLM reads this feedback to diagnose failures and propose targeted fixes. Each mutation inherits the accumulated lessons of all its ancestors in the search tree. GEPA also supports system-aware merge, which combines the strengths of two Pareto-optimal candidates that excel on different tasks.

1. Select from Pareto front: pick a candidate that excels on some examples
2. Run on minibatch: execute and capture full traces
3. Reflect with LLM: diagnose failures in natural language
4. Mutate prompt: accumulate lessons from ancestors and new rollouts
5. Accept if improved: add to the pool and update the Pareto front

Repeat until convergence.
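The loop above can be sketched in a few lines. This is a hedged, toy rendering, not the library's implementation: per-example scores drive Pareto selection, and the stubbed `mutate` argument stands in for the reflect-and-rewrite step an LLM performs in GEPA:

```python
import random

def pareto_front(pool):
    """Candidates that are best on at least one validation example."""
    n = len(pool[0]["scores"])
    best = [max(c["scores"][i] for c in pool) for i in range(n)]
    return [c for c in pool if any(c["scores"][i] == best[i] for i in range(n))]

def gepa_loop(seed, evaluate, mutate, budget, rng=None):
    rng = rng or random.Random(0)
    pool = [{"prompt": seed, "scores": evaluate(seed)}]
    for _ in range(budget):
        parent = rng.choice(pareto_front(pool))   # 1. select from Pareto front
        child = mutate(parent["prompt"])          # 2-4. rollout, reflect, mutate
        scores = evaluate(child)
        if sum(scores) > sum(parent["scores"]):   # 5. accept only if improved
            pool.append({"prompt": child, "scores": scores})
    return max(pool, key=lambda c: sum(c["scores"]))
```

Keeping the whole Pareto front, rather than a single global best, is what lets the search retain candidates that excel on different subsets of examples and later merge their strengths.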
Based on research from UC Berkeley, Stanford, MIT & Databricks. Learn how it works → | Read the paper →

GEPA Case Studies

GEPA Shines When

Rollouts Are Expensive
  • Scientific simulations, slow compilation
  • Complex agents with long tool calls
  • 100–500 evals vs. 10K+ for RL
Data Is Scarce
  • New hardware with zero training data
  • Works with as few as 3 examples
  • Rapid prototyping on novel domains
API-Only Models
  • No weights access needed
  • Optimize GPT-5, Claude, Gemini directly
  • No custom infra or fine-tuning pipeline
You Need Interpretability
  • Human-readable optimization traces
  • See why each prompt changed
  • Debug complex agent behaviors

GEPA complements RL and fine-tuning

These approaches are not mutually exclusive. Use GEPA for rapid initial optimization (minutes to hours, with API-only access), then apply RL or fine-tuning for additional gains, as demonstrated in the BetterTogether and mmGRPO recipes. For scenarios with abundant data and 100,000+ cheap rollouts, gradient-based methods remain effective; GEPA works best when rollouts are expensive, data is scarce, or you need interpretable optimization traces.

Community & Resources

Built something with GEPA? We'd love to feature your work.

Citation

If you use GEPA in your research, please cite our paper:

@misc{agrawal2025gepareflectivepromptevolution,
      title={GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning},
      author={Lakshya A Agrawal and Shangyin Tan and Dilara Soylu and Noah Ziems and
              Rishi Khare and Krista Opsahl-Ong and Arnav Singhvi and Herumb Shandilya and
              Michael J Ryan and Meng Jiang and Christopher Potts and Koushik Sen and
              Alexandros G. Dimakis and Ion Stoica and Dan Klein and Matei Zaharia and Omar Khattab},
      year={2025},
      eprint={2507.19457},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2507.19457}
}