
GEPA

Automatically optimize prompts for any AI system

Frontier Performance, up to 90x cheaper | 35x faster than RL

Used by Shopify, Databricks and Dropbox

What People Are Saying

"Both DSPy and (especially) GEPA are currently severely under hyped in the AI context engineering world"
Tobi Lutke, CEO, Shopify (View on X)

"GEPA can push open models beyond frontier performance; gpt-oss-120b + GEPA beats Claude Opus 4.1 while being 90x cheaper"
Ivan Zhou, Research Engineer, Databricks Mosaic (View on X)

"Have heard great things about DSPy plus GEPA, which is an even stronger prompt optimizer than miprov2 — repo and (fascinating) examples of generated prompts"
Drew Houston, CEO, Dropbox (View on X)

Get Started

pip install gepa
import gepa

# Load your dataset
trainset, valset, _ = gepa.examples.aime.init_dataset()

# Define your initial prompt
seed_prompt = {"system_prompt": "You are a helpful assistant..."}

# Run optimization
result = gepa.optimize(
    seed_candidate=seed_prompt,
    trainset=trainset,
    valset=valset,
    task_lm="openai/gpt-4.1-mini",
    max_metric_calls=150,
    reflection_lm="openai/gpt-5",
)

print(result.best_candidate['system_prompt'])

Result: +10% improvement (46.6% → 56.6%) on AIME 2025 with GPT-4.1 Mini
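The optimized prompt returned by gepa.optimize is plain text, so it can be saved and reused outside of GEPA; a minimal sketch (the file name is illustrative):

from pathlib import Path

# Persist the optimized prompt so your serving code can load it later.
Path("optimized_system_prompt.txt").write_text(result.best_candidate["system_prompt"])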

import dspy

class RAG(dspy.Module):
    def __init__(self):
        super().__init__()
        self.retrieve = dspy.Retrieve(k=3)
        self.generate = dspy.ChainOfThought("context, question -> answer")

    def forward(self, question):
        context = self.retrieve(question).passages
        return self.generate(context=context, question=question)

# Optimize with GEPA
gepa = dspy.GEPA(
    metric=your_metric,
    max_metric_calls=150,
    reflection_lm=dspy.LM("openai/gpt-5")
)
optimized_rag = gepa.compile(student=RAG(), trainset=trainset, valset=valset)

GEPA is built into DSPy! See DSPy tutorials for more.
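The your_metric placeholder above is whatever scoring function fits your task. One possible shape is sketched below, assuming gold examples and predictions both expose an answer field; returning textual feedback alongside the score gives GEPA's reflection step more to work with:

import dspy

def your_metric(gold, pred, trace=None, pred_name=None, pred_trace=None):
    # Illustrative exact-match metric; replace with task-specific scoring.
    correct = gold.answer.strip().lower() == pred.answer.strip().lower()
    feedback = "Correct." if correct else f"Expected '{gold.answer}', got '{pred.answer}'."
    return dspy.Prediction(score=float(correct), feedback=feedback)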

from gepa import optimize
from gepa.core.adapter import EvaluationBatch

class MySystemAdapter:
    def evaluate(self, batch, candidate, capture_traces=False):
        outputs, scores, trajectories = [], [], []
        for example in batch:
            prompt = candidate['my_prompt']
            result = my_system.run(prompt, example)
            score = compute_score(result, example)
            outputs.append(result)
            scores.append(score)
            if capture_traces:
                trajectories.append({
                    'input': example,
                    'output': result.output,
                    'steps': result.intermediate_steps,
                    'errors': result.errors
                })
        return EvaluationBatch(
            outputs=outputs,
            scores=scores,
            trajectories=trajectories if capture_traces else None
        )

    def make_reflective_dataset(self, candidate, eval_batch, components_to_update):
        reflective_data = {}
        for component in components_to_update:
            reflective_data[component] = []
            for traj, score in zip(eval_batch.trajectories, eval_batch.scores):
                reflective_data[component].append({
                    'Inputs': traj['input'],
                    'Generated Outputs': traj['output'],
                    'Feedback': f"Score: {score}. Errors: {traj['errors']}"
                })
        return reflective_data

result = optimize(
    seed_candidate={'my_prompt': 'Initial prompt...'},
    trainset=my_trainset,
    valset=my_valset,
    adapter=MySystemAdapter(),
    task_lm="openai/gpt-4.1-mini",
)
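The adapter above relies on two placeholders, my_system.run and compute_score, which stand in for however your system executes a candidate prompt and scores its output. A minimal stub under those assumptions (all names here are illustrative):

from dataclasses import dataclass, field

@dataclass
class RunResult:
    # Minimal container matching the fields the adapter reads off each run.
    output: str
    intermediate_steps: list = field(default_factory=list)
    errors: list = field(default_factory=list)

class MySystem:
    def run(self, prompt: str, example: dict) -> RunResult:
        # Call your real pipeline here (an LLM, an agent, a tool chain, ...).
        # This stub just concatenates the prompt and question so the adapter
        # example runs end to end.
        return RunResult(output=f"{prompt}\n\n{example['question']}")

def compute_score(result: RunResult, example: dict) -> float:
    # Task-specific scoring; substring match is only a stand-in.
    return float(example["answer"].lower() in result.output.lower())

my_system = MySystem()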

Results

How It Works

1. Select from the Pareto front: pick a candidate that excels on some examples.
2. Run on a minibatch: execute the system and capture full traces.
3. Reflect with an LLM: diagnose failures in natural language.
4. Mutate the prompt: accumulate lessons from ancestors and new rollouts.
5. Accept if improved: add the candidate to the pool and update the Pareto front.

Repeat until convergence.

GEPA uses natural language reasoning instead of gradients to diagnose failures and improve prompts. Each mutation inherits lessons from all of its ancestors.
Based on research from UC Berkeley, Stanford, MIT & Databricks.
Read the paper →
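In schematic form, the loop looks roughly like the Python-style pseudocode below; every helper name is a placeholder for illustration, not part of the gepa package's API.

# Schematic pseudocode for the GEPA loop described above.
pool = [seed_candidate]
while budget_remaining():
    parent = sample_from_pareto_front(pool)             # 1. candidate that wins on some examples
    traces, scores = run_on_minibatch(parent)           # 2. execute and capture full traces
    lessons = reflect_with_llm(parent, traces, scores)  # 3. diagnose failures in natural language
    child = mutate_prompt(parent, lessons)              # 4. rewrite prompts using accumulated lessons
    if improves_on_minibatch(child, parent):            # 5. keep only improvements
        pool.append(child)
        update_pareto_front(pool)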

When to Use GEPA

Why Choose GEPA?

Feature           | GEPA             | Reinforcement Learning | Manual Prompting
Cost              | Low              | Very High              | Low
Sample Efficiency | High (150 calls) | Low (10K+ calls)       | N/A
Performance       | SOTA             | SOTA                   | Suboptimal
Interpretability  | Natural Language | Black Box              | Clear
Setup Time        | Minutes          | Days/Weeks             | Minutes
Framework Support | Any System       | Framework Specific     | Any System
Multi-Objective   | Native           | Complex                | Manual

Community & Resources

Built something with GEPA? We'd love to feature your work.

Citation

If you use GEPA in your research, please cite our paper:

@misc{agrawal2025gepareflectivepromptevolution,
      title={GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning},
      author={Lakshya A Agrawal and Shangyin Tan and Dilara Soylu and Noah Ziems and
              Rishi Khare and Krista Opsahl-Ong and Arnav Singhvi and Herumb Shandilya and
              Michael J Ryan and Meng Jiang and Christopher Potts and Koushik Sen and
              Alexandros G. Dimakis and Ion Stoica and Dan Klein and Matei Zaharia and Omar Khattab},
      year={2025},
      eprint={2507.19457},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2507.19457}
}