Overview
GEPA (Genetic-Pareto) is a Python framework that improves any system with textual parameters against an evaluation metric you define. Rather than collapsing a run into a single number, it has a language model read full execution traces (error messages, profiling data, reasoning logs), work out why a candidate failed, and propose targeted fixes. The search proceeds through iterative reflection, mutation, and Pareto-aware selection.
It is built for developers who tune prompts and AI pipelines and want a systematic way to search for better variants with relatively few evaluations. The same approach extends beyond prompts to code, agent architectures, configurations, and other text artifacts through the optimize_anything API.
As a prompt-programming and LLM-orchestration tool, GEPA fits alongside DSPy, where it is available as dspy.GEPA for optimizing AI programs. You can use it standalone with its own gepa.optimize entry point or inside a DSPy pipeline.
What it does
- Reflective optimization: an LLM reads full execution traces to diagnose failures rather than relying on a single scalar reward
- Pareto-aware evolutionary search that keeps candidates excelling on different task subsets and evolves new variants from them
- Reaches gains in roughly 100-500 evaluations, far fewer than reinforcement-learning approaches like GRPO
- optimize_anything API to tune any text artifact: prompts, code, agent architectures, configs, or SVGs
- Integrates with DSPy as dspy.GEPA for optimizing AI pipelines
- Configurable task and reflection language models, so you can pair a cheaper task model with a stronger reflection model
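To make the Pareto-aware selection in the list above concrete, here is a minimal pure-Python sketch. It is not GEPA's implementation: the function names, the scores, and the "weight by number of tasks won" rule are illustrative assumptions. It only shows the general idea of keeping every candidate that is best on at least one task instead of keeping the single best average.

```python
import random


def task_wins(scores):
    """Count, for each candidate, the tasks on which it ties for the top score."""
    n_tasks = len(next(iter(scores.values())))
    wins = {name: 0 for name in scores}
    for t in range(n_tasks):
        best = max(per_task[t] for per_task in scores.values())
        for name, per_task in scores.items():
            if per_task[t] == best:
                wins[name] += 1
    return wins


def pareto_front(scores):
    """Keep every candidate that is best on at least one task."""
    return {name for name, w in task_wins(scores).items() if w > 0}


def sample_parent(scores, rng):
    """Pick the next candidate to mutate, weighted by how many tasks it wins."""
    wins = task_wins(scores)
    names = sorted(name for name, w in wins.items() if w > 0)
    return rng.choices(names, weights=[wins[n] for n in names], k=1)[0]


# Hypothetical per-task scores for four prompt variants on five tasks.
scores = {
    "seed":      [0.5, 0.0, 0.0, 0.0, 0.0],
    "variant_a": [1.0, 1.0, 1.0, 0.0, 0.0],
    "variant_b": [0.0, 0.0, 0.0, 1.0, 1.0],
    "variant_c": [0.0, 0.5, 0.0, 0.5, 0.0],
}

print(sorted(pareto_front(scores)))  # ['variant_a', 'variant_b']
print(sample_parent(scores, random.Random(0)))  # one of the two front members
```

Ranking by mean score alone would keep only variant_a (0.6 against 0.4) and discard variant_b, even though variant_b solves the two tasks that variant_a fails. Keeping both gives later mutations something to build on for each subset of tasks.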
Getting started
Install from PyPI, then optimize a seed prompt against a dataset with a metric you choose.
Install GEPA
Install the package from PyPI. To get the latest version, install directly from the GitHub main branch instead.
pip install gepa
Optimize a system prompt
Provide a seed candidate, a train and validation set, a task model, and a reflection model. This example tunes a system prompt on the AIME math benchmark.
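import gepa
# Load the AIME train and validation splits (the third return value is ignored here).
trainset, valset, _ = gepa.examples.aime.init_dataset()
# The seed candidate maps a parameter name to its starting text.
seed_prompt = {
    "system_prompt": "You are a helpful assistant. Answer the question. "
    "Put your final answer in the format '### <answer>'"
}
result = gepa.optimize(
    seed_candidate=seed_prompt,
    trainset=trainset,
    valset=valset,
    task_lm="openai/gpt-4.1-mini",  # model that runs the task
    max_metric_calls=150,  # evaluation budget
    reflection_lm="openai/gpt-5",  # stronger model that reflects on traces
)
print("Optimized prompt:", result.best_candidate['system_prompt'])
Use it inside DSPy (optional)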
For AI pipelines, GEPA is available as dspy.GEPA. Pass your metric, then compile a DSPy program against your data.
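import dspy
# your_metric, MyProgram, trainset, and valset are placeholders for your own metric, program, and data.
optimizer = dspy.GEPA(
    metric=your_metric,
    max_metric_calls=150,
    # dspy.GEPA takes a dspy.LM instance for reflection rather than a bare model string.
    reflection_lm=dspy.LM("openai/gpt-5", temperature=1.0, max_tokens=32000),
)
optimized_program = optimizer.compile(student=MyProgram(), trainset=trainset, valset=valset)
Commands and code are distilled from the project's own documentation; always check the official repo for the latest.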
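The your_metric placeholder in the DSPy example is left undefined. As a rough sketch: per the DSPy documentation at the time of writing, dspy.GEPA calls the metric with five arguments (the gold example, the prediction, and three trace-related values), so a compatible function could look like the one below. The answer field and the exact-match rule are assumptions made for illustration, and the stand-in objects exist only so the snippet runs without DSPy installed.

```python
from types import SimpleNamespace


def exact_match_metric(gold, pred, trace=None, pred_name=None, pred_trace=None):
    """Score one prediction: 1.0 if the answers match after trimming whitespace, else 0.0.

    gold is the labeled example and pred is the program's output. The last three
    arguments carry trace information from the optimizer and are unused here.
    The `answer` field is a hypothetical name; use whatever your program produces.
    """
    expected = str(gold.answer).strip()
    actual = str(pred.answer).strip()
    return 1.0 if expected == actual else 0.0


# Stand-in objects so the sketch runs without DSPy installed.
gold = SimpleNamespace(answer="42")
print(exact_match_metric(gold, SimpleNamespace(answer=" 42 ")))  # 1.0
print(exact_match_metric(gold, SimpleNamespace(answer="41")))    # 0.0
```

DSPy's documentation also describes returning textual feedback alongside the score, which gives the reflection model more to work with than a bare number. This sketch returns only the score; check the current dspy.GEPA reference before relying on the exact signature.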
When to use it
- Improve a system or task prompt against a benchmark or custom metric without hand-tuning by trial and error
- Optimize a DSPy AI program end to end using dspy.GEPA
- Tune non-prompt text artifacts such as code, agent architectures, or configuration files with the optimize_anything API
- Search for better-performing variants when reinforcement-learning approaches would need too many evaluations
How GEPA compares
GEPA alongside the other open-source prompt-programming tools that AI/TLDR tracks, ranked by GitHub stars.
| Tool | Stars | What it does |
|---|---|---|
| DSPy | ★ 35.8k | A Stanford framework for programming language models with composable modules and automatic prompt optimization instead of hand-written prompts. |
| ell | ★ 5.9k | A Python library that treats prompts as versioned functions, with tooling to track, visualize, and iterate on them as code. |
| GEPA | ★ 5.5k | A Python framework that optimizes prompts and other text parameters using LLM reflection and evolutionary search. |
| LMQL | ★ 4.2k | A query language for LLMs that mixes Python control flow with prompts and constraints to script multi-step generation. |
| AdalFlow | ★ 4.2k | A PyTorch-like library for building and auto-optimizing LLM pipelines, tuning prompts across the components of a task. |
| TextGrad | ★ 3.6k | A library that optimizes prompts and other text variables using textual gradients, applying a backpropagation-like loop driven by LLM feedback. |
| Mirascope | ★ 1.5k | A lightweight Python toolkit for writing LLM calls as typed functions with prompt templates, chaining, and a single interface across providers. |