AI/TLDR

GEPA

Optimize prompts and other text parameters using LLM reflection and evolutionary search

Overview

GEPA (Genetic-Pareto) is a Python framework that improves any system with textual parameters against an evaluation metric you define. Rather than collapsing a run into a single score, it has a language model read the full execution traces (error messages, profiling data, reasoning logs) to diagnose why a candidate failed. It then proposes targeted fixes through iterative reflection, mutation, and Pareto-aware selection.

It is built for developers who tune prompts and AI pipelines and want a systematic way to search for better variants with relatively few evaluations. The same approach extends beyond prompts to code, agent architectures, configurations, and other text artifacts through the optimize_anything API.

As a prompt-programming and LLM-orchestration tool, GEPA fits alongside DSPy, where it is available as dspy.GEPA for optimizing AI programs. You can use it standalone with its own gepa.optimize entry point or inside a DSPy pipeline.

What it does

  • Reflective optimization: an LLM reads full execution traces to diagnose failures rather than relying on a single scalar reward
  • Pareto-aware evolutionary search that keeps candidates excelling on different task subsets and evolves new variants from them
  • Shows gains within roughly 100-500 evaluations, far fewer than reinforcement-learning approaches such as GRPO require
  • optimize_anything API to tune any text artifact: prompts, code, agent architectures, configs, or SVGs
  • Integrates with DSPy as dspy.GEPA for optimizing AI pipelines
  • Configurable task and reflection language models, so you can pair a cheaper task model with a stronger reflection model
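
To make the Pareto-aware selection idea concrete, here is a minimal sketch in plain Python. It is a simplified illustration, not GEPA's actual implementation: the function names `pareto_candidates` and `pick_parent`, the candidate names, and the scores are all hypothetical. It keeps every candidate that is best on at least one task, then samples a parent in proportion to how many tasks that candidate wins.

```python
import random


def pareto_candidates(scores):
    """Return candidates that are best (or tied for best) on at least one task.

    `scores` maps a candidate name to its list of per-task scores.
    All lists are assumed to have the same length.
    """
    num_tasks = len(next(iter(scores.values())))
    keep = set()
    for t in range(num_tasks):
        best = max(s[t] for s in scores.values())
        keep.update(name for name, s in scores.items() if s[t] == best)
    return sorted(keep)


def pick_parent(scores, rng):
    """Sample one frontier candidate, weighted by the number of tasks it wins."""
    frontier = pareto_candidates(scores)
    num_tasks = len(next(iter(scores.values())))
    best_per_task = [max(s[t] for s in scores.values()) for t in range(num_tasks)]
    wins = [
        sum(1 for t in range(num_tasks) if scores[name][t] == best_per_task[t])
        for name in frontier
    ]
    return rng.choices(frontier, weights=wins, k=1)[0]


# Hypothetical per-task scores for three prompt variants.
scores = {
    "seed": [1.0, 0.0, 0.0],
    "variant_a": [0.0, 1.0, 1.0],
    "variant_b": [0.0, 0.0, 0.5],
}

frontier = pareto_candidates(scores)
parent = pick_parent(scores, random.Random(0))
print("Frontier:", frontier)
print("Parent to mutate next:", parent)
```

In this toy data, `seed` stays on the frontier because it is the only candidate that solves the first task, even though its average score is lower than `variant_a`. Selecting by average alone would discard it.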

Getting started

Install from PyPI, then optimize a seed prompt against a dataset with a metric you choose.

Install GEPA

Install the package from PyPI. To get the latest version, install directly from the GitHub main branch instead.

bash
pip install gepa

Optimize a system prompt

Provide a seed candidate, a train and validation set, a task model, and a reflection model. This example tunes a system prompt on the AIME math benchmark.

python
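import gepa

# Load the AIME train and validation splits (the third split is unused here).
trainset, valset, _ = gepa.examples.aime.init_dataset()

# The seed candidate maps each text parameter's name to its starting value.
seed_prompt = {
    "system_prompt": "You are a helpful assistant. Answer the question. "
                     "Put your final answer in the format '### <answer>'"
}

result = gepa.optimize(
    seed_candidate=seed_prompt,
    trainset=trainset,
    valset=valset,
    task_lm="openai/gpt-4.1-mini",   # model that runs the prompt being optimized
    max_metric_calls=150,            # evaluation budget
    reflection_lm="openai/gpt-5",    # stronger model that reflects on failures
)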

print("Optimized prompt:", result.best_candidate['system_prompt'])
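
The seed prompt above asks the model to end with `### <answer>`, so a metric has to pull that value out of the response before comparing it. The following is a minimal sketch of how such scoring could work. It is not GEPA's built-in evaluator, and the helper names `extract_final_answer` and `exact_match_score` are hypothetical.

```python
import re


def extract_final_answer(response):
    """Return the text after the last '###' marker, or None if there is none."""
    matches = re.findall(r"###\s*(.+)", response)
    return matches[-1].strip() if matches else None


def exact_match_score(response, expected):
    """Score 1.0 when the extracted answer equals the expected value, else 0.0."""
    answer = extract_final_answer(response)
    return 1.0 if answer is not None and answer == str(expected) else 0.0


# A response that follows the format scores 1.0; one that ignores it scores 0.0.
print(exact_match_score("The sum is 40 + 2.\n### 42", 42))
print(exact_match_score("The answer is 42.", 42))
```

Taking the last marker rather than the first is a design choice: it tolerates a model that revises its answer partway through a response.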

Use it inside DSPy (optional)

For AI pipelines, GEPA is available as dspy.GEPA. Pass your metric, then compile a DSPy program against your data.

python
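import dspy

# Placeholders you supply: your_metric, MyProgram, trainset, valset.
# Assumes the task model is already set with dspy.configure(lm=...).
optimizer = dspy.GEPA(
    metric=your_metric,
    max_metric_calls=150,
    # dspy.GEPA expects an LM object here rather than a model-name string.
    reflection_lm=dspy.LM("openai/gpt-5", temperature=1.0, max_tokens=32000),
)
optimized_program = optimizer.compile(student=MyProgram(), trainset=trainset, valset=valset)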
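
The `your_metric` placeholder above could look like the sketch below. It assumes, based on my reading of the DSPy documentation, that dspy.GEPA calls the metric with five arguments (`gold`, `pred`, `trace`, `pred_name`, `pred_trace`) and accepts a plain float score. The `answer` field name is hypothetical and depends on your program's signature. The stand-in objects use `SimpleNamespace` so the example runs without DSPy installed.

```python
from types import SimpleNamespace


def exact_match_metric(gold, pred, trace=None, pred_name=None, pred_trace=None):
    """Score 1.0 when the predicted answer matches the gold answer, else 0.0.

    The three optional arguments are unused in this sketch; they are kept
    so the function matches the five-argument call described above.
    """
    expected = str(gold.answer).strip()
    got = str(getattr(pred, "answer", "")).strip()
    return 1.0 if got == expected else 0.0


# Stand-ins for a dspy.Example and a dspy.Prediction (assumed `answer` field).
gold = SimpleNamespace(answer="42")
good = SimpleNamespace(answer=" 42 ")
bad = SimpleNamespace(answer="41")

print(exact_match_metric(gold, good))
print(exact_match_metric(gold, bad))
```

Because GEPA relies on reflection over failures, a metric that also returns textual feedback is likely to give the reflection model more to work with than a bare score. Check the DSPy documentation for the exact return type it expects for that.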

Commands and code are distilled from the project's own documentation; always check the official repo for the latest version.

When to use it

  • Improve a system or task prompt against a benchmark or custom metric without hand-tuning by trial and error
  • Optimize a DSPy AI program end to end using dspy.GEPA
  • Tune non-prompt text artifacts such as code, agent architectures, or configuration files with the optimize_anything API
  • Search for better-performing variants when reinforcement-learning approaches would need too many evaluations

How GEPA compares

GEPA alongside the other open-source prompt-programming tools that AI/TLDR tracks, ranked by GitHub stars.

Tool | Stars | What it does
DSPy | ★ 35.8k | A Stanford framework for programming language models with composable modules and automatic prompt optimization instead of hand-written prompts.
ell | ★ 5.9k | A Python library that treats prompts as versioned functions, with tooling to track, visualize, and iterate on them as code.
GEPA | ★ 5.5k | Optimize prompts and other text parameters using LLM reflection and evolutionary search.
LMQL | ★ 4.2k | A query language for LLMs that mixes Python control flow with prompts and constraints to script multi-step generation.
AdalFlow | ★ 4.2k | A PyTorch-like library for building and auto-optimizing LLM pipelines, tuning prompts across the components of a task.
TextGrad | ★ 3.6k | A library that optimizes prompts and other text variables using textual gradients, applying a backpropagation-like loop driven by LLM feedback.
Mirascope | ★ 1.5k | A lightweight Python toolkit for writing LLM calls as typed functions with prompt templates, chaining, and a single interface across providers.