GEPA

Optimize prompts and other text parameters using LLM reflection and evolutionary search

Overview

GEPA (Genetic-Pareto) is a Python framework that improves any system with textual parameters against an evaluation metric you define. Instead of collapsing a run into a single number, it has a language model read full execution traces (error messages, profiling data, reasoning logs), diagnose why a candidate failed, and propose targeted fixes. The search advances through iterative reflection, mutation, and Pareto-aware selection.
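The reflect, mutate, and select cycle can be illustrated with a toy loop. This is a minimal sketch and not GEPA's implementation: the evaluator, the REQUIRED phrases, and the reflect function (a hand-written stand-in for the reflection LLM) are all hypothetical, and selection here is plain greedy hill climbing rather than Pareto-aware.

```python
from typing import List, Tuple

# Hypothetical instructions the toy evaluator looks for in a prompt.
REQUIRED = ["show your work", "use the format '### <answer>'"]

def evaluate(prompt: str) -> Tuple[float, List[str]]:
    """Toy evaluator: score is the fraction of required phrases present.

    The trace lists what is missing, standing in for a full execution trace.
    """
    missing = [p for p in REQUIRED if p not in prompt]
    score = 1 - len(missing) / len(REQUIRED)
    trace = [f"missing instruction: {p}" for p in missing]
    return score, trace

def reflect(prompt: str, trace: List[str]) -> str:
    """Stand-in for the reflection LLM: patch the first failure in the trace."""
    if not trace:
        return prompt
    fix = trace[0].split(": ", 1)[1]
    return f"{prompt} Always {fix}."

def toy_optimize(seed: str, budget: int) -> Tuple[str, float]:
    """Greedy loop: propose a child from the trace, keep it only if it scores higher."""
    best = seed
    best_score, trace = evaluate(seed)
    calls = 1
    while calls < budget and best_score < 1.0:
        child = reflect(best, trace)
        child_score, child_trace = evaluate(child)
        calls += 1
        if child_score > best_score:
            best, best_score, trace = child, child_score, child_trace
    return best, best_score

best_prompt, best_score = toy_optimize("You are a helpful assistant.", budget=10)
print(best_score)
print(best_prompt)
```

The point of the sketch is that the proposal step consumes the trace, not just the score, so each mutation targets a specific observed failure.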

It is built for developers who tune prompts and AI pipelines and want a systematic way to search for better variants with relatively few evaluations. The same approach extends beyond prompts to code, agent architectures, configurations, and other text artifacts through the optimize_anything API.

As a prompt-programming and LLM-orchestration tool, GEPA fits alongside DSPy, where it is available as dspy.GEPA for optimizing AI programs. You can use it standalone with its own gepa.optimize entry point or inside a DSPy pipeline.

What it does

Reflective optimization: an LLM reads full execution traces to diagnose failures rather than relying on a single scalar reward
Pareto-aware evolutionary search that keeps candidates excelling on different task subsets and evolves new variants from them
Typically reaches gains within roughly 100-500 evaluations, far fewer than reinforcement-learning approaches such as GRPO require
optimize_anything API to tune any text artifact: prompts, code, agent architectures, configs, or SVGs
Integrates with DSPy as dspy.GEPA for optimizing AI pipelines
Configurable task and reflection language models, so you can pair a cheaper task model with a stronger reflection model
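To make the Pareto-aware selection idea concrete, here is a minimal sketch that keeps every candidate achieving the best score on at least one task. The function name, candidate names, and scores are made up for illustration, and GEPA's actual selection logic may differ in its details.

```python
from typing import Dict, List

def pareto_front(scores: Dict[str, List[float]]) -> List[str]:
    """Return candidates that achieve the best score on at least one task."""
    names = list(scores)
    num_tasks = len(next(iter(scores.values())))
    keep = set()
    for t in range(num_tasks):
        best = max(scores[n][t] for n in names)
        keep.update(n for n in names if scores[n][t] == best)
    # Preserve the original candidate order in the result.
    return [n for n in names if n in keep]

# Hypothetical per-task scores for three candidate prompts.
scores = {
    "prompt_a": [1.0, 0.0, 0.8],
    "prompt_b": [0.0, 1.0, 0.5],
    "prompt_c": [0.4, 0.4, 0.4],
}
front = pareto_front(scores)
print(front)  # ['prompt_a', 'prompt_b']
```

Selecting by average score alone would keep only prompt_a here. Keeping prompt_b as well preserves the one candidate that solves the second task, so later variants can build on it.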

Getting started

Install from PyPI, then optimize a seed prompt against a dataset with a metric you choose.

Install GEPA

Install the package from PyPI. To get the latest version, install directly from the GitHub main branch instead.

bash

pip install gepa

Optimize a system prompt

Provide a seed candidate, a train and validation set, a task model, and a reflection model. This example tunes a system prompt on the AIME math benchmark.

python

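import gepa

# Load the AIME example dataset (train, validation, and an unused third split).
trainset, valset, _ = gepa.examples.aime.init_dataset()

# The seed candidate maps each text parameter name to its starting value.
seed_prompt = {
    "system_prompt": "You are a helpful assistant. Answer the question. "
                     "Put your final answer in the format '### <answer>'"
}

result = gepa.optimize(
    seed_candidate=seed_prompt,
    trainset=trainset,
    valset=valset,
    task_lm="openai/gpt-4.1-mini",  # model that runs the task with each candidate prompt
    max_metric_calls=150,           # evaluation budget
    reflection_lm="openai/gpt-5",   # stronger model that reflects on failures
)

print("Optimized prompt:", result.best_candidate['system_prompt'])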
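The seed prompt asks the model to end with '### <answer>', which makes responses easy to score mechanically. The following is a minimal sketch of such a scorer, assuming exact string match on the last marked answer. The helper names are hypothetical, and GEPA's built-in AIME example may score responses differently.

```python
import re
from typing import Optional

# Match "### <answer>" on a single line, ignoring surrounding spaces and tabs.
ANSWER_RE = re.compile(r"###[ \t]*(\S.*?)[ \t]*$", re.MULTILINE)

def extract_answer(response: str) -> Optional[str]:
    """Return the last '### <answer>' value in the response, or None if absent."""
    matches = ANSWER_RE.findall(response)
    return matches[-1] if matches else None

def exact_match(response: str, expected: str) -> float:
    """Score 1.0 when the extracted answer equals the expected string, else 0.0."""
    answer = extract_answer(response)
    return 1.0 if answer is not None and answer == expected.strip() else 0.0

print(exact_match("The sides sum to 204.\n### 204", "204"))
print(exact_match("I think the answer is 204.", "204"))
```

A response that never emits the marker scores zero even when its reasoning is right, which is exactly the kind of failure a reflection step can spot in a trace and fix by tightening the prompt.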

Use it inside DSPy (optional)

For AI pipelines, GEPA is available as dspy.GEPA. Pass your metric, then compile a DSPy program against your data.

python

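import dspy

# your_metric, MyProgram, trainset, and valset are placeholders for your own code.
optimizer = dspy.GEPA(
    metric=your_metric,
    max_metric_calls=150,  # evaluation budget
    # DSPy expects an LM object here rather than a bare model string.
    reflection_lm=dspy.LM("openai/gpt-5", temperature=1.0, max_tokens=32000),
)
optimized_program = optimizer.compile(student=MyProgram(), trainset=trainset, valset=valset)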
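The your_metric placeholder above needs a concrete function. The sketch below is not taken from the project's documentation. It assumes the five-argument metric signature described in DSPy's GEPA documentation and a hypothetical answer field on both the example and the prediction. It returns only a score, although DSPy's documentation also describes returning textual feedback alongside it.

```python
from types import SimpleNamespace

def your_metric(gold, pred, trace=None, pred_name=None, pred_trace=None) -> float:
    """Score 1.0 when the predicted answer matches the gold answer, else 0.0.

    The `answer` field is an assumption; use whatever fields your program defines.
    """
    return float(str(pred.answer).strip() == str(gold.answer).strip())

# Stand-in objects for a quick local check, no DSPy install required.
gold = SimpleNamespace(answer="204")
print(your_metric(gold, SimpleNamespace(answer=" 204 ")))
print(your_metric(gold, SimpleNamespace(answer="205")))
```

Checking the metric on a few hand-built cases before starting a run is cheap, and it avoids spending the evaluation budget on a scoring bug.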

Commands and code are distilled from the project's own documentation; always check the official repo for the latest.

When to use it

Improve a system or task prompt against a benchmark or custom metric without hand-tuning by trial and error
Optimize a DSPy AI program end to end using dspy.GEPA
Tune non-prompt text artifacts such as code, agent architectures, or configuration files with the optimize_anything API
Search for better-performing variants when reinforcement-learning approaches would need too many evaluations

How GEPA compares

GEPA compared with other open-source prompt-programming tools tracked by AI/TLDR, ranked by GitHub stars.

Tool	Stars	What it does
DSPy	★ 35.8k	A Stanford framework for programming language models with composable modules and automatic prompt optimization instead of hand-written prompts.
ell	★ 5.9k	A Python library that treats prompts as versioned functions, with tooling to track, visualize, and iterate on them as code.
GEPA	★ 5.5k	Optimize prompts and other text parameters using LLM reflection and evolutionary search.
LMQL	★ 4.2k	A query language for LLMs that mixes Python control flow with prompts and constraints to script multi-step generation.
AdalFlow	★ 4.2k	A PyTorch-like library for building and auto-optimizing LLM pipelines, tuning prompts across the components of a task.
TextGrad	★ 3.6k	A library that optimizes prompts and other text variables using textual gradients, applying a backpropagation-like loop driven by LLM feedback.
Mirascope	★ 1.5k	A lightweight Python toolkit for writing LLM calls as typed functions with prompt templates, chaining, and a single interface across providers.
