In plain English
When you build something with a language model, the hard part is not getting an answer — it is knowing whether the answer is any good, and whether your last tweak to the prompt made things better or quietly worse. Most people check by eye: change a word, run it once, squint at the output, ship it. That works for a demo and falls apart the moment you have ten prompts, three models, and a hundred test questions.

Promptfoo is an open-source command-line tool that turns this eyeballing into a real test suite. You write a small config that lists your prompts, the models you want to try, the example inputs to run, and the checks each output must pass. Promptfoo runs every combination, scores the results, and shows you a side-by-side grid so you can see at a glance which prompt-and-model pairing actually wins.
Think of it as a spreadsheet that runs itself. The rows are your test cases ("summarize this email," "answer this support question"). The columns are the things you are comparing (prompt A on Claude vs. prompt B on Claude vs. prompt A on a cheaper model). Each cell is filled in automatically, graded green or red, and you read down the columns to pick a winner. It also has a second job: red-teaming, where it tries to break your app with adversarial inputs to find safety holes before your users do.
Why it matters
LLM outputs are non-deterministic and sensitive to tiny prompt changes. That makes the usual "I tested it and it worked" dangerously unreliable. Promptfoo exists to replace gut feeling with evidence on a few specific problems.
- Prompt iteration without guessing. You almost never know if reordering instructions or adding an example helped until you measure it across many inputs. Promptfoo runs the old prompt and the new one over the same test set and shows you the score difference, so improvements are proven, not assumed.
- Model and cost selection. A newer or cheaper model might be just as good for your task — or quietly worse on the edge cases you care about. Running the same suite across providers turns "which model should we use?" into a measured comparison of quality, latency, and cost.
- Catching regressions in CI. A prompt that works today can break when you edit it next month or when a provider updates a model. Wiring Promptfoo into CI/CD means a pull request fails automatically if the change drops quality below your bar — the same safety net unit tests give normal code.
- Security and safety probing. Anything that takes user input can be attacked with prompt injection or jailbreaks. Promptfoo's red-team mode generates adversarial inputs so you find these holes in testing instead of in production.
Who should care? Anyone shipping an LLM feature past the prototype stage — prompt engineers comparing variants, application teams choosing a model, and platform teams who need an automated quality gate before deploy. If you have ever changed a prompt and hoped for the best, this is the tool that lets you stop hoping.
How it works
Promptfoo is config-driven. Instead of writing test code by hand, you describe a test matrix in a YAML file: which prompts, which models (called providers), which inputs (tests), and which checks (assertions). Promptfoo expands that into every combination, runs them, applies the assertions, and aggregates the scores.
The four things you declare
- Prompts — the prompt templates you want to test, usually with placeholders like
{{question}}that get filled from each test case. - Providers — the models to run against (for example an Anthropic model, an OpenAI model, or a local one). Listing several is how you compare them head to head.
- Tests — the example inputs. Each test supplies values for the prompt's placeholders and, optionally, its own assertions.
- Assertions — the checks that decide pass or fail for each output. These are the heart of the tool, and they come in two flavors.
Two kinds of assertion
Some checks are deterministic (also called code-graded): does the output contain a required string, is it valid JSON, does it match a regex, is it under a latency or cost budget. These are fast, free, and perfectly repeatable. Other checks are model-graded: you ask another LLM to judge something fuzzy, like "is this answer factually grounded in the context?" or "does it match this rubric?" — this is the LLM-as-a-judge pattern. Promptfoo supports both, and most real suites mix them. (See code vs. model-graded evals for when to reach for each.)
prompts:
- "Answer the question concisely: {{question}}"
- "You are a support agent. Reply in one sentence: {{question}}"
providers:
- anthropic:messages:claude-sonnet-4-6
- openai:gpt-4.1-mini
tests:
- vars:
question: "What is your refund window for physical items?"
assert:
- type: contains
value: "30 days"
- type: llm-rubric
value: "Is polite and does not invent a policy"
- vars:
question: "Do you ship internationally?"
assert:
- type: latency
threshold: 4000Run it with one command. Promptfoo executes both prompts against both providers for both test cases — eight runs — grades each with its assertions, and opens a results view.
npx promptfoo@latest eval
npx promptfoo@latest viewThe output is a grid: prompts and providers across the top, test cases down the side, each cell showing the model's answer plus a pass/fail mark and score. Because everything is repeatable, you change one prompt, re-run, and compare directly against the previous numbers.
The test-matrix mental model
The single idea that makes Promptfoo click is the matrix. You are not testing one prompt on one model — you are testing every prompt against every provider across every test case, all at once. That is what makes it useful for the two questions builders ask most: "which prompt is better?" and "which model should I use?"
- Tested on every provider
- Tested on every input
- One column to read
- Same providers
- Same inputs
- Compare directly to A
- Same prompts
- Same inputs
- Is the quality drop worth it?
Each test case can carry its own assertions, or you can set defaults that apply to all of them. As your suite grows, this becomes a golden dataset — a curated set of inputs and expected behavior that you re-run on every change. The grid never lies the way a single hand-picked example can: a prompt that looks great on your favorite question often loses on the ten you forgot about.
Red-teaming and CI integration
Beyond comparing quality, Promptfoo has a red-teaming mode aimed at security. Instead of you supplying the inputs, it generates adversarial ones — jailbreak attempts, prompt-injection payloads, requests for harmful content, attempts to leak a system prompt — runs them against your app, and reports which attacks got through. It is the difference between testing whether your app works and testing whether it can be made to misbehave.
Why wire it into CI/CD
Because the whole suite runs from one command and exits with a non-zero status when too many cases fail, it slots straight into a pipeline. Run it on every pull request that touches a prompt, and a change that drops quality below your threshold blocks the merge — turning prompt quality into a gate, not an afterthought. This is exactly regression testing for prompts.
# .github/workflows/eval.yml (sketch)
jobs:
prompt-eval:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- run: npx promptfoo@latest eval --config promptfooconfig.yaml
# non-zero exit if assertions fail -> the build goes redPromptfoo vs. other eval tools
Promptfoo sits in a crowded space of LLM evaluation tools, and people often ask how it differs from the others. The honest answer is that they overlap a lot and many teams use more than one. The rough distinctions:
| Tool | Primary shape | Sweet spot |
|---|---|---|
| Promptfoo | Config-driven CLI + red-team | Comparing prompts/models in a matrix and gating CI |
| DeepEval | Pytest-style assertions in Python | Writing evals like unit tests in a Python codebase |
| Ragas | Metric library for RAG | Scoring retrieval-augmented apps on faithfulness and relevancy |
| LLM-as-a-judge | A method, not a tool | The grading technique many of these tools use under the hood |
The practical takeaway: reach for Promptfoo when your main job is comparing variants side by side and blocking regressions before deploy, especially if you like declaring tests in config rather than code, and if security red-teaming is on your list. It is not the only way to evaluate, and for deeply RAG-specific metrics or a code-first Python workflow you might pair it with a metric library. Choosing the right eval metrics matters more than the tool you run them in.
Going deeper
Once the basic eval loop is comfortable, a few directions are worth knowing.
Scenarios and dataset generation. Hand-writing test cases gets tedious. Promptfoo can pull test inputs from CSV or other datasets, and it can help generate adversarial cases for red-teaming so you do not have to imagine every attack yourself. Treat generated cases as a starting point you review, not gospel.
Custom and chained assertions. Beyond the built-in checks you can plug in your own grading logic — a JavaScript or Python function, a custom rubric, or a check that runs another model — and combine several assertions per test so an output must clear multiple bars at once. This is how you encode domain-specific definitions of "correct."
Evaluating agents and pipelines, not just prompts. A provider in Promptfoo does not have to be a raw model call — it can be an HTTP endpoint or a script wrapping your whole application, including RAG retrieval or an agent loop. That lets you test the real system end to end rather than a prompt in isolation.
The watch-outs. Model-graded assertions inherit the judge model's blind spots — calibrate them against human labels before trusting the scores, and re-check when the judge model changes. Non-determinism means a single run can flap, so think about how many samples you need; eval sample size covers the statistics. And remember that an eval suite only measures what you put in it: a green grid on the wrong test cases is false confidence. The durable lesson is the same one behind all evaluation — your tool is only as good as the dataset and assertions you feed it, so most of your effort belongs in building a test set that reflects what users actually do.
FAQ
What is Promptfoo used for?
Promptfoo is an open-source CLI and library for evaluating and comparing LLM prompts and models, and for red-teaming LLM apps. You declare prompts, models, test inputs, and pass/fail checks in a config, and it runs every combination and scores them in a side-by-side grid. Teams use it to pick the best prompt-and-model pairing and to block quality regressions in CI/CD.
Is Promptfoo free and open source?
Yes. Promptfoo is open source and runs locally from your terminal or as a library in your code. It is not a model and not a hosted service you must sign up for — it calls whatever model providers you already use, so your main cost is the API calls those evals make.
How is Promptfoo different from DeepEval?
Both evaluate LLM outputs, but they have different shapes. Promptfoo is config-driven (you describe a test matrix in YAML and run it from the CLI) and includes red-teaming, which suits comparing many prompt/model combinations and gating CI. DeepEval is Pytest-style, so you write evals as Python unit tests, which fits a code-first workflow. Many teams pick based on whether they prefer config or code.
Can Promptfoo test for prompt injection and jailbreaks?
Yes. Its red-teaming mode generates adversarial inputs — jailbreaks, prompt-injection payloads, attempts to leak a system prompt, requests for harmful content — runs them against your app, and reports which attacks succeed. It is meant to surface these weaknesses during testing rather than after a real user finds them.
What are assertions in Promptfoo?
Assertions are the checks that decide whether each output passes. Deterministic ones are code-graded (contains a string, valid JSON, matches a regex, under a latency or cost budget) and are fast and repeatable. Model-graded ones use an LLM as a judge to score fuzzy qualities like factual grounding or rubric adherence. Most real suites mix both.
Can I run Promptfoo in CI/CD?
Yes, and that is a core use case. The eval command runs from one line and exits with a non-zero status when too many cases fail, so it slots into a pipeline. Run it on pull requests that touch prompts, and a change that drops quality below your threshold blocks the merge.