Overview
promptfoo is an open-source CLI and library for evaluating and red-teaming LLM applications. Instead of tweaking prompts by trial and error, you write test cases and run automated evals to see, with metrics, how different prompts and models behave on the same inputs.
It is built for developers who ship LLM features and want repeatable checks rather than gut-feel comparisons. Evals run fully on your machine, so your prompts and data never leave your environment, and the tool works with any LLM API and any programming language.
As an eval framework, it covers two jobs in one: side-by-side model and prompt comparison, plus red-teaming probes that scan for issues like prompt injection and PII leaks. It plugs into CI/CD so the same checks run on every change.
What it does
- Automated evaluations of prompts and models with metrics you can compare
- Side-by-side comparison across providers including OpenAI, Anthropic, Azure, Bedrock, and Ollama
- Red teaming and vulnerability scanning for prompt injection, PII leaks, and other risks
- Runs evals 100% locally, so prompts stay on your machine
- CI/CD integration to automate checks on every change
- Available as a CLI and a Node.js library, with live reload and caching
Getting started
Install promptfoo, scaffold an example project, set a provider key, then run an eval and view the results. Requires Node.js ^20.20.0 or >=22.22.0 for npm/npx usage.
Install and scaffold an example
Install the CLI globally and initialize the getting-started example. You can also use brew install promptfoo, pip install promptfoo, or run any command with npx promptfoo@latest without installing.
npm install -g promptfoo
promptfoo init --example getting-started
Set your provider API key
Most LLM providers require an API key. Set yours as an environment variable before running an eval.
export OPENAI_API_KEY=sk-abc123
Run an eval and view results
Move into the example directory, run the evaluation, then open the local viewer to inspect the results.
cd getting-started
promptfoo eval
promptfoo view
Commands and code are distilled from the project's own documentation; always check the official repo for the latest.
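An eval is driven by a config file that lists prompts, providers, and test cases. As a rough sketch of what such a file can look like, the snippet below writes a minimal promptfooconfig.yaml into a scratch directory. The directory name, prompt text, model IDs, and the assertion are illustrative assumptions, not the contents of the official getting-started example.

```shell
# Sketch only: the directory name, prompt text, model IDs, and assertion
# are illustrative assumptions, not the official example's contents.
mkdir -p promptfoo-sketch

# Quoted 'EOF' keeps the shell from expanding anything inside the heredoc,
# so the {{text}} template variable is written literally.
cat > promptfoo-sketch/promptfooconfig.yaml <<'EOF'
description: Compare two models on the same prompt
prompts:
  - "Write a one-sentence summary of: {{text}}"
providers:
  - openai:gpt-4o-mini
  - openai:gpt-4o
tests:
  - vars:
      text: "promptfoo runs automated evals for LLM prompts."
    assert:
      - type: icontains
        value: promptfoo
EOF

echo "wrote promptfoo-sketch/promptfooconfig.yaml"
```

With OPENAI_API_KEY set, running promptfoo eval and then promptfoo view from inside that directory should run both providers against the single test case and show the outputs side by side.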
When to use it
- Compare several prompts or models side-by-side on the same test cases before choosing one
- Red-team an LLM app to scan for prompt injection, PII leaks, and other vulnerabilities
- Add automated prompt and model checks to a CI/CD pipeline so regressions fail the build
- Catch quality drops when upgrading or swapping the underlying model behind a feature
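For the CI/CD case, one possible shape is a small wrapper that runs the eval command and passes a non-zero exit status through to the pipeline. This is a sketch under assumptions: run_eval_gate is a hypothetical helper name, not part of promptfoo, and it relies on the eval command exiting non-zero when test cases fail, which is worth confirming against the promptfoo documentation for your version.

```shell
# Hypothetical helper (not part of promptfoo): run the given eval command
# and pass its exit status through so the CI job fails on regressions.
run_eval_gate() {
  if "$@"; then
    echo "eval gate: passed"
  else
    status=$?
    echo "eval gate: failed with exit status ${status}" >&2
    return "${status}"
  fi
}

# Example CI step (assumes promptfooconfig.yaml exists in the repo root):
#   run_eval_gate npx promptfoo@latest eval -c promptfooconfig.yaml -o results.json
```

Keeping the gate as a thin wrapper means the same function works unchanged whether the pipeline calls a globally installed promptfoo or goes through npx.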
How promptfoo compares
The table below lists promptfoo alongside other open-source evaluation and red-teaming tools that AI/TLDR tracks, ranked by GitHub stars.
| Tool | Stars | What it does |
|---|---|---|
| Strix | ★ 26.1k | Strix runs autonomous AI agents that act like hackers, dynamically running your code to find vulnerabilities and validate them with real proof-of-concepts. |
| promptfoo | ★ 22.4k | Test, compare, and red-team your LLM prompts and apps from the command line |
| OpenAI Evals | ★ 18.7k | A framework and open registry for building and running evaluations of LLMs and LLM-based systems, including prompt chains and tool-using agents. |
| DeepEval | ★ 16.3k | An open-source Python framework that tests LLM apps like unit tests, with 50+ metrics for RAG, agents, chatbots, and safety, and a Pytest integration for CI/CD. |
| Ragas | ★ 14.4k | An evaluation toolkit focused on retrieval-augmented generation that scores answer faithfulness, context precision/recall, and relevancy, often without needing ground-truth labels. |
| Arize Phoenix | ★ 10.2k | An open-source observability and evaluation tool for tracing LLM and agent behavior, running evals on traces, and troubleshooting issues in development and production. |
| garak | ★ 8.2k | An LLM vulnerability scanner from NVIDIA with 100+ attack probes that test models for prompt injection, data leakage, jailbreaks, and other security weaknesses. |
| Giskard | ★ 5.4k | An open-source library for testing and scanning LLM and ML models for issues like hallucination, bias, and toxicity, including multi-turn agent testing and a vulnerability scanner. |