promptfoo

Test, compare, and red-team your LLM prompts and apps from the command line

github.com/promptfoo/promptfoo · ★ 22.4k · promptfoo.dev

Overview

promptfoo is an open-source CLI and library for evaluating and red-teaming LLM applications. Instead of tweaking prompts by trial and error, you write test cases and run automated evals to see, with metrics, how different prompts and models behave on the same inputs.

It is built for developers who ship LLM features and want repeatable checks rather than gut-feel comparisons. Evals run on your own machine rather than through a hosted service, so your prompts and data go only to the model providers you call. The tool works with any LLM API and any programming language.

As an eval framework, it covers two jobs in one: side-by-side model and prompt comparison, plus red-teaming probes that scan for issues like prompt injection and PII leaks. It plugs into CI/CD so the same checks run on every change.

What it does

Automated evaluations of prompts and models with metrics you can compare
Side-by-side comparison across providers including OpenAI, Anthropic, Azure, Bedrock, and Ollama
Red teaming and vulnerability scanning for prompt injection, PII leaks, and other risks
Runs evals locally, so prompts go only to the model providers you call
CI/CD integration to automate checks on every change
Available as a CLI and a Node.js library, with live reload and caching

Getting started

Install promptfoo, scaffold an example project, set a provider key, then run an eval and view the results. For npm/npx usage, promptfoo requires Node.js ^20.20.0 (20.20.0 or later within the 20.x line) or >=22.22.0.

Install and scaffold an example

Install the CLI globally and initialize the getting-started example. You can also use brew install promptfoo, pip install promptfoo, or run any command with npx promptfoo@latest without installing.

bash

npm install -g promptfoo
promptfoo init --example getting-started

Set your provider API key

Most LLM providers require an API key. Set yours as an environment variable before running an eval.

bash

export OPENAI_API_KEY=sk-abc123
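A missing key usually shows up only as a provider error partway through a run, so it can help to check for it first. The sketch below is one way to do that; require_env is a hypothetical helper written for this example, not a promptfoo command.

```bash
# Hypothetical helper: succeed only if the named variable is exported and non-empty.
require_env() {
  # printenv sees only exported variables, which is also all that a child
  # process such as the promptfoo CLI would inherit from this shell.
  if [ -z "$(printenv "$1")" ]; then
    echo "missing environment variable: $1" >&2
    return 1
  fi
}

# Usage (not executed here):
#   require_env OPENAI_API_KEY && promptfoo eval
```

Checking with printenv rather than a plain variable test catches the case where the key was assigned without export and would therefore not reach the CLI.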

Run an eval and view results

Move into the example directory, run the evaluation, then open the local viewer to inspect the results.

bash

cd getting-started
promptfoo eval
promptfoo view
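Once the example runs, the next step is usually a config of your own. The following is a minimal sketch of a promptfooconfig.yaml written by hand, with two prompts evaluated on the same input. The directory name my-eval, the prompt wording, the model ID openai:gpt-4o-mini, and the test values are all illustrative assumptions; the scaffolded example may look different.

```bash
# Sketch: create a small eval project by hand.
# Directory name, prompts, model ID, and test data are illustrative assumptions.
mkdir -p my-eval

# The quoted 'EOF' keeps the shell from expanding anything inside the heredoc,
# so the {{text}} template variable is written to the file literally.
cat > my-eval/promptfooconfig.yaml <<'EOF'
description: Compare two prompts on the same inputs
prompts:
  - "Summarize in one sentence: {{text}}"
  - "Give a one-line TL;DR of: {{text}}"
providers:
  - openai:gpt-4o-mini
tests:
  - vars:
      text: "promptfoo runs evals for LLM prompts."
    assert:
      - type: icontains
        value: promptfoo
EOF

echo "wrote my-eval/promptfooconfig.yaml"
```

Running promptfoo eval from inside my-eval should then evaluate both prompts against the test case. Adding a second entry under providers would extend the comparison to another model.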

Commands and code are distilled from the project's own documentation; always check the official repo for the latest version.

When to use it

Compare several prompts or models side-by-side on the same test cases before choosing one
Red-team an LLM app to scan for prompt injection, PII leaks, and other vulnerabilities
Add automated prompt and model checks to a CI/CD pipeline so regressions fail the build
Catch quality drops when upgrading or swapping the underlying model behind a feature
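The CI case above can be sketched as a small wrapper script. It relies on promptfoo eval returning a non-zero exit code when test cases fail, which is worth confirming against the official docs for your version. The function name ci_eval and the file names are assumptions made for this example.

```bash
# Hypothetical CI wrapper: run the eval and pass its exit code on to the build.
ci_eval() {
  # -c selects the config file; -o writes the results to a file for later inspection.
  if npx promptfoo@latest eval -c promptfooconfig.yaml -o results.json; then
    echo "evals passed"
  else
    code=$?   # exit status of the eval command that just failed
    echo "evals failed (exit code $code)" >&2
    return "$code"
  fi
}

# Usage in a CI step (not executed here):
#   ci_eval || exit $?
```

Using npx here avoids a separate install step on the CI runner, at the cost of fetching the package on each run.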

How promptfoo compares

promptfoo alongside other open-source evaluation & red-teaming tools AI/TLDR tracks, ranked by GitHub stars.

Tool	Stars	What it does
Strix	★ 26.1k	Strix runs autonomous AI agents that act like hackers, dynamically running your code to find vulnerabilities and validate them with real proof-of-concepts.
promptfoo	★ 22.4k	Test, compare, and red-team your LLM prompts and apps from the command line
OpenAI Evals	★ 18.7k	A framework and open registry for building and running evaluations of LLMs and LLM-based systems, including prompt chains and tool-using agents.
DeepEval	★ 16.3k	An open-source Python framework that tests LLM apps like unit tests, with 50+ metrics for RAG, agents, chatbots, and safety, and a Pytest integration for CI/CD.
Ragas	★ 14.4k	An evaluation toolkit focused on retrieval-augmented generation that scores answer faithfulness, context precision/recall, and relevancy, often without needing ground-truth labels.
Arize Phoenix	★ 10.2k	An open-source observability and evaluation tool for tracing LLM and agent behavior, running evals on traces, and troubleshooting issues in development and production.
garak	★ 8.2k	An LLM vulnerability scanner from NVIDIA with 100+ attack probes that test models for prompt injection, data leakage, jailbreaks, and other security weaknesses.
Giskard	★ 5.4k	An open-source library for testing and scanning LLM and ML models for issues like hallucination, bias, and toxicity, including multi-turn agent testing and a vulnerability scanner.
