Overview
Giskard is an open-source Python library for testing and evaluating agentic systems. It can wrap an LLM, a black-box agent, or a multi-step pipeline, and check how it behaves on inputs where the same prompt may produce different valid answers.
The v3 release is a modular set of focused packages. `giskard-checks` (Beta) handles scenario-based evals and LLM-as-judge assessments, `giskard-scan` (in progress) does adversarial red teaming, and `giskard-rag` (planned) covers RAG evaluation and synthetic test data. Each package only pulls in the dependencies it needs.
It belongs in the evaluation and testing category and suits teams that need to catch regressions, validate RAG groundedness, enforce safety rules, and test full multi-turn conversations rather than single exchanges.
What it does
- Scenario API for writing evals against non-deterministic LLM outputs, including multi-turn agent conversations
- Built-in checks: string matching, comparisons, regex, semantic similarity, and LLM-as-judge (Groundedness, Conformity, LLMJudge)
- Vulnerability scanner that auto-generates adversarial test suites from a plain-language description of your agent
- Red-teaming coverage for prompt injection, harmful content, stereotypes, and misinformation across OWASP LLM Top-10 categories
- Async-first, lightweight design in which each package pulls in only the dependencies it needs
- Extensible with custom ScenarioGenerator instances for your own adversarial tests
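To make the idea behind the simpler built-in checks concrete, here is a minimal plain-Python sketch. It does not use Giskard's API: `contains_check`, `regex_check`, `exact_match`, and the sample answers are hypothetical names and data made up for illustration. The point is that a check asserts a property of the answer rather than one exact string, so several differently worded outputs can all pass.

```python
import re

# Hypothetical sample outputs: three valid phrasings and one wrong answer.
ANSWERS = [
    "The capital of France is Paris.",
    "Paris",
    "It's Paris, of course!",
    "The capital of France is Lyon.",
]


def exact_match(output: str, expected: str) -> bool:
    """Naive equality, shown for contrast; brittle for LLM outputs."""
    return output == expected


def contains_check(output: str, expected: str) -> bool:
    """String-matching check: case-insensitive substring test."""
    return expected.lower() in output.lower()


def regex_check(output: str, pattern: str) -> bool:
    """Regex check: True if the pattern matches anywhere in the output."""
    return re.search(pattern, output) is not None


# One row per answer: (answer, exact, contains, regex).
results = [
    (
        answer,
        exact_match(answer, "Paris"),
        contains_check(answer, "paris"),
        regex_check(answer, r"\bParis\b"),
    )
    for answer in ANSWERS
]

if __name__ == "__main__":
    for answer, exact, contains, regex in results:
        print(f"{answer!r}: exact={exact} contains={contains} regex={regex}")
```

Exact equality accepts only one of the three valid phrasings, while the substring and regex checks accept all three and still reject the wrong answer.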
Getting started
Install Giskard with pip, then write a scenario that runs your model and checks its output. Giskard requires Python 3.12+.
Install
Install the full Giskard package, or just the checks library if you only need evals.

    pip install giskard

Write and run a scenario check
Define a function that returns your model's output, then build a Scenario with a check such as Groundedness. The run() method is async, so wrap it with asyncio.run() in a script.

    import asyncio

    from openai import OpenAI
    from giskard.checks import Scenario, Groundedness

    client = OpenAI()

    def get_answer(inputs: str) -> str:
        # Send the prompt to the model and return the text of its reply.
        response = client.chat.completions.create(
            model="gpt-5-mini",
            messages=[{"role": "user", "content": inputs}],
        )
        return response.choices[0].message.content

    scenario = (
        Scenario("test_dynamic_output")
        .interact(
            inputs="What is the capital of France?",
            outputs=get_answer,
        )
        .check(
            Groundedness(
                name="answer is grounded",
                context="France is a country in Western Europe. Its capital is Paris.",
            )
        )
    )

    async def main():
        # run() is async, so await it inside a coroutine.
        result = await scenario.run()
        result.print_report()

    asyncio.run(main())

Scan an agent for vulnerabilities
Install giskard-scan and run vulnerability_scan against your agent with a plain-language description to generate adversarial test suites.

    import asyncio

    from giskard.scan import vulnerability_scan

    async def main():
        # my_agent is a placeholder for your own agent, defined elsewhere.
        await vulnerability_scan(
            target=my_agent,
            description="A customer support chatbot for an e-commerce platform.",
            languages=["en"],
        )

    asyncio.run(main())

Commands and code are distilled from the project's own documentation; always check the official repo for the latest.
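Because the same prompt can produce different outputs from run to run, a single pass or fail can be noisy. One common pattern, sketched here in plain Python and not taken from Giskard's API, is to repeat a check several times and gate on the pass rate. `pass_rate`, `stub_check`, and the 0.8 threshold are hypothetical choices for illustration; the stub replaces a real model call so the example is deterministic.

```python
from itertools import cycle
from typing import Callable


def pass_rate(check_once: Callable[[], bool], runs: int) -> float:
    """Run a boolean check `runs` times and return the fraction that passed."""
    passed = sum(1 for _ in range(runs) if check_once())
    return passed / runs


# Hypothetical stand-in for "call the model, then check the output":
# it passes 9 times out of every 10 calls, in a fixed order.
_outcomes = cycle([True] * 9 + [False])


def stub_check() -> bool:
    return next(_outcomes)


THRESHOLD = 0.8  # assumed acceptance threshold, tune per project

rate = pass_rate(stub_check, runs=10)
gate_ok = rate >= THRESHOLD

if __name__ == "__main__":
    print(f"pass rate: {rate:.2f}, gate passed: {gate_ok}")
```

In a real setup, `check_once` would wrap an actual model call plus a check, and the threshold would reflect how much variation the team is willing to tolerate.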
When to use it
- Catch regressions in an LLM app by re-running scenario checks after each change
- Validate that a RAG system's answers stay grounded in retrieved context
- Red-team a customer-support agent for prompt injection and harmful content before launch
- Test full multi-turn agent conversations rather than single prompt-response pairs
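The multi-turn case can be sketched without any framework. The example below is plain Python and not Giskard's API: `Turn`, `StubAgent`, and `run_conversation` are hypothetical names, and the stub agent is a toy that only remembers the user's name. It shows why a conversation has to be tested as a whole: the second turn can only pass if the agent carried state over from the first.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Turn:
    user: str                     # message sent to the agent
    check: Callable[[str], bool]  # property the reply must satisfy


@dataclass
class StubAgent:
    """Hypothetical stand-in for a real agent; it only remembers a name."""
    history: list[str] = field(default_factory=list)

    async def reply(self, message: str) -> str:
        self.history.append(message)
        prefix = "my name is "
        if message.lower().startswith(prefix):
            return "Nice to meet you!"
        if "my name" in message.lower():
            # Look back through earlier turns for an introduction.
            for past in self.history:
                if past.lower().startswith(prefix):
                    return f"Your name is {past[len(prefix):].rstrip('.')}."
            return "I don't know your name yet."
        return "How can I help?"


async def run_conversation(agent: StubAgent, turns: list[Turn]) -> list[bool]:
    """Send each turn in order and record whether its check passed."""
    outcomes = []
    for turn in turns:
        reply = await agent.reply(turn.user)
        outcomes.append(turn.check(reply))
    return outcomes


TURNS = [
    Turn("My name is Ada.", lambda r: len(r) > 0),
    Turn("What is my name?", lambda r: "ada" in r.lower()),
]

outcomes = asyncio.run(run_conversation(StubAgent(), TURNS))

if __name__ == "__main__":
    print(outcomes)
```

Running the same turns in reverse order makes the name check fail, which is the kind of context-dependent behavior a single prompt-response test cannot observe.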
How Giskard compares
Giskard alongside the other open-source evaluation and red-teaming tools that AI/TLDR tracks, ranked by GitHub stars.
| Tool | Stars | What it does |
|---|---|---|
| Strix | ★ 26.1k | Strix runs autonomous AI agents that act like hackers, dynamically running your code to find vulnerabilities and validate them with real proof-of-concepts. |
| promptfoo | ★ 22.4k | A developer-first CLI and library for testing and comparing prompts and models, with red-teaming probes for prompt injection, PII leaks, and other vulnerabilities. |
| OpenAI Evals | ★ 18.7k | A framework and open registry for building and running evaluations of LLMs and LLM-based systems, including prompt chains and tool-using agents. |
| DeepEval | ★ 16.3k | An open-source Python framework that tests LLM apps like unit tests, with 50+ metrics for RAG, agents, chatbots, and safety, and a Pytest integration for CI/CD. |
| Ragas | ★ 14.4k | An evaluation toolkit focused on retrieval-augmented generation that scores answer faithfulness, context precision/recall, and relevancy, often without needing ground-truth labels. |
| Arize Phoenix | ★ 10.2k | An open-source observability and evaluation tool for tracing LLM and agent behavior, running evals on traces, and troubleshooting issues in development and production. |
| garak | ★ 8.2k | An LLM vulnerability scanner from NVIDIA with 100+ attack probes that test models for prompt injection, data leakage, jailbreaks, and other security weaknesses. |
| Giskard | ★ 5.4k | Evals, red teaming, and test generation for agentic AI systems |