Giskard

Evals, red teaming, and test generation for agentic AI systems

github.com/Giskard-AI/giskard-oss★ 5.4k giskard.ai

Overview

Giskard is an open-source Python library for testing and evaluating agentic systems. It can wrap an LLM, a black-box agent, or a multi-step pipeline, and check how it behaves on inputs where the same prompt may produce different valid answers.

The v3 release is a modular set of focused packages. `giskard-checks` (Beta) handles scenario-based evals and LLM-as-judge assessments, `giskard-scan` (in progress) does adversarial red teaming, and `giskard-rag` (planned) covers RAG evaluation and synthetic test data. Each package only pulls in the dependencies it needs.

It fits the evaluation and testing category for teams who need to catch regressions, validate RAG groundedness, enforce safety rules, and test full multi-turn conversations rather than single exchanges.

What it does

Scenario API for writing evals against non-deterministic LLM outputs, including multi-turn agent conversations
Built-in checks: string matching, comparisons, regex, semantic similarity, and LLM-as-judge (Groundedness, Conformity, LLMJudge)
Vulnerability scanner that auto-generates adversarial test suites from a plain-language description of your agent
Red-teaming coverage for prompt injection, harmful content, stereotypes, and misinformation across OWASP LLM Top-10 categories
Async-first, lightweight design that drops heavy dependencies for better efficiency
Extensible with custom ScenarioGenerator instances for your own adversarial tests

Getting started

Install Giskard with pip, then write a scenario that runs your model and checks its output. Giskard requires Python 3.12+.

Install

Install the full Giskard package, or just the checks library if you only need evals.

bashbash

pip install giskard

Write and run a scenario check

Define a function that returns your model's output, then build a Scenario with a check such as Groundedness. The run() method is async, so wrap it with asyncio.run() in a script.

pythonpython

from openai import OpenAI
from giskard.checks import Scenario, Groundedness

client = OpenAI()

def get_answer(inputs: str) -> str:
    response = client.chat.completions.create(
        model="gpt-5-mini",
        messages=[{"role": "user", "content": inputs}],
    )
    return response.choices[0].message.content

scenario = (
    Scenario("test_dynamic_output")
    .interact(
        inputs="What is the capital of France?",
        outputs=get_answer,
    )
    .check(
        Groundedness(
            name="answer is grounded",
            context="France is a country in Western Europe. Its capital is Paris.",
        )
    )
)

result = await scenario.run()
result.print_report()

Scan an agent for vulnerabilities

Install giskard-scan and run vulnerability_scan against your agent with a plain-language description to generate adversarial test suites.

pythonpython

import asyncio
from giskard.scan import vulnerability_scan

async def main():
    await vulnerability_scan(
        target=my_agent,
        description="A customer support chatbot for an e-commerce platform.",
        languages=["en"],
    )

asyncio.run(main())

Commands and code are distilled from the project's own documentation — always check the official repo for the latest.

When to use it

Catch regressions in an LLM app by re-running scenario checks after each change
Validate that a RAG system's answers stay grounded in retrieved context
Red-team a customer-support agent for prompt injection and harmful content before launch
Test full multi-turn agent conversations rather than single prompt-response pairs

How Giskard compares

Giskard alongside other open-source evaluation & red-teaming tools AI/TLDR tracks, ranked by GitHub stars.

Tool	Stars	What it does
Strix	★ 26.1k	Strix runs autonomous AI agents that act like hackers, dynamically running your code to find vulnerabilities and validate them with real proof-of-concepts.
promptfoo	★ 22.4k	A developer-first CLI and library for testing and comparing prompts and models, with red-teaming probes for prompt injection, PII leaks, and other vulnerabilities.
OpenAI Evals	★ 18.7k	A framework and open registry for building and running evaluations of LLMs and LLM-based systems, including prompt chains and tool-using agents.
DeepEval	★ 16.3k	An open-source Python framework that tests LLM apps like unit tests, with 50+ metrics for RAG, agents, chatbots, and safety, and a Pytest integration for CI/CD.
Ragas	★ 14.4k	An evaluation toolkit focused on retrieval-augmented generation that scores answer faithfulness, context precision/recall, and relevancy, often without needing ground-truth labels.
Arize Phoenix	★ 10.2k	An open-source observability and evaluation tool for tracing LLM and agent behavior, running evals on traces, and troubleshooting issues in development and production.
garak	★ 8.2k	An LLM vulnerability scanner from NVIDIA with 100+ attack probes that test models for prompt injection, data leakage, jailbreaks, and other security weaknesses.
Giskard	★ 5.4k	Evals, red teaming, and test generation for agentic AI systems

// Overview

// What it does

// Getting started