Detoxify

Score text for toxicity, insults, threats, and hate speech with pretrained transformer models

github.com/unitaryai/detoxify · ★ 1.3k · unitary.ai

Overview

Detoxify is an open-source Python library from Unitary that classifies text for toxicity. It wraps pretrained transformer models (built on Hugging Face Transformers and PyTorch Lightning) that return scores for categories such as toxicity, insults, threats, obscenity, and identity attacks.

It ships three main model variants trained on the Jigsaw competition datasets: `original` (Toxic Comment Classification), `unbiased` (trained to minimise unintended bias against identity mentions), and `multilingual` (covering English, French, Spanish, Italian, Portuguese, Turkish, and Russian). Smaller ALBERT-based variants (`original-small`, `unbiased-small`) are also available.

It fits the guardrails and content-moderation space: teams use it to flag harmful user content or to screen the inputs and outputs of LLM-powered apps. The authors note the models are intended for research and to assist human moderators, not as a sole automated judge.
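As a rough illustration of the screening pattern described above, the sketch below wraps an LLM call with a toxicity check on both the prompt and the response. The names `screen_llm_call`, `generate`, `score_fn`, the `0.8` threshold, and the blocked message are assumptions made for this example, not part of Detoxify. In real use `score_fn` would be something like `Detoxify('original').predict`; here it is passed in so the control flow can be read and tested without downloading a model.

```python
from typing import Callable, Dict

BLOCKED_MESSAGE = "[blocked: content exceeded the toxicity threshold]"  # hypothetical wording


def is_toxic(scores: Dict[str, float], threshold: float) -> bool:
    """Return True if any category score meets or exceeds the threshold."""
    return any(value >= threshold for value in scores.values())


def screen_llm_call(
    prompt: str,
    generate: Callable[[str], str],
    score_fn: Callable[[str], Dict[str, float]],
    threshold: float = 0.8,  # illustrative value, tune on your own data
) -> str:
    """Check the prompt before calling the model and the response after."""
    if is_toxic(score_fn(prompt), threshold):
        return BLOCKED_MESSAGE
    response = generate(prompt)
    if is_toxic(score_fn(response), threshold):
        return BLOCKED_MESSAGE
    return response
```

Since the authors position the models as an aid to human moderators, a production version would likely log blocked items for review rather than silently discard them.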

What it does

- Pretrained models that score text across multiple categories including toxicity, insults, threats, obscenity, and identity attacks
- Three model families: `original`, `unbiased` (reduced identity bias), and `multilingual` (7 languages)
- Lightweight `original-small` and `unbiased-small` ALBERT-based variants for smaller footprints
- Simple `predict()` API that accepts a single string or a list of strings
- Runs on CPU by default, with an optional `device` argument to target a GPU via any `torch.device` input
- Built on Hugging Face Transformers and PyTorch Lightning
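Because `predict()` accepts a list, a very long list may need to be scored in chunks to keep memory use bounded. The following is a minimal sketch of that idea, assuming that list input yields a dictionary mapping each category to a list of scores in input order. The helper name `predict_in_batches` and the default `batch_size` are hypothetical and are not Detoxify settings.

```python
from typing import Callable, Dict, List, Sequence


def predict_in_batches(
    predict_fn: Callable[[List[str]], Dict[str, List[float]]],
    texts: Sequence[str],
    batch_size: int = 32,  # illustrative default, not a Detoxify setting
) -> Dict[str, List[float]]:
    """Call predict_fn on fixed-size chunks and concatenate the per-category lists."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    merged: Dict[str, List[float]] = {}
    for start in range(0, len(texts), batch_size):
        chunk = list(texts[start:start + batch_size])
        for category, values in predict_fn(chunk).items():
            merged.setdefault(category, []).extend(values)
    return merged
```

With the library installed, `predict_fn` could be the bound method `Detoxify('original').predict`, created once and reused across chunks so the model is loaded a single time.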

Getting started

Install the package from PyPI, then load a model and call `predict()` on your text.

Install Detoxify

Install the library with pip.

```bash
pip install detoxify
```

Score some text

Load a model by name and pass a string or a list of strings to `predict()`. Each model returns a dictionary of category scores.

```python
from detoxify import Detoxify

# each model takes in either a string or a list of strings
results = Detoxify('original').predict('example text')

results = Detoxify('unbiased').predict(['example text 1', 'example text 2'])
```
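The returned dictionary is keyed by category, which is convenient for tables but less so when you want one record per input text. The sketch below reshapes it, assuming a single string yields one number per category and a list yields one list per category aligned with the inputs. `to_records` is a hypothetical helper written for this example, not a Detoxify function.

```python
from typing import Dict, List, Sequence, Union

Scores = Dict[str, Union[float, List[float]]]


def to_records(texts: Union[str, Sequence[str]], results: Scores) -> List[Dict[str, object]]:
    """Pair each input text with its own {category: score} mapping."""
    if isinstance(texts, str):
        # Single-string input: each category is assumed to map to one number.
        return [{"text": texts, **{k: float(v) for k, v in results.items()}}]
    # List input: each category is assumed to map to a list aligned with texts.
    return [
        {"text": text, **{k: float(v[i]) for k, v in results.items()}}
        for i, text in enumerate(texts)
    ]
```

For example, `to_records(texts, Detoxify('unbiased').predict(texts))` would give a list of dictionaries that can be written to JSON lines or a database row by row.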

Choose a device (optional)

Models default to CPU. Pass a `device` argument to allocate the model elsewhere; it accepts any `torch.device` input.

```python
from detoxify import Detoxify

model = Detoxify('original', device='cuda')
```
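Hard-coding `'cuda'` fails on machines without a usable GPU. One possible way to fall back to CPU is sketched below; `pick_device` is a hypothetical helper for this example, and the only library call it relies on is PyTorch's `torch.cuda.is_available()`.

```python
def pick_device(prefer_gpu: bool = True) -> str:
    """Return 'cuda' when requested and usable, otherwise 'cpu'."""
    if not prefer_gpu:
        return "cpu"
    try:
        import torch  # installed as a Detoxify dependency
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"
```

The result can then be passed straight through, for example `Detoxify('original', device=pick_device())`.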

Display results as a table (optional)

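Install pandas (`pip install pandas`) to print the per-category scores as a readable table. Pass the same list of input strings as the index so each row is labelled with its text.

```python
import pandas as pd

input_text = ['example text 1', 'example text 2']  # same list passed to predict()
results = Detoxify('unbiased').predict(input_text)  # Detoxify imported as in the scoring step
print(pd.DataFrame(results, index=input_text).round(5))
```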

Commands and code are distilled from the project's own documentation — always check the official repo for the latest.

When to use it

- Flagging toxic, abusive, or hateful user comments so human moderators can review them faster
- Screening inputs and outputs of LLM-powered applications for harmful content
- Scoring multilingual user content across the 7 languages the multilingual model supports
- Research on toxicity detection and bias, or fine-tuning on a domain-specific dataset
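For the moderation use case, scores usually need to be turned into a short, prioritised queue for a person to look at. The sketch below shows one way to do that. The threshold values, the helper names, and the category keys (`toxicity`, `threat`, `identity_attack`) are assumptions for illustration; check the keys your chosen model actually returns and tune the cut-offs on your own labelled data.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical per-category cut-offs; real values should come from evaluating
# the model on labelled examples from your own platform.
REVIEW_THRESHOLDS: Dict[str, float] = {
    "toxicity": 0.7,
    "threat": 0.5,
    "identity_attack": 0.5,
}


def categories_over_threshold(
    scores: Dict[str, float],
    thresholds: Dict[str, float] = REVIEW_THRESHOLDS,
) -> List[str]:
    """Return the sorted category names whose score meets its cut-off."""
    return sorted(
        cat for cat, limit in thresholds.items() if scores.get(cat, 0.0) >= limit
    )


def build_review_queue(
    scored: Sequence[Tuple[str, Dict[str, float]]],
    thresholds: Dict[str, float] = REVIEW_THRESHOLDS,
) -> List[Tuple[str, List[str]]]:
    """Keep only flagged texts, highest-scoring first, for a human to review."""
    flagged = []
    for text, scores in scored:
        hits = categories_over_threshold(scores, thresholds)
        if hits:
            flagged.append((max(scores[c] for c in hits), text, hits))
    flagged.sort(key=lambda item: item[0], reverse=True)
    return [(text, hits) for _, text, hits in flagged]
```

Nothing here removes content automatically: the output is a list for moderators, in line with the authors' note that the models should assist human review rather than replace it.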

How Detoxify compares

Detoxify alongside other open-source guardrails & security tools AI/TLDR tracks, ranked by GitHub stars.

| Tool | Stars | What it does |
| --- | --- | --- |
| Microsoft Presidio | ★ 9.3k | A framework for detecting, redacting, masking, and anonymizing personal data (PII) in text, images, and structured data using NER models, regex, and rule-based recognizers. |
| Guardrails AI | ★ 7k | A Python framework that wraps LLM calls with composable input/output validators (from the Guardrails Hub) to check structure, type, and safety risks before responses reach users. |
| NeMo Guardrails | ★ 6.5k | NVIDIA's toolkit for adding programmable rails to LLM chat apps, using the Colang language to control dialog flow and block jailbreaks, prompt injection, and off-topic answers. |
| GLiNER | ★ 3.3k | A small zero-shot named-entity recognition model that can extract arbitrary entity types from text and is widely used as a PII detection backend, including inside Presidio. |
| LLM Guard | ★ 3.1k | A security toolkit from Protect AI with 35+ input and output scanners that sanitize prompts and responses for prompt injection, toxicity, PII leakage, and harmful content. |
| Rebuff | ★ 1.5k | A prompt injection detector that combines heuristics, an LLM-based classifier, a vector store of past attacks, and canary tokens to catch attempts to subvert an LLM application. |
| Detoxify | ★ 1.3k | Score text for toxicity, insults, threats, and hate speech with pretrained transformer models. |
| Vigil | ★ 482 | A Python library and REST API that scans LLM prompts and responses with YARA rules, transformer classifiers, and vector similarity to flag prompt injections and jailbreaks. |
