AI/TLDR

Detoxify

Score text for toxicity, insults, threats, and hate speech with pretrained transformer models

Overview

Detoxify is an open-source Python library from Unitary that classifies text for toxicity. It wraps pretrained transformer models (built on Hugging Face Transformers and PyTorch Lightning) that return scores for categories such as toxicity, insults, threats, obscenity, and identity attacks.

It ships three main model variants trained on the Jigsaw competition datasets: `original` (Toxic Comment Classification), `unbiased` (trained to minimise unintended bias against identity mentions), and `multilingual` (covering English, French, Spanish, Italian, Portuguese, Turkish, and Russian). Smaller ALBERT-based variants (`original-small`, `unbiased-small`) are also available.

It fits the guardrails and content-moderation space: teams use it to flag harmful user content or to screen the inputs and outputs of LLM-powered apps. The authors note the models are intended for research and to assist human moderators, not as a sole automated judge.

What it does

  • Pretrained models that score text across multiple categories including toxicity, insults, threats, obscenity, and identity attacks
  • Three model families: `original`, `unbiased` (reduced identity bias), and `multilingual` (7 languages)
  • Lightweight `original-small` and `unbiased-small` ALBERT-based variants for smaller footprints
  • Simple `predict()` API that accepts a single string or a list of strings
  • Runs on CPU by default; an optional `device` argument accepts any `torch.device` input, for example to target a GPU
  • Built on Hugging Face Transformers and PyTorch Lightning

Getting started

Install the package from PyPI, then load a model and call `predict()` on your text.

Install Detoxify

Install the library with pip.

```bash
pip install detoxify
```

Score some text

Load a model by name and pass a string or a list of strings to `predict()`. Each model returns a dictionary of category scores.

```python
from detoxify import Detoxify

# each model takes in either a string or a list of strings
results = Detoxify('original').predict('example text')

results = Detoxify('unbiased').predict(['example text 1', 'example text 2'])
```
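For list input, each category in the returned dictionary maps to a list of scores, one per input string. The sketch below regroups that shape into one record per text, which is often easier to log or store. The helper name `rows_from_batch` is hypothetical, the sample numbers are made up rather than real model output, and the exact set of category keys depends on the model you load.

```python
from typing import Dict, List


def rows_from_batch(texts: List[str], results: Dict[str, List[float]]) -> List[dict]:
    """Regroup a category -> list-of-scores mapping into one dict per input text."""
    rows = []
    for i, text in enumerate(texts):
        row = {"text": text}
        for category, scores in results.items():
            row[category] = scores[i]
        rows.append(row)
    return rows


# Made-up scores shaped like a batch result; not real model output.
texts = ["example text 1", "example text 2"]
sample_results = {"toxicity": [0.01, 0.02], "insult": [0.03, 0.04]}

rows = rows_from_batch(texts, sample_results)
for row in rows:
    print(row)
```

With Detoxify installed, the same helper should work on a real batch, for example `rows_from_batch(texts, Detoxify('unbiased').predict(texts))`.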

Choose a device (optional)

Models default to CPU. Pass a `device` argument to allocate the model elsewhere; it accepts any `torch.device` input.

```python
model = Detoxify('original', device='cuda')
```
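Hard-coding `'cuda'` fails on machines without a GPU. A minimal sketch of choosing the device at runtime, assuming PyTorch's `torch.cuda.is_available()` check is an acceptable signal; the helper name `pick_device` is hypothetical and not part of Detoxify.

```python
def pick_device() -> str:
    """Return "cuda" when PyTorch reports a usable GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this environment; fall back to the CPU name.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(device)

# With Detoxify installed, the result can be passed straight through:
# model = Detoxify('original', device=device)
```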

Display results as a table (optional)

Install pandas to print the per-category scores in a readable frame.

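```python
import pandas as pd

# input_text is the list of strings that was passed to predict()
input_text = ['example text 1', 'example text 2']
print(pd.DataFrame(results, index=input_text).round(5))
```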

Commands and code are distilled from the project's own documentation; always check the official repo for the latest.

When to use it

  • Flagging toxic, abusive, or hateful user comments so human moderators can review them faster
  • Screening inputs and outputs of LLM-powered applications for harmful content
  • Scoring multilingual user content across the 7 languages the multilingual model supports
  • Research on toxicity detection and bias, or fine-tuning on a domain-specific dataset
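The flagging and screening use cases above can be sketched as a small gate that routes text to human review when any category score crosses a threshold. Everything named here is an assumption for illustration: the helpers `flagged_categories` and `needs_review`, the `stub_scorer` stand-in with made-up scores, and the 0.5 cut-off, which would need tuning on your own data.

```python
from typing import Callable, Dict, List

# Any callable that maps a string to category scores, e.g. a model's predict method.
Scorer = Callable[[str], Dict[str, float]]


def flagged_categories(scores: Dict[str, float], threshold: float = 0.5) -> List[str]:
    """Return the categories whose score meets or exceeds the threshold, sorted by name."""
    return sorted(name for name, value in scores.items() if value >= threshold)


def needs_review(text: str, scorer: Scorer, threshold: float = 0.5) -> bool:
    """True when any category crosses the threshold, so a human should look at the text."""
    return bool(flagged_categories(scorer(text), threshold))


def stub_scorer(text: str) -> Dict[str, float]:
    """Stand-in for a real model: made-up scores keyed on a marker word."""
    if "BAD" in text:
        return {"toxicity": 0.9, "insult": 0.7, "threat": 0.1}
    return {"toxicity": 0.01, "insult": 0.01, "threat": 0.01}


print(needs_review("hello there", stub_scorer))
print(needs_review("BAD message", stub_scorer))
print(flagged_categories(stub_scorer("BAD message")))
```

To try this with a real model, replace the stub with a loaded model's `predict` method, for example `needs_review(text, Detoxify('original').predict)`. Keeping the scorer as a parameter lets the gate be tested without downloading model weights, and the flagged text still goes to a moderator rather than being rejected automatically.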

How Detoxify compares

Detoxify alongside other open-source guardrails & security tools AI/TLDR tracks, ranked by GitHub stars.

| Tool | Stars | What it does |
| --- | --- | --- |
| Microsoft Presidio | ★ 9.3k | A framework for detecting, redacting, masking, and anonymizing personal data (PII) in text, images, and structured data using NER models, regex, and rule-based recognizers. |
| Guardrails AI | ★ 7k | A Python framework that wraps LLM calls with composable input/output validators (from the Guardrails Hub) to check structure, type, and safety risks before responses reach users. |
| NeMo Guardrails | ★ 6.5k | NVIDIA's toolkit for adding programmable rails to LLM chat apps, using the Colang language to control dialog flow and block jailbreaks, prompt injection, and off-topic answers. |
| GLiNER | ★ 3.3k | A small zero-shot named-entity recognition model that can extract arbitrary entity types from text and is widely used as a PII detection backend, including inside Presidio. |
| LLM Guard | ★ 3.1k | A security toolkit from Protect AI with 35+ input and output scanners that sanitize prompts and responses for prompt injection, toxicity, PII leakage, and harmful content. |
| Rebuff | ★ 1.5k | A prompt injection detector that combines heuristics, an LLM-based classifier, a vector store of past attacks, and canary tokens to catch attempts to subvert an LLM application. |
| Detoxify | ★ 1.3k | Score text for toxicity, insults, threats, and hate speech with pretrained transformer models. |
| Vigil | ★ 482 | A Python library and REST API that scans LLM prompts and responses with YARA rules, transformer classifiers, and vector similarity to flag prompt injections and jailbreaks. |