Microsoft Presidio

Detect and de-identify PII in text, images, and structured data

GitHub: github.com/microsoft/presidio (★ 9.3k)
Docs: microsoft.github.io/presidio

Overview

Microsoft Presidio is an open-source SDK for finding and removing personally identifiable information (PII) such as names, phone numbers, credit card numbers, locations, and social security numbers. It splits the job into two main packages: an Analyzer that identifies where PII appears in a piece of text, and an Anonymizer that replaces, masks, or redacts those spans. Separate packages handle redacting PII in images (including DICOM medical images) and in structured data.

It is aimed at teams that need to keep sensitive data out of logs, datasets, or model prompts. Detection combines Named Entity Recognition (NER) models, regular expressions, rule-based logic, and checksums, and you can add your own custom recognizers when the built-in ones do not cover an entity you care about.
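To make the "regular expression plus checksum" idea concrete, here is a minimal, library-free sketch of a credit card detector. It is not Presidio's implementation: the pattern and the names `CARD_PATTERN`, `luhn_valid`, and `find_card_numbers` are assumptions made up for illustration. The regex proposes candidates, and the Luhn checksum discards digit runs that only look like card numbers.

```python
import re

# Hypothetical pattern: 13 to 19 digits, optionally separated by spaces or hyphens.
CARD_PATTERN = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    # Walk from the rightmost digit and double every second digit.
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_card_numbers(text: str) -> list[tuple[int, int, str]]:
    """Return (start, end, matched_text) for candidates that pass the checksum."""
    hits = []
    for m in CARD_PATTERN.finditer(text):
        digits = re.sub(r"\D", "", m.group())  # strip separators before checking
        if luhn_valid(digits):
            hits.append((m.start(), m.end(), m.group()))
    return hits


sample = "Card 4111 1111 1111 1111 was charged; ref 1234 5678 9012 3456."
print(find_card_numbers(sample))
# [(5, 24, '4111 1111 1111 1111')]
```

The second number in the sample matches the regex but fails the checksum, so only the first is reported. This is why a checksum step reduces false positives compared with pattern matching alone.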

As a guardrails and PII-security tool, Presidio fits into data pipelines and pre-processing steps. It runs from Python or PySpark and can be deployed with Docker or Kubernetes. The maintainers note that automated detection is not guaranteed to find every piece of sensitive data, so it should be paired with other safeguards.

What it does

Predefined and custom PII recognizers built on NER, regular expressions, rule-based logic, and checksums, with context awareness in multiple languages
Anonymizer module to replace, mask, redact, or otherwise transform detected PII spans using configurable operators
Image redaction for standard image formats and DICOM medical images
Support for structured and tabular data through the presidio-structured package
Options to connect to external PII detection models instead of, or alongside, the built-in ones
Multiple deployment paths: Python or PySpark workloads, Docker containers, and Kubernetes
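The "context awareness" mentioned in the first item can be pictured as a score adjustment: a pattern match is trusted more when related words appear nearby. The sketch below is a plain-Python illustration of that idea, not Presidio's actual logic. The `Hit` class, the word list, the window size, and the score values are all hypothetical and are not Presidio defaults.

```python
import re
from dataclasses import dataclass


@dataclass
class Hit:
    entity_type: str
    start: int
    end: int
    score: float


# Hypothetical values chosen for illustration only.
PHONE_PATTERN = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")
CONTEXT_WORDS = {"phone", "call", "mobile", "tel"}
BASE_SCORE = 0.5
CONTEXT_BOOST = 0.25
WINDOW = 30  # number of preceding characters to inspect


def find_phones(text: str) -> list[Hit]:
    """Score each pattern match, boosting it when a context word precedes it."""
    hits = []
    for m in PHONE_PATTERN.finditer(text):
        before = text[max(0, m.start() - WINDOW):m.start()].lower()
        words = set(re.findall(r"[a-z]+", before))
        score = BASE_SCORE
        if words & CONTEXT_WORDS:
            score = min(1.0, BASE_SCORE + CONTEXT_BOOST)
        hits.append(Hit("PHONE_NUMBER", m.start(), m.end(), score))
    return hits


print(find_phones("My phone number is 212-555-5555")[0].score)  # 0.75
print(find_phones("Order 212-555-5555 shipped")[0].score)       # 0.5
```

A downstream step can then apply a score threshold, keeping the contextual match while treating the bare digit pattern as uncertain.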

Getting started

Presidio is distributed as separate pip packages. Install the analyzer and anonymizer, download a spaCy model for NER, then detect and de-identify PII in a string.

Install the packages and NLP model

Both packages install from pip. The last command downloads en_core_web_lg, the default spaCy English model used for named-entity recognition.

bash

pip install presidio_analyzer
pip install presidio_anonymizer
python -m spacy download en_core_web_lg

Detect PII with the Analyzer

Create an AnalyzerEngine and call analyze() with the text, the entities to look for, and the language.

python

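from presidio_analyzer import AnalyzerEngine

# Set up the engine; this loads the default NLP engine and the predefined recognizers
analyzer = AnalyzerEngine()

# Look only for phone numbers in English text
results = analyzer.analyze(
    text="My phone number is 212-555-5555",
    entities=["PHONE_NUMBER"],
    language="en",
)

# Each result carries the entity type, start/end offsets, and a confidence score
print(results)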

Anonymize the detected spans

Pass detection results to the AnonymizerEngine and supply operators that decide how each entity type is transformed.

python

from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import RecognizerResult, OperatorConfig

engine = AnonymizerEngine()

result = engine.anonymize(
    text="My name is Bond, James Bond",
    # Spans are character offsets into the text: [11, 15) is "Bond", [17, 27) is "James Bond"
    analyzer_results=[
        RecognizerResult(entity_type="PERSON", start=11, end=15, score=0.8),
        RecognizerResult(entity_type="PERSON", start=17, end=27, score=0.8),
    ],
    # Replace every PERSON span with the literal string "BIP"
    operators={"PERSON": OperatorConfig("replace", {"new_value": "BIP"})},
)

print(result)
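
To show what an operator conceptually does with those spans, here is a small library-free sketch. It is an illustration under stated assumptions, not how the AnonymizerEngine is implemented: `Span`, `replace_with`, `mask_with`, and `apply_operators` are hypothetical names, and the spans are assumed not to overlap.

```python
from typing import Callable, NamedTuple


class Span(NamedTuple):
    entity_type: str
    start: int
    end: int


Operator = Callable[[str], str]


def replace_with(new_value: str) -> Operator:
    """Operator that swaps the detected text for a fixed value."""
    return lambda original: new_value


def mask_with(char: str) -> Operator:
    """Operator that hides each character of the detected text."""
    return lambda original: char * len(original)


def apply_operators(text: str, spans: list[Span], operators: dict[str, Operator]) -> str:
    """Apply the operator registered for each span's entity type."""
    # Work from the last span to the first so earlier offsets stay valid
    # even when a replacement changes the length of the text.
    for span in sorted(spans, key=lambda s: s.start, reverse=True):
        op = operators.get(span.entity_type)
        if op is None:
            continue  # no operator configured for this entity type
        text = text[:span.start] + op(text[span.start:span.end]) + text[span.end:]
    return text


text = "My name is Bond, James Bond"
spans = [Span("PERSON", 11, 15), Span("PERSON", 17, 27)]

print(apply_operators(text, spans, {"PERSON": replace_with("BIP")}))
# My name is BIP, BIP
print(apply_operators(text, spans, {"PERSON": mask_with("*")}))
# My name is ****, **********
```

Processing spans right to left is the key design choice here: replacing "Bond" with "BIP" shortens the string, which would shift every later offset if the spans were handled in reading order.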

Commands and code are distilled from the project's own documentation; always check the official repo for the latest version.

When to use it

Scrub PII from application logs, support transcripts, or datasets before they are stored or shared
Pre-process prompts and documents to remove personal data before sending them to an LLM
Redact names, IDs, and other identifiers from images and DICOM medical scans
De-identify columns in structured or tabular data as part of a data pipeline
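
For the log-scrubbing case, one possible integration point is Python's standard `logging.Filter`, which can rewrite a record before a handler emits it. The sketch below assumes a hypothetical `scrub` function that stands in for a detect-then-anonymize call; a single email regex is used here only so the example runs without extra dependencies.

```python
import io
import logging
import re

# Stand-in detector: one simple email pattern, for illustration only.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def scrub(text: str) -> str:
    """Hypothetical placeholder for a real detect-and-anonymize step."""
    return EMAIL.sub("<EMAIL_ADDRESS>", text)


class ScrubFilter(logging.Filter):
    """Rewrite each log record so the emitted message contains no raw PII."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Render the message with its arguments first, then scrub the final string.
        record.msg = scrub(record.getMessage())
        record.args = ()
        return True  # keep the record, now sanitized


stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.addFilter(ScrubFilter())

logger = logging.getLogger("scrub_demo")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False  # keep the demo output confined to our handler

logger.info("Password reset requested by %s", "jane.doe@example.com")
print(stream.getvalue().strip())
# Password reset requested by <EMAIL_ADDRESS>
```

Attaching the filter to the handler means every message routed through that handler is scrubbed in one place, instead of relying on each call site to sanitize its own arguments.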

How Microsoft Presidio compares

Microsoft Presidio alongside other open-source guardrails and security tools that AI/TLDR tracks, ranked by GitHub stars.

Tool	Stars	What it does
Microsoft Presidio	★ 9.3k	Detect and de-identify PII in text, images, and structured data
Guardrails AI	★ 7k	A Python framework that wraps LLM calls with composable input/output validators (from the Guardrails Hub) to check structure, type, and safety risks before responses reach users.
NeMo Guardrails	★ 6.5k	NVIDIA's toolkit for adding programmable rails to LLM chat apps, using the Colang language to control dialog flow and block jailbreaks, prompt injection, and off-topic answers.
GLiNER	★ 3.3k	A small zero-shot named-entity recognition model that can extract arbitrary entity types from text and is widely used as a PII detection backend, including inside Presidio.
LLM Guard	★ 3.1k	A security toolkit from Protect AI with 35+ input and output scanners that sanitize prompts and responses for prompt injection, toxicity, PII leakage, and harmful content.
Rebuff	★ 1.5k	A prompt injection detector that combines heuristics, an LLM-based classifier, a vector store of past attacks, and canary tokens to catch attempts to subvert an LLM application.
Detoxify	★ 1.3k	Pretrained transformer models from Unitary that score text for toxicity, insults, threats, and hate speech, often used to moderate LLM inputs and outputs.
Vigil	★ 482	A Python library and REST API that scans LLM prompts and responses with YARA rules, transformer classifiers, and vector similarity to flag prompt injections and jailbreaks.
