AI/TLDR

Microsoft Presidio

Detect and de-identify PII in text, images, and structured data

Overview

Microsoft Presidio is an open-source SDK for finding and removing personal data (PII) such as names, phone numbers, credit card numbers, locations, and social security numbers. It splits the job into two main packages: an Analyzer that identifies where PII appears in a piece of text, and an Anonymizer that replaces, masks, or redacts those spans. Separate packages handle redacting PII in images (including DICOM medical images) and in structured data.

It is aimed at teams that need to keep sensitive data out of logs, datasets, or model prompts. Detection combines Named Entity Recognition (NER) models, regular expressions, rule-based logic, and checksums, and you can add your own custom recognizers when the built-in ones do not cover an entity you care about.
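To make the idea of combining a regular expression with a checksum concrete, here is a minimal standard-library sketch. It is not Presidio's implementation: the pattern, the function names, and the choice of the Luhn checksum for card numbers are assumptions made for illustration. The regular expression proposes candidates and the checksum discards those that cannot be valid card numbers.

```python
import re

# Loose candidate pattern: 13-19 digits, optionally separated by a space or hyphen.
CARD_PATTERN = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_card_numbers(text: str) -> list:
    """Return (start, end, matched_text) for candidates that pass the checksum."""
    hits = []
    for m in CARD_PATTERN.finditer(text):
        digits = re.sub(r"[ -]", "", m.group())
        if luhn_valid(digits):
            hits.append((m.start(), m.end(), m.group()))
    return hits
```

The checksum step is what keeps a pattern this loose from flagging every long run of digits, such as order numbers or timestamps.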

As a guardrails and PII-security tool, Presidio fits into data pipelines and pre-processing steps. It runs from Python or PySpark and can be deployed with Docker or Kubernetes. The maintainers caution that automated detection is not guaranteed to find every piece of sensitive data, so it should be paired with other safeguards.

What it does

  • Predefined and custom PII recognizers built on NER, regular expressions, rule-based logic, and checksums, with context awareness in multiple languages
  • Anonymizer module to replace, mask, redact, or otherwise transform detected PII spans using configurable operators
  • Image redaction for standard image formats and DICOM medical images
  • Support for structured and tabular data through the presidio-structured package
  • Options to connect to external PII detection models instead of, or alongside, the built-in ones
  • Multiple deployment paths: Python or PySpark workloads, Docker containers, and Kubernetes
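The operator idea from the list above can be sketched in plain Python. This is a simplified model under stated assumptions, not Presidio's code: the helper names (replace_with, mask, redact, apply_operators) and the tuple layout for spans are hypothetical. Each entity type maps to a function that transforms the matched text.

```python
from typing import Callable, Dict, List, Tuple

Span = Tuple[str, int, int]  # (entity_type, start, end), end exclusive


def replace_with(new_value: str) -> Callable[[str], str]:
    """Operator that swaps the matched text for a fixed value."""
    return lambda original: new_value


def mask(char: str = "*", keep_last: int = 0) -> Callable[[str], str]:
    """Operator that hides all but the last keep_last characters."""
    def _mask(original: str) -> str:
        hidden = max(len(original) - keep_last, 0)
        return char * hidden + original[hidden:]
    return _mask


def redact() -> Callable[[str], str]:
    """Operator that removes the matched text entirely."""
    return lambda original: ""


def apply_operators(text: str, spans: List[Span],
                    operators: Dict[str, Callable[[str], str]]) -> str:
    # Work right to left so earlier offsets stay valid after each substitution.
    for entity_type, start, end in sorted(spans, key=lambda s: s[1], reverse=True):
        operator = operators.get(entity_type)
        if operator is None:
            continue  # entity types without an operator are left untouched
        text = text[:start] + operator(text[start:end]) + text[end:]
    return text
```

Processing spans from the end of the string backwards is the design choice that matters here: a replacement that changes the text length would otherwise shift every later offset.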

Getting started

Presidio is distributed as separate pip packages. Install the analyzer and anonymizer, download a spaCy model for NER, then detect and de-identify PII in a string.

Install the packages and NLP model

Install the analyzer and anonymizer, then download the default spaCy English model used for named-entity recognition.

```bash
pip install presidio_analyzer
pip install presidio_anonymizer
python -m spacy download en_core_web_lg
```

Detect PII with the Analyzer

Create an AnalyzerEngine and call analyze() with the text, the entities to look for, and the language.

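```python
from presidio_analyzer import AnalyzerEngine

# Build the engine with its default recognizers and NLP model
analyzer = AnalyzerEngine()

# Limit detection to phone numbers in English text
results = analyzer.analyze(
    text="My phone number is 212-555-5555",
    entities=["PHONE_NUMBER"],
    language="en",
)
print(results)
```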
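Once results come back, a common next step is to look at what was matched. The sketch below assumes each result carries an entity type, start and end character offsets, and a confidence score, which is the shape the RecognizerResult constructor in this guide suggests. The Detection class is a hypothetical stand-in so the example runs without Presidio installed, and the scores are made-up values.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    """Hypothetical stand-in for one analyzer result."""
    entity_type: str
    start: int
    end: int
    score: float


def matched_text(text, detections, min_score=0.5):
    """Return (entity_type, substring) pairs for detections at or above min_score."""
    return [
        (d.entity_type, text[d.start:d.end])
        for d in detections
        if d.score >= min_score
    ]


text = "My phone number is 212-555-5555"
detections = [
    Detection("PHONE_NUMBER", 19, 31, 0.75),  # illustrative score
    Detection("DATE_TIME", 19, 22, 0.10),     # low-confidence overlap, filtered out
]
pairs = matched_text(text, detections)
```

Slicing the original text with the start and end offsets is a quick way to confirm a detection covers the span you expect before anonymizing it.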

Anonymize the detected spans

Pass detection results to the AnonymizerEngine and supply operators that decide how each entity type is transformed.

```python
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import RecognizerResult, OperatorConfig

engine = AnonymizerEngine()

# The spans are supplied by hand here; start and end are character
# offsets into text ("Bond" is 11-15, "James Bond" is 17-27).
result = engine.anonymize(
    text="My name is Bond, James Bond",
    analyzer_results=[
        RecognizerResult(entity_type="PERSON", start=11, end=15, score=0.8),
        RecognizerResult(entity_type="PERSON", start=17, end=27, score=0.8),
    ],
    # Replace every PERSON span with the literal string "BIP"
    operators={"PERSON": OperatorConfig("replace", {"new_value": "BIP"})},
)

print(result)
```
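The example above hard-codes character offsets, which are easy to get wrong when counted by hand. As a small sketch, the hypothetical helper below (find_spans is not part of Presidio) derives the offsets from the substrings themselves, so they can be checked before being passed on as start and end values.

```python
def find_spans(text, needles):
    """Return (start, end) for each needle, searching left to right without overlap."""
    spans = []
    cursor = 0
    for needle in needles:
        start = text.index(needle, cursor)  # raises ValueError if the needle is missing
        end = start + len(needle)           # end is exclusive, like a Python slice
        spans.append((start, end))
        cursor = end                        # continue searching after this match
    return spans


text = "My name is Bond, James Bond"
spans = find_spans(text, ["Bond", "James Bond"])
```

For this sentence the helper yields the same two offset pairs used in the anonymizer call: 11 to 15 for "Bond" and 17 to 27 for "James Bond".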

Commands and code are distilled from the project's own documentation; always check the official repo for the latest.

When to use it

  • Scrub PII from application logs, support transcripts, or datasets before they are stored or shared
  • Pre-process prompts and documents to remove personal data before sending them to an LLM
  • Redact names, IDs, and other identifiers from images and DICOM medical scans
  • De-identify columns in structured or tabular data as part of a data pipeline
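For the log-scrubbing use case, one possible integration point is a logging filter that rewrites each record before any handler sees it. The sketch below uses only the standard library, and a simple e-mail regular expression stands in for a real detect-and-anonymize call. The ScrubFilter class name and the "<EMAIL>" placeholder are assumptions for illustration.

```python
import logging
import re

# Simplified e-mail pattern; a real pipeline would call a PII detector here.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


class ScrubFilter(logging.Filter):
    """Replace e-mail addresses in log messages before handlers see them."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Render the message with its arguments first, then scrub the result.
        record.msg = EMAIL.sub("<EMAIL>", record.getMessage())
        record.args = None  # arguments are already merged into msg
        return True  # keep the record; only its text has changed


logger = logging.getLogger("app")
logger.addFilter(ScrubFilter())
```

Attaching the filter to the logger, rather than scrubbing at each call site, means every message sent through that logger is scrubbed, including ones added later.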

How Microsoft Presidio compares

Microsoft Presidio alongside the other open-source guardrails and security tools that AI/TLDR tracks, ranked by GitHub stars.

| Tool | Stars | What it does |
| --- | --- | --- |
| Microsoft Presidio | ★ 9.3k | Detect and de-identify PII in text, images, and structured data |
| Guardrails AI | ★ 7k | A Python framework that wraps LLM calls with composable input/output validators (from the Guardrails Hub) to check structure, type, and safety risks before responses reach users. |
| NeMo Guardrails | ★ 6.5k | NVIDIA's toolkit for adding programmable rails to LLM chat apps, using the Colang language to control dialog flow and block jailbreaks, prompt injection, and off-topic answers. |
| GLiNER | ★ 3.3k | A small zero-shot named-entity recognition model that can extract arbitrary entity types from text and is widely used as a PII detection backend, including inside Presidio. |
| LLM Guard | ★ 3.1k | A security toolkit from Protect AI with 35+ input and output scanners that sanitize prompts and responses for prompt injection, toxicity, PII leakage, and harmful content. |
| Rebuff | ★ 1.5k | A prompt injection detector that combines heuristics, an LLM-based classifier, a vector store of past attacks, and canary tokens to catch attempts to subvert an LLM application. |
| Detoxify | ★ 1.3k | Pretrained transformer models from Unitary that score text for toxicity, insults, threats, and hate speech, often used to moderate LLM inputs and outputs. |
| Vigil | ★ 482 | A Python library and REST API that scans LLM prompts and responses with YARA rules, transformer classifiers, and vector similarity to flag prompt injections and jailbreaks. |