OpenAI · 2026-04-22 · notable

OpenAI Privacy Filter — Apache 2.0 PII Detection Model for On-Prem Data Pipelines

Rating: 3
Author: AI/TLDR

OpenAI releases an Apache 2.0 PII detection model: 1.5B params, 50M active, 128k context, detects 8 PII categories via BIOES token classification. Designed for on-prem data sanitization — no data leaves your stack.


A small, fast Apache 2.0 model from OpenAI for detecting and masking PII in text — designed to run entirely on-premises.

Key specs

Parameters	1.5B total, 50M active
Context window	128,000 tokens
GitHub stars	204
PII categories	8
Hugging Face likes	168

What is it?

OpenAI Privacy Filter is a token-classification model that identifies 8 types of personally identifiable information (names, private addresses, emails, phone numbers, account numbers, URLs, dates, and secrets) in long text documents. It outputs BIOES span labels (Begin, Inside, Outside, End, Single) and processes each sequence in a single forward pass. It is Apache 2.0 licensed and designed to run on-prem without sending data to any external API.
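
To make the BIOES scheme concrete, here is a minimal sketch of how per-token tags collapse into labeled spans. The tag names (NAME, EMAIL), the example tags, and the helper bioes_to_spans are illustrative assumptions, not taken from the model's actual label set or code.

```python
from typing import List, Tuple

def bioes_to_spans(tags: List[str]) -> List[Tuple[int, int, str]]:
    """Convert BIOES tags into (start, end_exclusive, category) token spans."""
    spans = []
    start = None   # token index where the currently open span began
    label = None   # category of the currently open span
    for i, tag in enumerate(tags):
        prefix, _, cat = tag.partition("-")
        if prefix == "S":
            # Single: a one-token span.
            spans.append((i, i + 1, cat))
            start, label = None, None
        elif prefix == "B":
            # Begin: open a new multi-token span.
            start, label = i, cat
        elif prefix == "E" and start is not None and cat == label:
            # End: close the open span.
            spans.append((start, i + 1, cat))
            start, label = None, None
        elif prefix == "I" and start is not None and cat == label:
            # Inside: the open span continues.
            continue
        else:
            # Outside, or a malformed tag: discard any open span.
            start, label = None, None
    return spans

# Hand-written example tags (hypothetical label names, not model output).
tokens = ["My", "name", "is", "Alice", "Smith", "at", "alice@example.com"]
tags = ["O", "O", "O", "B-NAME", "E-NAME", "O", "S-EMAIL"]
spans = bioes_to_spans(tags)
for start, end, cat in spans:
    print(cat, tokens[start:end])
```

The End and Single tags are what distinguish BIOES from plain BIO: every span boundary is marked explicitly, so a span never has to be closed by inference from the next token.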

How does it work?

The model has 1.5B total parameters, of which 50M are active during inference, which keeps latency low despite the large nominal size. A 128k-token context window lets it process long documents without chunking. Constrained Viterbi decoding keeps the BIOES labels coherent by ruling out invalid tag sequences, such as an Inside tag with no preceding Begin. The precision/recall tradeoff is configurable at runtime via a threshold parameter.
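
For readers unfamiliar with constrained decoding, the following is a generic sketch of Viterbi search over BIOES tags, not the model's own decoder. The per-token log-scores, the single NAME category, and the function names are made up for illustration. It shows a case where picking the top tag per token yields an invalid sequence, while the constrained search returns the best valid one.

```python
import math
from typing import Dict, List

NEG_INF = -math.inf

def prefix(tag: str) -> str:
    """Return the BIOES prefix of a tag, e.g. 'B' for 'B-NAME'."""
    return tag.split("-")[0]

def allowed(prev: str, nxt: str) -> bool:
    """True if tag `nxt` may follow tag `prev` under BIOES rules."""
    p_prefix, _, p_cat = prev.partition("-")
    n_prefix, _, n_cat = nxt.partition("-")
    if p_prefix in ("B", "I"):
        # An open span must continue or close within the same category.
        return n_prefix in ("I", "E") and n_cat == p_cat
    # After O, E, or S no span is open, so only O, B, or S may follow.
    return n_prefix in ("O", "B", "S")

def viterbi(scores: List[Dict[str, float]]) -> List[str]:
    """Highest-scoring valid tag path given per-token log-scores {tag: score}."""
    tags = list(scores[0])
    # A sequence cannot start in the middle of a span.
    best = {t: (scores[0][t] if prefix(t) in ("O", "B", "S") else NEG_INF)
            for t in tags}
    back = []  # back[i][t] = best predecessor of tag t at token i + 1
    for step in scores[1:]:
        new_best, pointers = {}, {}
        for t in tags:
            candidates = [(best[p], p) for p in tags if allowed(p, t)]
            score, prev = max(candidates, default=(NEG_INF, tags[0]))
            new_best[t] = score + step[t]
            pointers[t] = prev
        best = new_best
        back.append(pointers)
    # A sequence cannot end with a span still open.
    final = {t: s for t, s in best.items() if prefix(t) in ("O", "E", "S")}
    path = [max(final, key=final.get)]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]

# Made-up log-scores for three tokens and one hypothetical category.
scores = [
    {"O": -0.1, "B-NAME": -3.0, "I-NAME": -5.0, "E-NAME": -5.0, "S-NAME": -4.0},
    {"O": -3.0, "B-NAME": -1.0, "I-NAME": -0.5, "E-NAME": -4.0, "S-NAME": -3.0},
    {"O": -3.0, "B-NAME": -4.0, "I-NAME": -3.0, "E-NAME": -0.2, "S-NAME": -3.0},
]
greedy = [max(step, key=step.get) for step in scores]
decoded = viterbi(scores)
print("greedy: ", greedy)
print("viterbi:", decoded)
```

In this example the per-token argmax places an Inside tag directly after Outside, which is not a valid span. The constrained search swaps it for Begin at a small cost in score.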

Why does it matter?

Enterprises doing LLM fine-tuning or building RAG pipelines on sensitive data must strip PII before it touches a cloud model. On-prem PII detection has historically meant expensive proprietary tools or brittle regex heuristics. An Apache 2.0 model from a recognized lab that runs locally removes a real compliance barrier for teams working with healthcare, legal, or financial data.

Who is it for?

Data engineers and ML teams doing LLM fine-tuning or RAG on enterprise data with PII exposure constraints.

Try it

pip install transformers torch && python -c "from transformers import pipeline; p = pipeline('token-classification', model='openai/privacy-filter'); print(p('My name is Alice Smith'))"
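
Detection is only half of sanitization; the detected spans still have to be masked. Below is a rough sketch of that step. It assumes detections arrive as dicts with character offsets (start, end) and an entity_group label, which is the shape the transformers token-classification pipeline returns when called with an aggregation strategy. Whether this model's BIOES labels aggregate cleanly that way is not confirmed here, and the entities below are hand-written stand-ins with hypothetical label names, not real model output.

```python
from typing import Dict, List

def mask_pii(text: str, entities: List[Dict]) -> str:
    """Replace each detected character span with a [CATEGORY] placeholder."""
    out = text
    # Replace right to left so offsets of earlier spans stay valid.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        out = out[:ent["start"]] + f"[{ent['entity_group']}]" + out[ent["end"]:]
    return out

text = "My name is Alice Smith, email alice@example.com"
# Hand-written detections standing in for pipeline output (assumed format).
entities = [
    {"entity_group": "NAME", "start": 11, "end": 22},
    {"entity_group": "EMAIL", "start": 30, "end": 47},
]
masked = mask_pii(text, entities)
print(masked)
```

Keeping the category in the placeholder preserves some structure for downstream fine-tuning or retrieval, while the original values never leave the local process.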