Do LLM Providers Train on Your API Data? Privacy Explained

Get a straight answer on what happens to data you send an LLM API — retention windows, training policies, and the enterprise switches that change them.

Beginner | 12 min read | Updated 2026-06-12

In plain English

Every time your code sends a prompt to an LLM API — a customer question, a document to summarize, a chunk of source code to review — that text travels to a server you don't control. A perfectly natural question follows: does the provider read it, keep it, or feed it into the next version of the model?

[Diagram: Do LLM Providers Train on Your API Data (langtail.com)]

The short answer for the three major providers (OpenAI, Anthropic, Google) is the same: API data is not used for training by default. The API is a commercial product aimed at developers and businesses, and those customers would walk away if their proprietary data quietly trained public models. The longer answer — how long data is retained, who can see it, and how to get even stronger guarantees — is what this article unpacks.

A useful analogy is a law firm that does contract review for hundreds of clients. The firm keeps copies of each brief for a short time in case a dispute arises, but it would never paste a client's contract into advice it sells to a competitor. The 'keep a copy briefly for safety' part maps to data retention. The 'never use it to help a competitor' part maps to the no-training guarantee. Both are real and distinct — you can have one without the other.

Why it matters

For a hobbyist querying the API with toy prompts, data privacy is a minor concern. But most real applications send sensitive data: internal support tickets, legal documents, medical notes, personal user messages, proprietary source code. Mishandling any of that can expose a company to regulatory liability (GDPR, HIPAA, CCPA), reputational harm, or contractual breach with customers.

Privacy questions also directly affect what you can build. A healthcare startup cannot use an AI provider whose default terms allow training on submitted content; a bank cannot send transaction records somewhere with a 30-day retention window unless the data is properly anonymized first. Understanding exactly what each provider does — and what enterprise controls are available — determines whether a given provider is even in scope for a regulated use case.

The two separate questions you need to ask

Will my data be used to train future models? Training changes the model permanently and potentially exposes your data's patterns to other users.
How long is my data retained? Even if not used for training, a copy sitting on a server for 30 days is a different risk profile than a copy held only for milliseconds.

These two questions have independent answers. A provider can promise zero training but still keep logs for 30 days for abuse monitoring. Or it can offer a 'zero data retention' (ZDR) tier that eliminates even the temporary log. Conflating the two leads to under-protecting data or over-spending on unnecessary enterprise plans.

How it works: what actually happens to your request

When your application sends a prompt to an LLM API, the request passes through several layers on the provider's infrastructure before a response comes back. Understanding those layers explains why any retention exists at all, even when training is off.

// What happens to your API request

Your app: HTTPS request + API key
Auth & rate-limit layer: key validated; request queued
Safety & abuse monitor: content screened; may log briefly
Inference engine: model runs; response generated
Temporary log (retention window): request + response stored N days
Your app receives response: log deleted after retention window

Why retention exists at all

Providers keep logs for a narrow, specific window, typically 7 to 30 days and in some cases longer (the table below lists roughly 55 days for Google), for three purposes: detecting abuse (spam, policy violations, jailbreak attempts), diagnosing service outages, and complying with legal preservation orders. This is the same reason every web server keeps access logs. It is not the same as training. The log is not fed into gradient updates; it is simply a timestamped record that gets deleted on schedule.
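The "deleted on schedule" behaviour can be sketched as a simple expiry check. This is an illustrative model only, not any provider's actual logging system: `LogRecord`, `is_expired`, and `purge` are hypothetical names, and the 30-day window is just one of the example values from the text.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class LogRecord:
    """Hypothetical abuse-monitoring log entry (not a real provider schema)."""
    request_id: str
    created_at: datetime


def is_expired(record: LogRecord, retention_days: int, now: datetime) -> bool:
    """A record is expired once its age reaches the retention window."""
    return now - record.created_at >= timedelta(days=retention_days)


def purge(records: list[LogRecord], retention_days: int, now: datetime) -> list[LogRecord]:
    """Return only the records that are still inside the retention window."""
    return [r for r in records if not is_expired(r, retention_days, now)]


now = datetime(2026, 6, 12, tzinfo=timezone.utc)
records = [
    LogRecord("req-old", now - timedelta(days=31)),
    LogRecord("req-new", now - timedelta(days=2)),
]

kept = purge(records, retention_days=30, now=now)
print([r.request_id for r in kept])
```

In this toy model, setting `retention_days=0` purges everything immediately, which is roughly the idea behind a zero-retention tier.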

How training data is actually collected

Model training is an entirely separate pipeline from serving. Training requires huge, curated datasets, weeks of compute, and deliberate engineering decisions about what goes in. Paid API requests are not siphoned into training pipelines automatically; doing so would create immediate legal liability and destroy enterprise trust. Where submitted content is used for training, it happens under separately stated terms: an explicit, auditable opt-in on the paid API, or the published terms of a free or consumer tier. It is not a silent default on paid API traffic.

Provider-by-provider breakdown

Each of the three major providers has published clear, auditable policies for API customers. The table below captures the defaults as of mid-2026. Always verify against the provider's current documentation before making compliance decisions.

Provider	Trains on API data by default?	Default retention window	Zero Data Retention option
OpenAI API	No — opt-in only	30 days (abuse monitoring)	Yes — enterprise, requires approval
Anthropic API	No — explicitly excluded	7 days (reduced Sep 2025)	Yes — enterprise ZDR agreement
Google Vertex AI	No — paid tier never trains	~55 days (safety monitoring)	Yes — eligible endpoints, contractual
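For a vendor-risk review it can help to hold the table as data and filter it against your own limits. The sketch below is one possible shape: `ProviderPolicy`, `POLICIES`, and `within_retention_limit` are hypothetical names, the values are transcribed from the table above, and Google's "~55 days" is stored as the approximate integer 55.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProviderPolicy:
    name: str
    trains_by_default: bool
    default_retention_days: int  # approximate for Google ("~55 days")
    zdr_available: bool


# Transcribed from the table above (mid-2026 defaults); verify before relying on them.
POLICIES = [
    ProviderPolicy("OpenAI API", False, 30, True),
    ProviderPolicy("Anthropic API", False, 7, True),
    ProviderPolicy("Google Vertex AI", False, 55, True),
]


def within_retention_limit(policies: list[ProviderPolicy], max_days: int) -> list[str]:
    """Names of providers that do not train by default and retain for at most max_days."""
    return [
        p.name
        for p in policies
        if not p.trains_by_default and p.default_retention_days <= max_days
    ]


print(within_retention_limit(POLICIES, max_days=30))
```

Training and retention are deliberately separate fields here, since the two questions have independent answers.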

OpenAI

OpenAI has not trained on API data by default since March 2023. That policy covers the API platform, ChatGPT Team, ChatGPT Enterprise, and ChatGPT for Business. If you use the Playground and voluntarily submit feedback, that feedback can be used for improvement — but only because you explicitly sent it. Standard API calls without feedback submission are not candidates for training. Abuse-monitoring logs are held for up to 30 days, then deleted. For organizations that need even that 30-day window eliminated, Zero Data Retention (ZDR) is available on eligible endpoints through an enterprise agreement with OpenAI's sales team — it is not a self-serve toggle.

Anthropic

Anthropic's commercial API terms explicitly exclude API inputs and outputs from model training. This applies to the direct API, Claude for Work (Team and Enterprise plans), and Claude accessed through Amazon Bedrock or Google Cloud's Vertex AI. In September 2025, Anthropic reduced its API log retention period from 30 days to 7 days — a meaningful reduction for customers with strict data-handling requirements. Enterprise customers with elevated needs can negotiate a ZDR agreement under which inputs and outputs are not stored beyond what is needed to screen for abuse.

Google (Vertex AI and Gemini API)

Google draws a firm line between its consumer-facing Gemini products and its paid API tiers. On the paid Gemini API and Vertex AI, Google does not use prompts or responses to improve its products. Data is retained for approximately 55 days for safety and abuse-detection purposes. The free Google AI Studio tier operates under different terms — Google does use submitted content to improve products in that context, which is why production applications should always use a paid API plan. Vertex AI supports a ZDR equivalent for eligible enterprise customers through contractual amendments to the Data Processing Addendum.

Enterprise controls and when you need them

For many teams, the default API policies — no training, short retention window — are sufficient. But regulated industries, enterprises with strict vendor-risk requirements, and apps processing highly sensitive categories of data often need additional contractual controls. Here is what is available and when each one matters.

Data Processing Addendum (DPA)

A DPA is a legal agreement that formalizes how a data processor (the AI provider) handles personal data on behalf of a controller (you). It is required by GDPR when you send EU personal data to a third-party processor. All three major providers offer DPAs. Signing one does not change the technical behavior of the API, but it creates enforceable contractual obligations and is typically required to pass a vendor-risk review.

Zero Data Retention (ZDR)

ZDR is the strongest available control: the provider does not log the request or response at all beyond what is needed to return the result. There is no 7-day or 30-day window to worry about — the data evaporates once the inference completes. ZDR is available from all three major providers but requires an enterprise agreement rather than a self-serve toggle. Not all endpoints qualify (fine-tuning, certain vision endpoints, and features that require caching state typically cannot offer ZDR).

Data residency

Some enterprises need to ensure data is processed and stored only within a specific geography (EU, US, etc.) to satisfy sovereignty laws. OpenAI expanded data residency options to business customers in 2025, and Google Cloud's Vertex AI offers regional endpoints as a standard feature. Anthropic supports EU data residency for qualifying enterprise contracts.

// Choosing the right privacy tier

Default API

No training (all 3 providers)
7 to 55 day retention log, depending on provider
No contract needed
Suitable for non-sensitive data

DPA / Enterprise Terms

Same defaults + legal obligations
GDPR / CCPA compliance
Requires agreement with provider
Required for EU personal data

Zero Data Retention

No logs after response returned
Strongest available guarantee
Enterprise agreement + approval
May be required for HIPAA-level sensitivity, alongside a BAA
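The three cards can be read as a small decision rule. The sketch below is a simplification under stated assumptions: the two boolean flags and the names `Tier` and `choose_tier` are hypothetical, and a real decision would also depend on legal review and on which endpoints you use.

```python
from enum import Enum


class Tier(Enum):
    DEFAULT_API = "Default API"
    DPA = "DPA / Enterprise Terms"
    ZDR = "Zero Data Retention"


def choose_tier(*, personal_data: bool, highly_sensitive: bool) -> Tier:
    """Map two hypothetical data flags to the tiers described above."""
    if highly_sensitive:
        # For example health records: the strongest tier is the starting point.
        return Tier.ZDR
    if personal_data:
        # Personal data under GDPR / CCPA: contractual obligations are needed.
        return Tier.DPA
    return Tier.DEFAULT_API


print(choose_tier(personal_data=False, highly_sensitive=False).value)
print(choose_tier(personal_data=True, highly_sensitive=False).value)
print(choose_tier(personal_data=True, highly_sensitive=True).value)
```

The sensitivity check comes first so that the most restrictive requirement wins when both flags are set.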

Practical steps before sending sensitive data

Knowing the policies is one thing; applying them to a real project is another. The following checklist covers the most common gaps teams miss when moving an LLM feature from prototype to production.

Identify the data category first. Publicly available text, internal knowledge-base articles, and medical records are entirely different risk tiers. The first needs no special handling; the third may need ZDR plus HIPAA BAA (Business Associate Agreement).
Use the paid API tier, not free tools. For any real application, only use a paid API plan. The free tier of Google AI Studio and the free consumer apps of all three providers have weaker defaults around training on your content.
Sign a DPA if you send EU personal data. GDPR requires a DPA with every processor that handles personal data. Most providers make this self-serve in the API dashboard or console.
Minimise what goes into the prompt. The safest data is data that never gets sent. Strip or pseudonymise personal identifiers before they reach the prompt when your use case allows it.
Check each provider's current policy directly. This article reflects mid-2026 policies. Providers update their terms; bookmark the official privacy centres (see Further Reading) and check before signing major contracts.
If ZDR is required, verify endpoint eligibility. Not every API endpoint qualifies. Confirm with your account team which specific models and features are covered before building your compliance case around ZDR.
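The "minimise what goes into the prompt" step can be sketched with a small pseudonymisation pass that runs before any text leaves your application. This is a minimal illustration, not a complete PII scrubber: the regex only catches simple email addresses, and `pseudonymise_emails`, `restore_emails`, and the salt value are hypothetical. A salted hash token is pseudonymisation rather than anonymisation, so the mapping must stay on your side.

```python
import hashlib
import re

# Deliberately simple pattern for illustration; it will miss unusual addresses.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def pseudonymise_emails(text: str, salt: str) -> tuple[str, dict[str, str]]:
    """Replace each email with a stable token; return the new text and a token-to-email map."""
    mapping: dict[str, str] = {}

    def _swap(match: re.Match) -> str:
        email = match.group(0)
        digest = hashlib.sha256((salt + email.lower()).encode("utf-8")).hexdigest()[:8]
        token = f"<EMAIL_{digest}>"
        mapping[token] = email  # kept locally, never sent to the provider
        return token

    return EMAIL_RE.sub(_swap, text), mapping


def restore_emails(text: str, mapping: dict[str, str]) -> str:
    """Put the original addresses back into text returned by the model."""
    for token, email in mapping.items():
        text = text.replace(token, email)
    return text


original = "Ticket from alice@example.com, cc bob@example.org. Reply to alice@example.com."
redacted, mapping = pseudonymise_emails(original, salt="demo-salt")
print(redacted)
```

Because the same address always maps to the same token, the model can still tell that two mentions refer to one person without ever seeing the address itself.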

Going deeper

Most developers never need to go further than the checklist above, but there are nuances that matter for legal teams, security architects, and anyone building in heavily regulated industries.

The difference between a processor and a controller

Under GDPR, the controller is the entity that decides why personal data is processed. The processor processes it on the controller's behalf. When you call the Anthropic API, Anthropic acts as a processor and you are the controller. This means Anthropic may only process the data as you instruct (to generate a response) and cannot repurpose it independently — including for training — without your consent. A signed DPA codifies this relationship. The same framework applies under many other privacy laws worldwide.

Inference memorisation vs. training

Even without training, frontier models can sometimes reproduce fragments of their training data verbatim, a phenomenon called memorisation. This is a property of the model's weights baked in at training time, entirely unrelated to your API calls. A separate risk exists at inference time: anything placed in the prompt can resurface in the output, so you should not expect an LLM to keep a secret reliably just because you told it to in a system prompt. Where your use case allows, keep sensitive values out of the prompt entirely: retrieve them at query time from a controlled data store and return them to the user without the model ever seeing them.
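One way to keep a sensitive value out of the prompt entirely is to have the model write a template with placeholders and let your own code fill in the values afterwards. The sketch below assumes a hypothetical `{{name}}` placeholder convention and uses a hard-coded string in place of a real model response; `fill_placeholders` and `customer_record` are illustrative names.

```python
import re

PLACEHOLDER_RE = re.compile(r"\{\{(\w+)\}\}")


def fill_placeholders(template: str, record: dict[str, str]) -> str:
    """Substitute {{key}} markers with values the model never saw."""

    def _lookup(match: re.Match) -> str:
        key = match.group(1)
        if key not in record:
            raise KeyError(f"no value for placeholder {key!r}")
        return record[key]

    return PLACEHOLDER_RE.sub(_lookup, template)


# Stand-in for a model response; no API call is made in this sketch.
model_output = "Hello {{first_name}}, your current balance is {{balance}}."

# Stand-in for a lookup in your own controlled data store.
customer_record = {"first_name": "Dana", "balance": "1,204.50 EUR"}

print(fill_placeholders(model_output, customer_record))
```

The substitution fails loudly on an unknown placeholder, so a malformed model response cannot silently reach the user with a raw marker in it.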

Fine-tuning and the data use question

If you upload examples for fine-tuning, you are explicitly handing a dataset to the provider so they can train a derived model for you. That data is used for training — because that is the entire point. Fine-tuning data is typically held for the duration of the fine-tuned model's life and deleted when you delete the model. Read the provider's fine-tuning-specific terms separately from the general API terms; they are not the same document.

Legal preservation orders

In June 2025 OpenAI disclosed that a court order required it to retain some consumer and API content that would otherwise have been deleted on schedule. This is a standard legal-hold mechanism used across all industries. It means that even ZDR is not an absolute guarantee: a court or government with legal jurisdiction over the provider can compel retention. For adversarial threat models at that level, the only reliable approach is a self-hosted or air-gapped model, which is a separate topic entirely.

FAQ

Does OpenAI train on API data?

No. Since March 2023, OpenAI's policy is that API inputs and outputs are not used for training by default. You would have to explicitly opt in (for example, by submitting feedback in the Playground) for your data to become a candidate for training. Abuse-monitoring logs are retained for up to 30 days, then deleted; that retention is not the same as training.

Does the Claude API use my data for training?

No. Anthropic's commercial API terms explicitly exclude API data from model training. This applies regardless of which plan you are on, and also covers access through Amazon Bedrock and Google Vertex AI. As of September 2025, API logs are held for 7 days for safety purposes and then deleted. The consumer opt-in training policy announced in August 2025 does not apply to the API.

Is it safe to send company data to an LLM API?

For non-sensitive business data, the default API policies of the major providers are generally adequate — no training, short retention window. For personally identifiable information subject to GDPR or CCPA, you need a signed DPA. For categories like health records, financial data, or attorney-client communications, you should evaluate whether Zero Data Retention is available and required. In all cases, minimising what you actually send in the prompt reduces risk.

What is zero data retention and which LLM providers offer it?

Zero data retention (ZDR) means the provider does not log your request or response beyond the milliseconds needed to compute and return the result. OpenAI, Anthropic, and Google Vertex AI all offer ZDR, but it is an enterprise-grade feature that requires a formal agreement and is not available for all endpoints. It is not a self-serve toggle you can flip in the dashboard.

Does the free tier of Gemini or Google AI Studio train on my prompts?

Yes — the free Google AI Studio tier does use submitted content to improve Google's products. Only the paid Gemini API and Vertex AI tiers come with a no-training guarantee. This is the main reason production applications should use a paid plan rather than the free web interface.

If I fine-tune a model on my data, is it kept?

Yes, deliberately — that is the point of fine-tuning. The dataset you upload is used to train a derived model version, and it is stored for the life of that model. Fine-tuning terms are separate from standard API terms. Review the provider's fine-tuning documentation and data-handling policy before uploading proprietary datasets.
