In Plain English
When a company calls a model open weights, it means they have published the trained parameter file — the billions of numbers that encode everything the model learned. You can download those numbers, load them on your own hardware, run inference, and fine-tune the model. What you are not getting is the recipe: the training data, the data-processing code, or the full training pipeline.
Think of it like a professional bakery that hands you a finished loaf of bread. You can eat it, toast it, slice it however you like, and use it as a base for your own sandwich creations. You cannot, however, walk into their kitchen and watch them bake it — and you certainly do not get the flour blend, fermentation schedule, or oven temperature they used. The bread is real. The recipe is not yours.
Open source is a much older and stricter concept that comes from software. According to the Open Source Initiative (OSI), open source software must give anyone the freedom to use it for any purpose, study how it works, modify it, and share modified versions freely. For traditional code, this meant publishing the source files under an approved license. When AI labs began releasing model weights and calling them 'open source', critics pushed back hard — because weights alone do not give you source in any meaningful sense.
Why It Matters
The open-weights vs open-source distinction is not academic hair-splitting. It has concrete consequences for builders, researchers, lawyers, and the broader AI ecosystem.
Reproducibility
If a lab releases only weights, independent researchers cannot verify the training process, reproduce the model, or check whether safety claims are accurate. True open source — with training code and data — lets anyone audit and replicate the result. This matters for both academic credibility and safety research.
Legal risk
Open-weights licenses are not standardized. Meta's Llama models, for example, come with a custom Community License that bars services with more than 700 million monthly active users from using the models unless Meta grants them a separate license and, in Llama 4, includes a clause restricting rights for companies based in the European Union. The OSI explicitly states that Meta's Llama 3.1 Community License fails 'freedom 0' (the freedom to use for any purpose), meaning it does not qualify as open source under the OSI's definition. If your product relies on Llama, you are subject to Meta's terms, not a standard OSI-approved license.
Long-term portability
When a weights-only model is released, you depend entirely on the releasing lab for future versions, patches, and continued permission to use it. A genuinely open source model — where the community can reproduce training — is not beholden to any single organization. AllenAI's OLMo family was built precisely to demonstrate this: they released the full Dolma training corpus, training code, evaluation suite, and multiple checkpoints, so any team can reproduce or continue training.
Competitive moat
For labs, releasing weights while keeping training data and methodology private preserves a significant competitive advantage. The lab gets the marketing benefit of 'openness' and the community goodwill that comes with it, while retaining the secret sauce that would let a competitor replicate its model quality cheaply. Critics call this openwashing: using open-sounding language while keeping the important parts closed.
How It Works: The Four Layers of AI Openness
It helps to think of AI model openness as a stack with four distinct layers: the weights, the inference code, the training code, and the training data. Each layer is harder to obtain than the one below it and reveals more about how the model was built.
An open-weights release typically covers the bottom two layers: the weights file and enough inference code to load the model. An open source AI release, by the OSI's October 2024 definition (OSAID v1.0), must additionally provide the training code and sufficient information about the training data that a skilled practitioner could recreate a substantially equivalent system.
The OSI's four freedoms for open source AI mirror the classic software freedoms: (1) use the system for any purpose, (2) study how it works, (3) modify it, and (4) share it with or without modifications. Weights-only releases typically satisfy freedom 1 partially and freedom 2 barely — you can run the model and probe its behavior, but you cannot study why it behaves that way at the training level.
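The layered test described above can be sketched as a small checklist. This is a minimal illustration, not the OSI's own tooling: the `Release` class, its field names, and the `classify` function are hypothetical, and a real OSAID assessment involves legal review that a handful of booleans cannot capture.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Release:
    """Hypothetical summary of what a model release actually publishes."""
    weights: bool               # trained parameter file is downloadable
    inference_code: bool        # enough code to load and run the model
    training_code: bool         # training pipeline is published
    data_information: bool      # data, or enough detail to rebuild an equivalent set
    osi_approved_license: bool  # e.g. Apache 2.0 or MIT, with no added use restrictions


def classify(release: Release) -> str:
    """Map a release onto the rough categories used in this article."""
    if not release.weights:
        return "closed"
    full_stack = (
        release.inference_code
        and release.training_code
        and release.data_information
    )
    if full_stack and release.osi_approved_license:
        return "open source AI (candidate)"
    return "open weights"


# A weights-plus-inference-code release under a permissive license
# still lands in the open-weights bucket.
permissive_weights_only = Release(True, True, False, False, True)
print(classify(permissive_weights_only))  # open weights
```

The result is labeled "candidate" on purpose: passing this checklist only means nothing obvious is missing, not that the release has been judged compliant.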
Where the license lives
For traditional open source software, the license governs the source code files. For AI models, the license governs the weights file — typically a large binary (anything from a few gigabytes for a small model to hundreds of gigabytes for a frontier one). The license determines what you can and cannot do: whether you can use the model commercially, redistribute it, fine-tune it and sell the result, or share it with a modified name.
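The file sizes quoted above follow from simple arithmetic: parameter count times bytes per parameter. Here is a rough sketch, assuming 2 bytes per parameter (16-bit floats, a common storage format) and ignoring file metadata. The function name and the example parameter counts are illustrative and not taken from any specific release.

```python
def weights_size_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate size of a weights file in gigabytes (1 GB = 10**9 bytes)."""
    return num_params * bytes_per_param / 1e9


for label, n in [("7B", 7e9), ("70B", 70e9), ("400B", 400e9)]:
    print(f"{label}: ~{weights_size_gb(n):.0f} GB")
# 7B: ~14 GB
# 70B: ~140 GB
# 400B: ~800 GB
```

Quantized releases that store fewer bytes per parameter shrink these numbers proportionally, which is why the same model can appear in several very different file sizes.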
Real-World Examples: Who Is Actually Open?
Looking at real releases makes the spectrum concrete. Models vary enormously in how open they actually are.
| Model / family | Weights | Training code | Training data | License type | OSI open source? |
|---|---|---|---|---|---|
| Meta Llama 3 / 4 | Yes | No | No | Custom Community License | No |
| Mistral Large 3 | Yes | No | No | Apache 2.0 | No (data withheld) |
| DeepSeek R1 / V3 | Yes | Partial | No | MIT | No (data withheld) |
| AllenAI OLMo 2 | Yes | Yes | Yes (Dolma) | Apache 2.0 | Closest to yes |
| EleutherAI GPT-NeoX | Yes | Yes | Yes (Pile) | Apache 2.0 | Closest to yes |
| OpenAI GPT-4o | No | No | No | Proprietary API | No |
Mistral's Apache 2.0 releases are often cited as a genuine bright spot in the open-weights space because Apache 2.0 is a recognized OSI-approved license with no user-count caps. However, Mistral still does not publish training data, so the models fall short of the full OSAID definition. They are among the most permissively licensed open-weights models available, but they are still open-weights, not open source.
DeepSeek caused significant buzz in early 2025 when it released R1 and subsequent models under the MIT license — one of the most permissive licenses in existence. This makes DeepSeek weights highly usable commercially, but again, the training data was never released, so full reproducibility is impossible.
AllenAI's OLMo 2 remains the closest thing the ecosystem has to a genuinely open source LLM at reasonable scale. The Dolma corpus is publicly available, the training code is on GitHub, and multiple checkpoints were published, allowing other researchers to pick up training where AllenAI left off. As AllenAI put it on release: 'a truly open LLM — training data and all.'
Openwashing: Why Labs Use the Terms Loosely
'Open source' carries enormous goodwill in the developer community. It signals transparency, trust, community involvement, and freedom from vendor lock-in. For AI labs, attaching that label to a weights release — even when it does not meet any formal definition — is a powerful marketing move. This practice has been called openwashing, by analogy with greenwashing in environmental marketing.
Mark Zuckerberg repeatedly described Meta's Llama 4 as 'open source' in his public statements and blog posts, even as the license contains geographic restrictions for EU-based companies and a hard ban on deployment by services with over 700 million users. The OSI published a pointed response stating that the Llama Community License 'is still not Open Source' — the same verdict they issued for Llama 2 and Llama 3.
This is not unique to Meta. Stability AI, Falcon, Qwen, and many other prominent model families have been marketed as open source while operating under custom licenses that restrict commercial or competitive use. The inconsistency matters because developers building products on these models may discover mid-project that the license does not allow what they intended.
A working vocabulary
- Open weights — weights (and usually inference code) published, often under a custom license with restrictions. The dominant form of AI 'openness' today.
- Open source AI (OSI definition) — weights + training code + sufficient data information, under an OSI-approved license. Rare at frontier scale.
- Open access — model available to download and use, possibly free, but with significant license restrictions. Sometimes used as a neutral middle term.
- Openwashing — marketing a restricted-license model as 'open source' to gain community trust without the transparency.
Going Deeper
For practitioners building serious products or running research pipelines on open models, the following nuances are worth understanding.
The training-data problem
Even when a lab wants to release training data, it may not be legally able to. Frontier models are trained on internet-scale corpora that include copyrighted books, news articles, code, and other material scraped with varying degrees of consent. Publishing that data verbatim would expose the lab to massive copyright liability. This is why the OSAID v1.0 includes a workaround clause: if full data distribution is not legally possible, the lab can instead release detailed metadata, data cards, processing scripts, and enough information that a third party could assemble a substantially equivalent dataset. Most labs have not done even this.
Dual licensing and tiered access
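Some providers use dual licensing: a permissive license for small-scale and academic use, and a commercial license with fees or negotiated terms above a usage threshold. Mistral has applied a version of this pattern to some models in its lineup, such as Codestral, which shipped under the Mistral AI Non-Production License with commercial use requiring a separate agreement. Not every model from a given lab follows the same scheme: Mistral-7B-Instruct-v0.3, for instance, is plain Apache 2.0. Before building a commercial product, confirm which license applies to the exact model and version you use, and whether your revenue or user count crosses any tier boundary in it.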
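Once the thresholds have been read out of the license text, a tier check like this is easy to encode. The sketch below is illustrative only: `LicenseTier`, `tiers_crossed`, and the revenue limit are hypothetical, while the 700 million monthly-active-user figure is the Llama Community License threshold mentioned earlier. It is no substitute for legal review.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class LicenseTier:
    """One threshold copied by hand from a license (hypothetical structure)."""
    name: str
    max_monthly_active_users: Optional[int] = None  # None means no cap
    max_annual_revenue_usd: Optional[int] = None    # None means no cap


def tiers_crossed(
    tiers: List[LicenseTier],
    monthly_active_users: int,
    annual_revenue_usd: int,
) -> List[str]:
    """Return the names of tiers whose limits the product exceeds."""
    crossed = []
    for tier in tiers:
        over_users = (
            tier.max_monthly_active_users is not None
            and monthly_active_users > tier.max_monthly_active_users
        )
        over_revenue = (
            tier.max_annual_revenue_usd is not None
            and annual_revenue_usd > tier.max_annual_revenue_usd
        )
        if over_users or over_revenue:
            crossed.append(tier.name)
    return crossed


# The 700M MAU cap is the Llama figure cited in this article;
# the revenue tier is a made-up example.
tiers = [
    LicenseTier("llama-style MAU cap", max_monthly_active_users=700_000_000),
    LicenseTier("example revenue tier", max_annual_revenue_usd=1_000_000),
]
print(tiers_crossed(tiers, monthly_active_users=50_000, annual_revenue_usd=2_500_000))
# ['example revenue tier']
```

Keeping the thresholds as data rather than hard-coded conditions makes it straightforward to re-run the check when a license is revised or when the product's usage grows.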
Derivative models and the 'same license' trap
Several open-weights licenses require that any fine-tuned or derived model be released under the same license, and sometimes with the same restrictions. This copyleft-style clause means that if you fine-tune Llama for a commercial product and later want to release your fine-tuned weights, you must do so under the Llama Community License — not under Apache 2.0 or MIT. Check for these clauses in sections typically labeled 'Use Restrictions' or 'Distribution of Derivative Works'.
The OSI definition is still evolving
The OSAID v1.0 was released in October 2024 and immediately drew criticism from both directions: some felt it was too strict (making genuine open source AI practically impossible at frontier scale), others felt it was too lenient (the training-data workaround lets labs off the hook too easily). Expect the definition to be revised as the ecosystem matures. Follow opensource.org/ai for current status.
Why it matters for AI safety
Open-weights releases create a tension in AI safety that does not exist with fully closed APIs. Once weights are public, they cannot be recalled: a capability flaw, a safety regression, or a misuse pathway discovered after release cannot be patched the way a cloud API can be. This has led to renewed policy debate about whether the most capable models should remain API-only, or whether some form of staged release (trusted researchers first, then broader access) provides a better balance between openness and risk management.
FAQ
Is Llama 3 open source?
No, despite Meta's marketing. Llama 3 is released under a custom Community License that prohibits use by services with more than 700 million monthly active users and contains other restrictions the OSI explicitly says fail the open source definition. It is an open-weights model, not an open source one.
What does it mean when a model is released under MIT or Apache 2.0?
MIT and Apache 2.0 are OSI-approved licenses that allow commercial use, modification, redistribution, and sublicensing with minimal restrictions. When a model's weights are released under these licenses, you have very broad rights to use and build on them. However, the license on the weights alone does not make the model 'open source' in the full sense — training data would also need to be available.
What is the Open Source AI Definition (OSAID) and who wrote it?
The OSAID v1.0 was published by the Open Source Initiative (OSI) in October 2024 after a two-year global consultation process. It defines the four freedoms an AI system must provide — use, study, modify, share — and specifies that meeting them requires not just weights, but training code and sufficient data transparency to recreate a substantially equivalent system.
Are there any large language models that are genuinely open source?
AllenAI's OLMo 2 family is the most prominent example: weights, training code, and the full Dolma training corpus are all public under Apache 2.0. EleutherAI's GPT-NeoX and Pythia models also come close. These models are smaller than frontier models, which reflects the difficulty of releasing frontier-scale training data legally and competitively.
Can I fine-tune and sell a product built on open-weights models?
It depends entirely on the specific license. Apache 2.0 and MIT weights (the licenses used for many Mistral and DeepSeek releases) generally allow this, subject only to light conditions such as preserving copyright and license notices. Llama Community License weights require you to keep any derivative under the same license and ban use by large-scale services. Always read the license file, not the marketing copy.
Why do labs release weights but not training data?
Two main reasons: competitive advantage and legal risk. Training data is the primary moat — knowing the exact mix and curation process lets competitors replicate quality cheaply. Legal risk is also real, since frontier training sets include copyrighted material scraped from the web, and publishing them verbatim could expose the lab to copyright claims.