In plain English
The Hugging Face Hub is a public library with over one million model checkpoints. Each model is a repository: a folder of weight files, a config, and a card explaining what the model does. Downloading a model means pulling those files to your machine and then asking a Python library to load them into memory.

A good analogy is buying music before streaming existed: you downloaded an .mp3 to your hard drive and played it with a media player. The Hub is the music store, the weight files are the .mp3 tracks, and the Transformers library is the media player that knows how to decode them. Once you have the files locally, the model works offline: no API key, no internet needed.
There are two main paths to running a model locally. The first — and most beginner-friendly — is the Transformers pipeline: install the library, pick a model ID from the Hub, and call pipeline(). The library downloads the weights automatically, caches them on disk, and gives you a Python function you can call with plain text. The second path is downloading a GGUF file and running it with llama.cpp or a wrapper like Ollama — this is more efficient on CPU and is covered toward the end of this article.
Why it matters for builders
Running a model locally instead of calling a cloud API has three concrete advantages for a developer.
- Privacy — your prompts and outputs never leave your machine. That matters for code completions, customer data, or anything under an NDA.
- Cost — once the model is downloaded, inference is free. No per-token billing, no rate limits, no surprise invoices.
- Control — you choose the exact model version, can swap it out, fine-tune it, and quantize it without waiting on a vendor's API roadmap.
The Hugging Face ecosystem also lowers the barrier dramatically compared to a few years ago. Before the Transformers library existed, running a research model meant reading the original paper's code, tracking down incompatible dependencies, and hacking together your own inference loop. Now a two-line pipeline() call handles tokenization, batching, device placement, and decoding for you.
How the download and loading process works
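When you call from_pretrained('org/model-name') or pipeline('text-generation', model='org/model-name'), the Transformers library follows a chain of steps to get the model running: it resolves the repo ID on the Hub, downloads any files that are missing from the local cache, and then loads the config, tokenizer, and weights into memory.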
The cache lives at ~/.cache/huggingface/hub by default (on Windows: C:\Users\<you>\.cache\huggingface\hub). Inside each repo's folder, files are stored by content hash, so a file that is unchanged between revisions of that repo is never downloaded twice. You can redirect the cache by setting the HF_HOME environment variable before running your script.
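The lookup order for the cache location can be sketched in a few lines. This is an illustration rather than the library's own code: the helper names hub_cache_dir and repo_cache_folder are made up for this example, and the sketch assumes the documented defaults (HF_HUB_CACHE takes priority, otherwise the cache is the hub folder inside HF_HOME, which itself falls back to ~/.cache/huggingface).

```python
import os
from pathlib import Path


def hub_cache_dir(env=None) -> Path:
    """Return the directory where Hub downloads are cached (illustrative helper)."""
    env = os.environ if env is None else env
    # An explicit HF_HUB_CACHE wins over everything else.
    if env.get("HF_HUB_CACHE"):
        return Path(env["HF_HUB_CACHE"]).expanduser()
    # Otherwise the cache is the "hub" folder inside HF_HOME.
    if env.get("HF_HOME"):
        home = Path(env["HF_HOME"]).expanduser()
    else:
        base = env.get("XDG_CACHE_HOME") or "~/.cache"
        home = Path(base).expanduser() / "huggingface"
    return home / "hub"


def repo_cache_folder(repo_id: str) -> str:
    """Folder name a model repo gets inside the cache directory."""
    return "models--" + repo_id.replace("/", "--")


print(hub_cache_dir())  # where downloads go on this machine
print(repo_cache_folder("HuggingFaceTB/SmolLM2-1.7B-Instruct"))
# -> models--HuggingFaceTB--SmolLM2-1.7B-Instruct
```

Passing a plain dictionary instead of the real environment, for example hub_cache_dir({"HF_HOME": "/data/hf"}), is a convenient way to check what a given setting would do before exporting it.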
The weight files are usually in safetensors format (.safetensors), a safer and faster alternative to the older pickle-based .bin format. For very large models the weights are split across multiple shards (e.g. model-00001-of-00004.safetensors). The library handles reassembling shards transparently.
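If a download is interrupted, a quick sanity check is to confirm that every shard is present. The sketch below assumes the file-name pattern from the example above (a five-digit index, then the total); the missing_shards helper is hypothetical and only inspects file names, whereas the library itself relies on an index file in the repo that maps tensors to shards.

```python
import re

# Matches names such as model-00001-of-00004.safetensors
SHARD_PATTERN = re.compile(
    r"^(?P<stem>.+)-(?P<index>\d{5})-of-(?P<total>\d{5})\.safetensors$"
)


def missing_shards(filenames):
    """Return the shard numbers that are absent from a list of file names."""
    present = set()
    total = 0
    for name in filenames:
        match = SHARD_PATTERN.match(name)
        if match:
            present.add(int(match.group("index")))
            total = max(total, int(match.group("total")))
    return [i for i in range(1, total + 1) if i not in present]


files = [
    "config.json",
    "model-00001-of-00004.safetensors",
    "model-00003-of-00004.safetensors",
]
print(missing_shards(files))  # -> [2, 4]
```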
A model's repo ID normally has the form <owner>/<model-name>, for example HuggingFaceTB/SmolLM2-1.7B-Instruct or microsoft/Phi-3-mini-4k-instruct (a few older models, such as gpt2, also resolve without an owner prefix). You find this ID in the URL of the Hub page and pass it verbatim to any Transformers call.
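When the starting point is a link someone pasted rather than an ID, the ID can be recovered from the URL path. This is a minimal sketch using only the standard library; repo_id_from_url is a hypothetical helper, and it assumes a model page URL whose first two path segments are the owner and the model name.

```python
from urllib.parse import urlparse


def repo_id_from_url(url: str) -> str:
    """Extract '<owner>/<model-name>' from a Hub model page URL."""
    segments = [part for part in urlparse(url).path.split("/") if part]
    if len(segments) < 2:
        raise ValueError(f"no <owner>/<model-name> found in {url!r}")
    # Anything after the first two segments (tree/main, blob/..., etc.) is ignored.
    return "/".join(segments[:2])


print(repo_id_from_url("https://huggingface.co/microsoft/Phi-3-mini-4k-instruct/tree/main"))
# -> microsoft/Phi-3-mini-4k-instruct
```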
Step-by-step: run a model with Transformers
1. Install the library
Transformers requires Python 3.10 or later. Install it with PyTorch bundled in one command:
pip install "transformers[torch]"
If you prefer uv (faster resolver): uv pip install "transformers[torch]". The PyTorch build that pip selects by default also runs on a machine without a GPU; you do not need any special flag.
2. Pick a beginner-friendly model
For your first run, choose a model small enough to load on a laptop. Two solid options:
| Model ID | Parameters | RAM needed | Good for |
|---|---|---|---|
| HuggingFaceTB/SmolLM2-1.7B-Instruct | 1.7 B | ~4 GB | Fast CPU test, chat |
| microsoft/Phi-3-mini-4k-instruct | 3.8 B | ~8 GB | Stronger reasoning on CPU/GPU |
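The RAM column can be sanity-checked with simple arithmetic: parameters times bytes per parameter. The sketch below is a rough estimate under stated assumptions (weights only, 1 GB taken as 10^9 bytes); the function name and the dtype table are made up for this example, and real usage is higher because activations and the generation cache also need memory.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"float32": 4.0, "float16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(params_billions: float, dtype: str = "float16") -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billions * BYTES_PER_PARAM[dtype]


for name, size in [("SmolLM2-1.7B-Instruct", 1.7), ("Phi-3-mini-4k-instruct", 3.8)]:
    fp16 = weight_memory_gb(size, "float16")
    fp32 = weight_memory_gb(size, "float32")
    print(f"{name}: {fp16:.1f} GB in float16, {fp32:.1f} GB in float32")
# SmolLM2-1.7B-Instruct: 3.4 GB in float16, 6.8 GB in float32
# Phi-3-mini-4k-instruct: 7.6 GB in float16, 15.2 GB in float32
```

The float16 figures line up with the table once a little headroom is added; loading in float32 roughly doubles them.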
3. Run with the pipeline API
The pipeline function is the quickest entry point. It bundles the model, tokenizer, and a sensible inference loop into one callable object.
from transformers import pipeline
# First call downloads ~3-8 GB to ~/.cache/huggingface/hub
generator = pipeline(
    "text-generation",
    model="HuggingFaceTB/SmolLM2-1.7B-Instruct",
)
result = generator(
    "Explain what a neural network is in one sentence:",
    max_new_tokens=80,
    do_sample=True,
    temperature=0.7,
)
print(result[0]["generated_text"])
On the first run you will see a progress bar as each file is downloaded. Subsequent runs skip the download and load from the local cache immediately.
4. Use the lower-level API for more control
When you need to inspect logits, apply custom generation constraints, or batch multiple prompts, load the model and tokenizer separately with AutoModelForCausalLM and AutoTokenizer:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
model_id = "HuggingFaceTB/SmolLM2-1.7B-Instruct"
# float16 halves memory on a CUDA GPU; stay on float32 when running on CPU
dtype = torch.float16 if torch.cuda.is_available() else torch.float32
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=dtype,
    device_map="auto",  # places layers on GPU if available, else CPU
)
prompt = "What is the difference between RAM and storage?"
# Tokenize the prompt and move the tensors to the model's device
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# Gradients are not needed for inference
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=120)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
Alternative: download with the CLI or run a GGUF file
The Transformers path is convenient but downloads the unquantized weights, typically stored in 16-bit precision, which come to roughly 6-15 GB for a 3-7 B model. If you are on a CPU-only machine or simply want smaller files, the GGUF path is the better choice: GGUF files are usually quantized (compressed) copies of the same model that fit in much less RAM and run faster on CPU.
Download any file with huggingface-cli
The huggingface-cli tool (also callable as hf) comes bundled with the huggingface_hub package (which is installed automatically with Transformers). You can use it to download individual files or whole repos:
# Download an entire model repo into the default cache
huggingface-cli download HuggingFaceTB/SmolLM2-1.7B-Instruct
# Download a specific GGUF file to a local folder
huggingface-cli download microsoft/Phi-3-mini-4k-instruct-gguf \
  Phi-3-mini-4k-instruct-q4.gguf \
  --local-dir ~/models/
The --local-dir flag saves files to a plain folder of your choice instead of the hashed cache. That makes them easier to find and pass to other tools.
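The same single-file download can be done from Python with hf_hub_download from the huggingface_hub package. The wrapper below is a sketch, not an official recipe: fetch_file and its downloader parameter are hypothetical names introduced here so the function can be exercised without network access, and the import is deferred until a real download is requested.

```python
from pathlib import Path


def fetch_file(repo_id, filename, local_dir, downloader=None) -> Path:
    """Download one file from a Hub repo into local_dir and return its path."""
    if downloader is None:
        # Imported lazily so this module still loads without the package installed.
        from huggingface_hub import hf_hub_download
        downloader = hf_hub_download
    target = Path(local_dir).expanduser()
    target.mkdir(parents=True, exist_ok=True)
    path = downloader(repo_id=repo_id, filename=filename, local_dir=str(target))
    return Path(path)


# Example call (needs network access and the huggingface_hub package):
# gguf = fetch_file(
#     "microsoft/Phi-3-mini-4k-instruct-gguf",
#     "Phi-3-mini-4k-instruct-q4.gguf",
#     "~/models",
# )
```

The returned path can be handed straight to any tool that takes a local model file.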
Run a GGUF file directly with llama.cpp
Once you have a .gguf file you can run inference with llama-cli, part of the llama.cpp project. The -hf flag can even download from the Hub for you in one step:
# Install llama.cpp (macOS/Linux via Homebrew)
brew install llama.cpp
# Download AND run directly from the Hub
llama-cli -hf microsoft/Phi-3-mini-4k-instruct-gguf \
  --prompt "Explain transformers in two sentences" \
  -n 120
Or, if you already downloaded the file with huggingface-cli:
llama-cli -m ~/models/Phi-3-mini-4k-instruct-q4.gguf \
  --prompt "Explain transformers in two sentences" \
  -n 120
Transformers vs GGUF: when to pick each
| Factor | Transformers pipeline | GGUF + llama.cpp |
|---|---|---|
| Ease of setup | pip install, two lines of Python | Separate install (brew or build) |
| Memory usage | Full precision: 2x-4x larger | Quantized: much smaller (Q4 ~4 GB for 7B) |
| CPU speed | Slower (general-purpose PyTorch kernels, unquantized weights) | Faster (C++ kernels tuned for quantized weights) |
| GPU support | CUDA, MPS (Apple), ROCm | CUDA, Metal, ROCm |
| Python integration | Native — easy to embed in scripts | Needs subprocess or llama-cpp-python binding |
| Model variety | All Hub model types (vision, audio, etc.) | Mainly text-generation models |
Common pitfalls and how to avoid them
- Out-of-memory crash: if Python exits with a SIGKILL or RuntimeError: CUDA out of memory, the model is too large for your hardware. Try a smaller model, load the model in 4-bit through a BitsAndBytesConfig (requires the bitsandbytes library), or switch to the GGUF path.
- Slow load on every run: Transformers caches files after the first download but still deserializes weights from disk on every run. Expect 15-60 seconds to load a 3 B model even from cache. This is normal.
- Wrong model class: calling AutoModelForCausalLM on a model that is not a causal language model (an encoder-only model like BERT, for example) can raise an error or load a model that generates poor text. Always check the model card on the Hub to confirm the architecture and intended task.
- trust_remote_code=True risk: some models require this flag, which executes arbitrary Python code from the repo. Only set it for models from publishers you trust; never set it blindly on an unknown model.
- Cache filling your disk: large models accumulate quickly. Run huggingface-cli delete-cache (or hf cache delete) to interactively choose which cached repos to remove.
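Before deleting anything, it helps to see which cached repos take the most space. The sketch below walks the cache with the standard library only; the helper names are made up for this example, and it assumes the default cache path given earlier. The huggingface_hub package also offers scan_cache_dir for a more detailed report.

```python
import os
from pathlib import Path


def cached_repo_sizes(cache_dir):
    """Map each repo folder in the cache to its size on disk in bytes."""
    sizes = {}
    for repo in Path(cache_dir).expanduser().iterdir():
        if not repo.is_dir():
            continue
        total = 0
        for root, _dirs, files in os.walk(repo):
            for name in files:
                path = os.path.join(root, name)
                # Snapshot entries are usually symlinks into blobs/;
                # skipping them counts each file once.
                if not os.path.islink(path):
                    total += os.path.getsize(path)
        sizes[repo.name] = total
    return sizes


def largest_first(sizes):
    """Sort (folder, bytes) pairs from largest to smallest."""
    return sorted(sizes.items(), key=lambda item: item[1], reverse=True)


cache = Path("~/.cache/huggingface/hub").expanduser()
if cache.is_dir():
    for folder, size in largest_first(cached_repo_sizes(cache)):
        print(f"{size / 1e9:6.2f} GB  {folder}")
```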
Going deeper
Once the basic flow works, several directions open up.
Quantization for lower RAM
The Transformers library integrates with bitsandbytes for 4-bit and 8-bit quantization. Passing a BitsAndBytesConfig with load_in_4bit=True to from_pretrained() cuts the weight memory to roughly a quarter of its float16 size, making a 7 B model fit in ~6 GB of VRAM. The same config class lets you tune the quantization type (NF4 vs FP4) and whether to use double quantization for extra savings.
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    quantization_config=quant_config,
    device_map="auto",
)
Chat templates and instruction formatting
Instruct-tuned models (typically those with -Instruct, -it, or -chat in their name) expect prompts wrapped in a special chat template: a structured format that tells the model which text is a system instruction, which is the user's message, and where the assistant's reply begins. A raw string still runs, but the model tends to ramble or ignore the instruction. Use the tokenizer's apply_chat_template method instead:
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is a tokenizer?"},
]
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
Exploring the Hub programmatically
The huggingface_hub Python library lets you search and inspect models from code. list_models(filter='text-generation', sort='downloads', limit=10) returns the ten most-downloaded text generation models. You can filter by pipeline tag, library, language, and licence. This is useful when building tools that need to present model choices to a user.
Offline and air-gapped environments
Set the environment variable HF_HUB_OFFLINE=1 (or TRANSFORMERS_OFFLINE=1) before running your script. Transformers will then load only from the local cache and raise an error instead of attempting a network request. This is essential for production deployments where outbound HTTP is blocked. Pre-download all required model repos during your CI/CD build step and ship the cache directory alongside your application.
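The variable can also be set from inside the script, as long as that happens before the Hugging Face libraries are imported. The sketch below shows one way to arrange this; load_offline is a hypothetical helper, and it adds local_files_only=True as a second guard so a missing file fails fast instead of triggering a download.

```python
import os

# Set this before transformers or huggingface_hub are imported:
# the libraries read the variable when they are first loaded.
os.environ["HF_HUB_OFFLINE"] = "1"


def load_offline(model_id: str):
    """Load a tokenizer and model from the local cache only."""
    # Imported here so the environment variable above is already in place.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id, local_files_only=True)
    model = AutoModelForCausalLM.from_pretrained(model_id, local_files_only=True)
    return tokenizer, model


# Example call (the repo must already be in the cache):
# tokenizer, model = load_offline("HuggingFaceTB/SmolLM2-1.7B-Instruct")
```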
FAQ
How much disk space do I need to download a model?
It depends on the model. A 1.7 B parameter model like SmolLM2 takes roughly 3-4 GB in float16 (half precision). A 7 B model takes 13-14 GB. GGUF Q4 quantized versions are roughly a quarter to a third of those sizes (about 4 GB for a 7 B model). Check the file listing on the model's Hub page before downloading.
Do I need a Hugging Face account to download models?
Most models are publicly available without authentication. A small number of models, including some Meta Llama variants, require you to accept a licence agreement on the Hub and then set an HF_TOKEN environment variable. You will see a clear error message if a token is required.
Why is inference so slow on my laptop?
CPU inference in PyTorch is significantly slower than GPU inference. For a 3 B model on a modern laptop CPU, expect 5-20 tokens per second, compared to 30-100+ tokens per second on a mid-range GPU. Switching to a GGUF file with llama.cpp typically doubles CPU speed. Choosing a smaller model (1-2 B) also helps a lot.
What is the difference between a base model and an instruct model?
A base model is trained to predict the next token in raw text — it will continue your prompt as if it were a document. An instruct model is fine-tuned further to follow instructions and hold a conversation. For practical use (Q&A, summarization, coding help), always pick the -Instruct or -chat variant of a model.
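The difference also shows up in the text each variant is given. The sketch below renders a conversation in a ChatML-style layout, which is one common convention; the render_chatml function is hand-written for illustration and is an assumption, not any particular model's template. In real code the tokenizer's apply_chat_template method produces the correct format for the model at hand.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Render messages in a ChatML-style layout (illustrative only)."""
    text = ""
    for message in messages:
        text += f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
    if add_generation_prompt:
        # Leaves the prompt open at the point where the assistant should answer.
        text += "<|im_start|>assistant\n"
    return text


# A base model simply continues raw text:
base_prompt = "A tokenizer is"

# An instruct model expects the conversation wrapped in role markers:
chat_prompt = render_chatml([{"role": "user", "content": "What is a tokenizer?"}])
print(chat_prompt)
```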
How do I update a cached model to a newer version?
By default, Transformers checks the Hub for updates each time you load a model while online. If the repo has been updated, the new files are downloaded. You can force a re-download by passing force_download=True to from_pretrained(). To pin to a specific commit, pass the full commit hash as the revision argument.
Can I run the model completely offline after the first download?
Yes. After the first successful download the weights are in ~/.cache/huggingface/hub. Set HF_HUB_OFFLINE=1 in your environment and Transformers will never attempt a network request. This is the recommended approach for production or air-gapped deployments.