AI/TLDR

How to Download and Run Your First Hugging Face Model

You will download a real model from the Hub and run your first generation in Python by the end of this article.

BEGINNER · 10 MIN READ · UPDATED 2026-06-12

In plain English

The Hugging Face Hub is a public library with over one million model checkpoints. Each model is a repository: a folder of weight files, a config, and a card explaining what the model does. Downloading a model means pulling those files to your machine and then asking a Python library to load them into memory.

A good analogy is buying music before streaming existed: you downloaded an .mp3 to your hard drive and played it with a media player. The Hub is the music store, the weight files are the .mp3 tracks, and the Transformers library is the media player that knows how to decode them. Once you have the files locally, the model works offline, with no API key and no internet connection needed.

There are two main paths to running a model locally. The first — and most beginner-friendly — is the Transformers pipeline: install the library, pick a model ID from the Hub, and call pipeline(). The library downloads the weights automatically, caches them on disk, and gives you a Python function you can call with plain text. The second path is downloading a GGUF file and running it with llama.cpp or a wrapper like Ollama — this is more efficient on CPU and is covered toward the end of this article.

Why it matters for builders

Running a model locally instead of calling a cloud API has three concrete advantages for a developer.

  • Privacy — your prompts and outputs never leave your machine. That matters for code completions, customer data, or anything under an NDA.
  • Cost — once the model is downloaded, inference is free. No per-token billing, no rate limits, no surprise invoices.
  • Control — you choose the exact model version, can swap it out, fine-tune it, and quantize it without waiting on a vendor's API roadmap.

The Hugging Face ecosystem also lowers the barrier dramatically compared to a few years ago. Before the Transformers library existed, running a research model meant reading the original paper's code, tracking down incompatible dependencies, and hacking together your own inference loop. Now a two-line pipeline() call handles tokenization, batching, device placement, and decoding for you.

How the download and loading process works

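When you call from_pretrained('org/model-name') or pipeline('text-generation', model='org/model-name'), the Transformers library follows a chain of steps to get the model running: it resolves the repo ID on the Hub, checks the local cache, downloads any missing files (config, tokenizer, weights), builds the architecture described by the config, loads the weights into it, and places the model on a device.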

The cache lives at ~/.cache/huggingface/hub by default (on Windows: C:\Users\<you>\.cache\huggingface\hub). Each repo gets its own folder, and the files inside are stored by content hash, so a file that is unchanged between two revisions of the same model is not downloaded twice. You can redirect the cache by setting the HF_HOME environment variable before running your script (the cache then lives in the hub subfolder of that directory).
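
The lookup for the cache location can be sketched in a few lines of standard-library Python. This is a simplified illustration, not the library's own code: it assumes only two environment variables matter (HF_HOME, plus the more specific HF_HUB_CACHE, which this article does not otherwise cover), and the helper names hub_cache_dir and repo_cache_folder are hypothetical. The models--<owner>--<name> folder naming reflects how repos are laid out inside the cache.

```python
import os
from pathlib import Path


def hub_cache_dir(env=None):
    """Return the directory where Hub downloads are cached (simplified)."""
    env = os.environ if env is None else env
    if env.get("HF_HUB_CACHE"):          # most specific override wins
        return Path(env["HF_HUB_CACHE"]).expanduser()
    if env.get("HF_HOME"):               # HF_HOME contains the hub/ subfolder
        return Path(env["HF_HOME"]).expanduser() / "hub"
    return Path.home() / ".cache" / "huggingface" / "hub"


def repo_cache_folder(repo_id, repo_type="model"):
    """Map 'owner/name' to the folder name used inside the cache."""
    return f"{repo_type}s--" + repo_id.replace("/", "--")


print(hub_cache_dir() / repo_cache_folder("HuggingFaceTB/SmolLM2-1.7B-Instruct"))
```

Running it prints the folder where the files for that repo would be stored on your machine, which is handy when you want to check whether a download already happened.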

The weight files are usually in safetensors format (.safetensors), a safer and faster alternative to the older pickle-based .bin format. For very large models the weights are split across multiple shards (e.g. model-00001-of-00004.safetensors). The library handles reassembling shards transparently.
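
Sharded checkpoints ship with a small JSON index (model.safetensors.index.json) that records which shard holds each tensor. The sketch below parses such an index with the standard library. The sample data is made up and heavily shortened for illustration (a real index lists hundreds of tensors), and shards_in_index is a hypothetical helper name.

```python
import json
from collections import defaultdict

# Hypothetical, shortened stand-in for a model.safetensors.index.json file
SAMPLE_INDEX = """
{
  "metadata": {"total_size": 7642159104},
  "weight_map": {
    "model.embed_tokens.weight": "model-00001-of-00002.safetensors",
    "model.layers.0.self_attn.q_proj.weight": "model-00001-of-00002.safetensors",
    "lm_head.weight": "model-00002-of-00002.safetensors"
  }
}
"""


def shards_in_index(index_text):
    """Group tensor names by the shard file that stores them."""
    weight_map = json.loads(index_text)["weight_map"]
    shards = defaultdict(list)
    for tensor_name, shard_file in weight_map.items():
        shards[shard_file].append(tensor_name)
    return dict(shards)


for shard, tensors in sorted(shards_in_index(SAMPLE_INDEX).items()):
    print(f"{shard}: {len(tensors)} tensors")
```

You never need to do this by hand when loading a model, but reading the index is a quick way to see how many shard files a download will involve.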

A model's repo ID almost always has the form <owner>/<model-name>, for example HuggingFaceTB/SmolLM2-1.7B-Instruct or microsoft/Phi-3-mini-4k-instruct (a few legacy models, such as gpt2, also resolve under a bare name). You find this ID in the URL of the Hub page and pass it verbatim to any Transformers call.
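
If you copy the full page URL instead of the ID, the ID can be recovered from the path. A minimal sketch, assuming a model page URL in the two-part owner/name form (dataset or Space URLs have an extra path prefix and are not handled); repo_id_from_url is a hypothetical helper:

```python
from urllib.parse import urlparse


def repo_id_from_url(url):
    """Extract 'owner/model-name' from a Hub model page URL."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    if len(parts) < 2:
        raise ValueError(f"no owner/model-name in URL path: {url!r}")
    return "/".join(parts[:2])   # ignore trailing segments such as /tree/main


print(repo_id_from_url("https://huggingface.co/microsoft/Phi-3-mini-4k-instruct"))
```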

Step-by-step: run a model with Transformers

1. Install the library

Transformers requires Python 3.10 or later. Install it with PyTorch bundled in one command:

bash
pip install "transformers[torch]"

If you prefer uv (faster resolver): uv pip install "transformers[torch]". No special flag is needed on a machine without a GPU: the default PyTorch build also runs on CPU, although on Linux it bundles CUDA libraries and is a larger download than the CPU-only build.

2. Pick a beginner-friendly model

For your first run, choose a model small enough to load on a laptop. Two solid options:

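Model ID                             | Parameters | RAM needed | Good for
-------------------------------------|------------|------------|------------------------------
HuggingFaceTB/SmolLM2-1.7B-Instruct  | 1.7 B      | ~4 GB      | Fast CPU test, chat
microsoft/Phi-3-mini-4k-instruct     | 3.8 B      | ~8 GB      | Stronger reasoning on CPU/GPU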
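
The RAM figures in the table follow from simple arithmetic: parameter count times bytes per parameter, plus some headroom for activations and the runtime. A rough sketch, assuming 16-bit weights unless stated otherwise and ignoring that overhead (weight_memory_gb is a hypothetical helper):

```python
def weight_memory_gb(params_billions, bits_per_param=16):
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    bytes_total = params_billions * 1e9 * bits_per_param / 8
    return bytes_total / 1e9


models = [("SmolLM2-1.7B-Instruct", 1.7), ("Phi-3-mini-4k-instruct", 3.8)]
for name, size_b in models:
    print(f"{name}: ~{weight_memory_gb(size_b):.1f} GB at 16-bit, "
          f"~{weight_memory_gb(size_b, 32):.1f} GB at 32-bit")
```

The same function gives a first estimate for quantized weights, for example weight_memory_gb(7, 4) for a 7 B model at 4 bits per parameter.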

3. Run with the pipeline API

The pipeline function is the quickest entry point. It bundles the model, tokenizer, and a sensible inference loop into one callable object.

python
from transformers import pipeline

# First call downloads the weights (roughly 3-4 GB for this model) to ~/.cache/huggingface/hub
generator = pipeline(
    "text-generation",
    model="HuggingFaceTB/SmolLM2-1.7B-Instruct",
)

result = generator(
    "Explain what a neural network is in one sentence:",
    max_new_tokens=80,
    do_sample=True,
    temperature=0.7,
)

print(result[0]["generated_text"])

On the first run you will see a progress bar as each file is downloaded. Subsequent runs skip the download and load from the local cache immediately.

4. Use the lower-level API for more control

When you need to inspect logits, apply custom generation constraints, or batch multiple prompts, load the model and tokenizer separately with AutoModelForCausalLM and AutoTokenizer:

python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "HuggingFaceTB/SmolLM2-1.7B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    # float16 halves memory on a CUDA GPU; float32 is the safer choice on CPU
    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
    device_map="auto",           # places layers on GPU if available, else CPU
)

prompt = "What is the difference between RAM and storage?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=120)

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))

Alternative: download with the CLI or run a GGUF file

The Transformers path is convenient but downloads the unquantized weights (typically 16-bit), which come to roughly 6-15 GB for a 3-7 B model. If you are on a CPU-only machine or simply want smaller files, the GGUF path is the better choice: GGUF files are quantized (compressed) copies of the same model that fit in much less RAM and run faster on CPU.

Download any file with huggingface-cli

The huggingface-cli tool (also callable as hf) comes bundled with the huggingface_hub package (which is installed automatically with Transformers). You can use it to download individual files or whole repos:

bash
# Download an entire model repo into the default cache
huggingface-cli download HuggingFaceTB/SmolLM2-1.7B-Instruct

# Download a specific GGUF file to a local folder
huggingface-cli download microsoft/Phi-3-mini-4k-instruct-gguf \
    Phi-3-mini-4k-instruct-q4.gguf \
    --local-dir ~/models/

The --local-dir flag saves files to a plain folder of your choice instead of the hashed cache. That makes them easier to find and pass to other tools.

Run a GGUF file directly with llama.cpp

Once you have a .gguf file you can run inference with llama-cli, part of the llama.cpp project. The -hf flag can even download from the Hub for you in one step:

bash
# Install llama.cpp (macOS/Linux via Homebrew)
brew install llama.cpp

# Download AND run directly from the Hub
llama-cli -hf microsoft/Phi-3-mini-4k-instruct-gguf \
    --prompt "Explain transformers in two sentences" \
    -n 120

Or, if you already downloaded the file with huggingface-cli:

bash
llama-cli -m ~/models/Phi-3-mini-4k-instruct-q4.gguf \
    --prompt "Explain transformers in two sentences" \
    -n 120
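
If you want to drive the same command from a Python script, one option is the standard subprocess module. The sketch below only reuses the -m, --prompt, and -n flags shown above; llama_cli_command is a hypothetical helper, and how llama-cli behaves when run non-interactively can vary between versions, so treat this as a starting point rather than a tested integration.

```python
import shutil
import subprocess
from pathlib import Path


def llama_cli_command(model_path, prompt, max_tokens=120):
    """Build the argument list for the llama-cli call shown above."""
    return [
        "llama-cli",
        "-m", str(Path(model_path).expanduser()),
        "--prompt", prompt,
        "-n", str(max_tokens),
    ]


cmd = llama_cli_command("~/models/Phi-3-mini-4k-instruct-q4.gguf",
                        "Explain transformers in two sentences")
print(cmd)

# Only launch the process when the binary and the model file are both present.
if shutil.which(cmd[0]) and Path(cmd[2]).is_file():
    completed = subprocess.run(
        cmd,
        capture_output=True,
        text=True,
        stdin=subprocess.DEVNULL,   # avoid waiting for interactive input
    )
    print(completed.stdout)
```

Passing the arguments as a list (rather than one shell string) avoids quoting problems when the prompt contains spaces or special characters.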

Transformers vs GGUF: when to pick each

Factor             | Transformers pipeline                     | GGUF + llama.cpp
-------------------|-------------------------------------------|--------------------------------------------
Ease of setup      | pip install, two lines of Python          | Separate install (brew or build)
Memory usage       | Full precision: 2x-4x larger              | Quantized: much smaller (Q4 ~4 GB for 7B)
CPU speed          | Slower (PyTorch not optimized for CPU)    | Faster (optimized C++ kernels)
GPU support        | CUDA, MPS (Apple), ROCm                   | CUDA, Metal, ROCm
Python integration | Native, easy to embed in scripts          | Needs subprocess or llama-cpp-python binding
Model variety      | All Hub model types (vision, audio, etc.) | Text-generation models only

Common pitfalls and how to avoid them

  • Out-of-memory crash: if Python exits with a SIGKILL or RuntimeError: CUDA out of memory, the model is too large for your hardware. Try a smaller model, load it in 4-bit by passing a BitsAndBytesConfig with load_in_4bit=True (requires the bitsandbytes library), or switch to the GGUF path.
  • Slow load on every run: Transformers caches files after the first download but still deserializes weights from disk each time the model is loaded. Expect 15-60 seconds to load a 3 B model even from cache. This is normal.
  • Wrong model class: calling AutoModelForCausalLM on an encoder-only model (like BERT) will raise an error. Always check the model card on the Hub to confirm the architecture and intended task.
  • trust_remote_code=True risk: some models require this flag, which executes arbitrary Python code from the repo. Only set it for models from publishers you trust; never set it blindly on an unknown model.
  • Cache filling your disk: large models accumulate quickly. Run huggingface-cli delete-cache (or hf cache delete) to interactively choose which cached repos to remove.
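
Before deleting anything, it helps to see which cached repos take the most space. The following is a minimal standard-library sketch, assuming the default cache location and the models--<owner>--<name> folder layout; cached_repo_sizes is a hypothetical helper, and symlinks are skipped because snapshot folders usually point at the real files rather than duplicating them.

```python
import os
from pathlib import Path


def cached_repo_sizes(cache_dir):
    """Return {repo_folder_name: bytes on disk} for each model repo in a cache."""
    cache_dir = Path(cache_dir)
    if not cache_dir.is_dir():
        return {}
    sizes = {}
    for repo_dir in cache_dir.glob("models--*"):
        total = 0
        for root, _dirs, files in os.walk(repo_dir):
            for name in files:
                path = Path(root) / name
                if not path.is_symlink():   # count each real file once
                    total += path.stat().st_size
        sizes[repo_dir.name] = total
    return sizes


cache = Path.home() / ".cache" / "huggingface" / "hub"
for repo, size in sorted(cached_repo_sizes(cache).items(), key=lambda kv: -kv[1]):
    print(f"{size / 1e9:6.2f} GB  {repo}")
```

The output lists the largest repos first, so the best candidates for removal appear at the top.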

Going deeper

Once the basic flow works, several directions open up.

Quantization for lower RAM

The Transformers library integrates with bitsandbytes for 4-bit and 8-bit quantization. Passing a BitsAndBytesConfig with load_in_4bit=True as the quantization_config argument of from_pretrained() roughly quarters the memory footprint compared with float16, making a 7 B model fit in ~6 GB of VRAM. The same class lets you tune the quantization type (NF4 vs FP4) and whether to use double quantization for extra savings.

python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    quantization_config=quant_config,
    device_map="auto",
)

Chat templates and instruction formatting

Instruct-tuned models (typically those with -Instruct, -it, or -chat in the name) expect prompts wrapped in a chat template: a structured format that tells the model which text is a system instruction, which is the user's message, and where the assistant's reply begins. Passing a raw string often produces rambling or off-target output. Use the tokenizer's apply_chat_template method instead:

python
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is a tokenizer?"},
]

prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
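# The template already inserted the special tokens, so do not add them again
inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)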
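
To make the idea concrete, here is roughly what a template does to a message list. The sketch hand-rolls a ChatML-style layout, one common convention. Whether a particular model uses these exact markers is an assumption, so in real code rely on apply_chat_template, which reads the model's own template. render_chatml is a hypothetical helper written only for illustration.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Render a message list in a ChatML-style layout (illustration only)."""
    text = ""
    for message in messages:
        text += f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
    if add_generation_prompt:
        text += "<|im_start|>assistant\n"   # cue the model to start its reply
    return text


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is a tokenizer?"},
]
print(render_chatml(messages))
```

The printed string shows each turn wrapped in role markers and ends with an open assistant turn, which is what tells the model that it should now produce the reply.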

Exploring the Hub programmatically

The huggingface_hub Python library lets you search and inspect models from code. list_models(filter='text-generation', sort='downloads', limit=10) returns the ten most-downloaded text generation models. You can filter by pipeline tag, library, language, and licence. This is useful when building tools that need to present model choices to a user.

Offline and air-gapped environments

Set the environment variable HF_HUB_OFFLINE=1 (or TRANSFORMERS_OFFLINE=1) before running your script. Transformers will then load only from the local cache and raise an error instead of attempting a network request. This is essential for production deployments where outbound HTTP is blocked. Pre-download all required model repos during your CI/CD build step and ship the cache directory alongside your application.
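
The variable can also be set from inside the script, as long as that happens before the Hugging Face libraries are imported, since such flags are typically read at import time. A minimal sketch (the commented import is only a placeholder for whatever your script actually uses):

```python
import os

# Set the flag first: it needs to be in place before transformers
# or huggingface_hub are imported anywhere in the process.
os.environ["HF_HUB_OFFLINE"] = "1"

# Imports that should see the flag go below this line, for example:
# from transformers import pipeline

print("offline mode:", os.environ["HF_HUB_OFFLINE"])
```

Setting the variable in the shell or in your deployment config achieves the same thing and avoids any dependence on import order.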

FAQ

How much disk space do I need to download a model?

It depends on the model. A 1.7 B parameter model like SmolLM2 takes roughly 3-4 GB at 16-bit precision (float16 or bfloat16). A 7 B model takes 13-14 GB. GGUF Q4 quantized versions are roughly a quarter to a third of those sizes (about 4 GB for a 7 B model). Check the file listing on the model's Hub page before downloading.

Do I need a Hugging Face account to download models?

Most models are publicly available without authentication. A small number of models, including some Meta Llama variants, require you to log in to the Hub, accept a licence agreement, and then set an HF_TOKEN environment variable. You will see a clear error message if a token is required.

Why is inference so slow on my laptop?

CPU inference in PyTorch is significantly slower than GPU inference. For a 3 B model on a modern laptop CPU, expect 5-20 tokens per second, compared to 30-100+ tokens per second on a mid-range GPU. Switching to a GGUF file with llama.cpp typically doubles CPU speed. Choosing a smaller model (1-2 B) also helps a lot.

What is the difference between a base model and an instruct model?

A base model is trained to predict the next token in raw text — it will continue your prompt as if it were a document. An instruct model is fine-tuned further to follow instructions and hold a conversation. For practical use (Q&A, summarization, coding help), always pick the -Instruct or -chat variant of a model.

How do I update a cached model to a newer version?

By default, Transformers checks the Hub for updates each time you load a model while online. If the repo has been updated, the new files are downloaded. You can force a re-download by passing force_download=True to from_pretrained(). To pin to a specific commit, pass the full commit hash as the revision argument.

Can I run the model completely offline after the first download?

Yes. After the first successful download the weights are in ~/.cache/huggingface/hub. Set HF_HUB_OFFLINE=1 in your environment and Transformers will never attempt a network request. This is the recommended approach for production or air-gapped deployments.

Further reading