In plain English
The Gemini API is Google's public interface for running Gemini language models from your own code. When you chat with Gemini on Google's website, a person is typing and reading. The API is the same underlying intelligence, but a program sends the message and a program reads the answer — no browser required, no human in the loop.
A useful analogy: imagine Gemini is a very knowledgeable consultant locked inside a server room. The API is the pneumatic-tube system you use to pass notes in and receive answers back. You write your question on a slip of paper (your request payload), feed it into the tube, and a reply comes back a few seconds later. Google handles everything inside the room — the hardware, the model weights, the inference — you just deal with the tube.
The entry point for individual developers is Google AI Studio (aistudio.google.com). It is a free web interface where you prototype prompts, generate an API key, and inspect model responses — all without touching Google Cloud or setting up a billing account. Once you have a key, you send standard HTTPS requests from any language, or you use one of Google's official SDKs for Python and JavaScript/TypeScript.
Why it matters
Google's Gemini models, especially the Flash family, are among the most competitive models available in terms of price-to-capability ratio. They also come with a genuinely free tier for development, which is uncommon among mainstream models: real requests, real model intelligence, no credit card, and limits generous enough to build and test a prototype end to end.
Beyond the free tier, Gemini Flash offers one of the largest context windows in the industry — meaning you can feed the model an entire codebase, a long document, or hours of transcript without hitting a truncation wall. For applications that work with long documents, multi-step conversations, or multimodal inputs (text, images, audio, video), Gemini is a strong practical choice.
What you can build with it
- Chatbots and assistants — conversational interfaces powered by Gemini instead of hand-coded responses.
- Document analysis — summarize, extract, or query long PDFs, legal contracts, or research papers that exceed most models' context windows.
- Multimodal apps — send images alongside text ("describe what's wrong with this photo") using the same API endpoint.
- Content pipelines — rewrite, translate, classify, or generate text in bulk at low cost with Flash models.
- Agentic workflows — combine Gemini with function calling to let the model invoke your tools and complete multi-step tasks.
How it works
Every Gemini API call follows the same pattern: your code sends a POST request to https://generativelanguage.googleapis.com/v1beta/models/{model}:generateContent, authenticated with your API key either in the URL (?key=...) or in the x-goog-api-key header. The request body contains a contents array — the conversation so far. Google runs the model, then returns a JSON object with the candidate reply, token counts, and a finish reason.
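As a rough illustration of this request pattern, the sketch below assembles (but does not send) a generateContent request using only Python's standard library. The helper name build_generate_request and the placeholder key are hypothetical. Only the URL shape, the x-goog-api-key header, and the contents body come from the description above.

```python
import json
import urllib.request

BASE_URL = "https://generativelanguage.googleapis.com/v1beta/models"


def build_generate_request(model: str, prompt: str, api_key: str) -> urllib.request.Request:
    """Assemble a POST request for generateContent without sending it."""
    body = {"contents": [{"role": "user", "parts": [{"text": prompt}]}]}
    return urllib.request.Request(
        url=f"{BASE_URL}/{model}:generateContent",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "x-goog-api-key": api_key,  # header-based auth, as described above
            "Content-Type": "application/json",
        },
        method="POST",
    )


if __name__ == "__main__":
    # "AIza-placeholder" is not a real key; substitute your own before sending.
    req = build_generate_request("gemini-2.5-flash", "Hello!", "AIza-placeholder")
    print(req.get_method(), req.full_url)
```

Sending the request would be a matter of passing it to urllib.request.urlopen and parsing the JSON response. Keeping construction separate from sending makes the pattern easy to inspect and test offline.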
The contents array
Gemini uses a contents array instead of a messages array (the OpenAI convention), but the concept is identical. Each element has a role (user or model) and a parts array. Parts can be text, inline images, file references, or function results — the same object handles them all.
{
  "contents": [
    {
      "role": "user",
      "parts": [{ "text": "What is a transformer model, in two sentences?" }]
    }
  ]
}
For multi-turn conversations you resend the full history on each call, because Gemini, like most LLM APIs, is stateless by default. A system_instruction field at the top level of the request body sets the persona or background context (equivalent to the system role in other APIs).
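To make the stateless, resend-everything pattern concrete, here is a minimal sketch that builds a multi-turn request body as a plain dictionary. The helper build_chat_body, the tutor persona, and the placeholder reply text are hypothetical. The nesting of system_instruction as an object with its own parts array is an assumption about the REST shape, so check it against the official reference.

```python
from typing import Any


def build_chat_body(
    system_text: str,
    history: list[tuple[str, str]],
    new_message: str,
) -> dict[str, Any]:
    """Build a request body that replays prior turns plus one new user turn.

    `history` is a list of (role, text) pairs, where role is "user" or "model".
    """
    contents = [{"role": role, "parts": [{"text": text}]} for role, text in history]
    contents.append({"role": "user", "parts": [{"text": new_message}]})
    return {
        # Assumed shape: a top-level object holding its own parts array.
        "system_instruction": {"parts": [{"text": system_text}]},
        "contents": contents,
    }


history = [
    ("user", "What is a transformer model, in two sentences?"),
    ("model", "(previous model reply goes here)"),
]
body = build_chat_body("You are a concise tutor.", history, "Now explain attention.")
print([turn["role"] for turn in body["contents"]])
```

After each call, the application appends the model's reply to history so that the next request carries the full conversation.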
Model naming convention
Google follows a versioned naming scheme: gemini-{generation}.{minor}-{variant}. Flash models are fast and cheap; Pro models are larger and more capable. You pass the model name as a path segment in the URL or as the model parameter in the SDK. Always check the official models reference for the current list, because Google deprecates older versions and releases new ones regularly.
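The naming scheme can be sketched as a small parser. This is only an illustration of the gemini-{generation}.{minor}-{variant} pattern stated above. The regular expression, the ModelName type, and the example string gemini-2.5-flash-lite are assumptions, and real model names with extra suffixes (for example preview or date tags) will not match this simple pattern.

```python
import re
from typing import NamedTuple, Optional

# Matches names such as "gemini-2.5-flash"; the variant may contain hyphens.
MODEL_NAME = re.compile(r"^gemini-(\d+)\.(\d+)-([a-z]+(?:-[a-z]+)*)$")


class ModelName(NamedTuple):
    generation: int
    minor: int
    variant: str


def parse_model_name(name: str) -> Optional[ModelName]:
    """Split a model name into its parts, or return None if it does not fit."""
    m = MODEL_NAME.match(name)
    if m is None:
        return None
    return ModelName(int(m.group(1)), int(m.group(2)), m.group(3))


print(parse_model_name("gemini-2.5-flash"))
print(parse_model_name("gemini-2.5-flash-lite"))
print(parse_model_name("not-a-gemini-model"))
```

A parser like this is mainly useful for logging or routing by tier. The authoritative list of valid names remains the official models reference.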
Your first call, step by step
Four steps from a blank slate to a working Gemini integration.
- Get an API key. Go to aistudio.google.com, sign in with a Google account, click Get API key in the left sidebar, then Create API key. Google creates a new Cloud project and generates the key in one step. It starts with AIza followed by a long alphanumeric string.
- Store it safely. Never put the key directly in source code or commit it to a repository. Set it as an environment variable: export GEMINI_API_KEY=AIza... on macOS/Linux. The official SDK reads GEMINI_API_KEY automatically.
- Install the SDK. Google publishes google-genai for Python and @google/genai for JavaScript/TypeScript.
- Send the request and print the reply.
pip install -U google-genai
export GEMINI_API_KEY=AIza... # replace with your real key
from google import genai
# Reads GEMINI_API_KEY from the environment automatically.
client = genai.Client()
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="Explain what an API is in three simple sentences.",
)
# The reply text is here.
print(response.text)
print("Input tokens:", response.usage_metadata.prompt_token_count)
print("Output tokens:", response.usage_metadata.candidates_token_count)
That's a complete Gemini integration. Run it and a short answer appears alongside token counts. Every more advanced feature (streaming, multi-turn chat, function calling) is a variation on this same structure.
JavaScript / TypeScript
import { GoogleGenAI } from "@google/genai";
const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });
const response = await ai.models.generateContent({
  model: "gemini-2.5-flash",
  contents: "What is a context window?",
});
console.log(response.text);
Raw HTTP with curl
You can also call the API directly without any SDK — useful for quick tests or in environments where installing a package is inconvenient.
curl "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.5-flash:generateContent" \
-H "x-goog-api-key: $GEMINI_API_KEY" \
-H "Content-Type: application/json" \
-X POST \
-d '{"contents":[{"parts":[{"text":"Explain how AI works in two sentences."}]}]}'
Free tier, rate limits, and pricing
The Gemini API free tier requires no credit card and no billing account. It covers Flash and Flash-Lite models. Pro models moved behind paid billing in early 2026. The free tier is well suited for development, prototyping, and low-volume applications. When you need higher throughput or Pro model access, you enable Google Cloud billing and pay per token.
| Model tier | Free RPM | Free RPD | Free TPM | Paid input (per 1M tokens) | Paid output (per 1M tokens) |
|---|---|---|---|---|---|
| Flash-Lite | 15 | 1,000 | 250,000 | $0.10 | $0.40 |
| Flash | 10 | 250 | 250,000 | $0.30 | $2.50 |
| Pro | 5 | 100 | 250,000 | $4.00 | $18.00 |
How token billing works
You are billed separately for input tokens (everything you send: prompt, system instruction, conversation history, and any attached files, which are converted to tokens) and output tokens (what the model generates). Output costs roughly 4–8x more than input depending on the model tier. In a long multi-turn conversation, input tokens grow with every call because you resend the full history, so watch response.usage_metadata.prompt_token_count as conversations grow to avoid surprise costs.
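The arithmetic can be sketched as follows, using the paid prices from the table above. The function names and the example token counts are hypothetical. The conversation model assumes every call resends the full history and that each message and reply has a fixed size, which real conversations will not.

```python
# Paid prices in USD per 1M tokens, copied from the pricing table above.
PRICES = {
    "flash-lite": {"input": 0.10, "output": 0.40},
    "flash": {"input": 0.30, "output": 2.50},
    "pro": {"input": 4.00, "output": 18.00},
}


def call_cost(tier: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of a single call at the given tier."""
    p = PRICES[tier]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000


def conversation_cost(tier: str, turns: int, user_tokens: int, reply_tokens: int) -> float:
    """Total cost of a chat in which every call resends the whole history."""
    total = 0.0
    history_tokens = 0
    for _ in range(turns):
        input_tokens = history_tokens + user_tokens  # history plus the new message
        total += call_cost(tier, input_tokens, reply_tokens)
        history_tokens = input_tokens + reply_tokens  # the reply joins the history
    return total


# Hypothetical workload: 20 turns, 200-token messages, 500-token replies.
print(f"${conversation_cost('flash', 20, 200, 500):.4f}")
```

Because the history is resent each time, input tokens grow roughly quadratically with the number of turns, which is why long chats cost more than their visible text suggests.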
Context caching
Gemini supports context caching: you upload a large, static chunk of content once (a long document, a system prompt, a code file), and the model caches it server-side. Subsequent calls that reference the cache pay a significantly lower rate for those cached tokens. If your application repeatedly uses the same large document across many requests, context caching can cut costs substantially.
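A back-of-the-envelope comparison shows where caching pays off. This is only a sketch: the cached-token discount is passed in as a parameter because the text above does not state it, and the 0.25 used in the example is a placeholder, not a published rate. The model also ignores the cost of creating the cache and any storage charge the cache may incur.

```python
def input_cost_without_cache(
    doc_tokens: int, question_tokens: int, calls: int, price_per_million: float
) -> float:
    """Input cost when the full document is resent on every call."""
    return calls * (doc_tokens + question_tokens) * price_per_million / 1_000_000


def input_cost_with_cache(
    doc_tokens: int,
    question_tokens: int,
    calls: int,
    price_per_million: float,
    cached_price_ratio: float,
) -> float:
    """Input cost when document tokens are billed at a reduced cached rate.

    cached_price_ratio is the cached-token price as a fraction of the normal
    input price. It is an assumption supplied by the caller.
    """
    cached = calls * doc_tokens * price_per_million * cached_price_ratio
    fresh = calls * question_tokens * price_per_million
    return (cached + fresh) / 1_000_000


# Hypothetical workload: a 100,000-token document queried 50 times on Flash.
plain = input_cost_without_cache(100_000, 100, 50, 0.30)
cached = input_cost_with_cache(100_000, 100, 50, 0.30, cached_price_ratio=0.25)
print(f"without cache: ${plain:.4f}, with cache: ${cached:.4f}")
```

The larger the static document and the more calls that reuse it, the more the cached rate dominates the bill.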
Long context and multimodal inputs
One of Gemini's defining strengths is its context window — the total number of tokens the model can process in a single request (input plus output combined). As of mid-2026, Flash models support up to 1 million tokens and Pro models up to 2 million tokens. For reference, a million tokens is roughly 750,000 English words — enough to hold several full-length novels, or an entire medium-sized codebase.
This changes what is practical. Instead of chunking a 300-page contract and summarizing each chunk separately, you can send the entire document in one call and ask a precise question. Instead of summarizing a codebase incrementally, you can provide all the relevant files at once. Long-context access is available on the free tier, though very large requests will consume your daily quota quickly.
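Before sending a very large document, it can help to estimate whether it fits. The sketch below applies the rule of thumb above (1 million tokens is roughly 750,000 English words) to a whitespace word count. The function names and the 8,000-token output reserve are hypothetical, and the estimate is crude: the token counts the API returns in usage_metadata are the authoritative numbers.

```python
WORDS_PER_TOKEN = 0.75  # rule of thumb: 1M tokens is roughly 750,000 English words
FLASH_CONTEXT_TOKENS = 1_000_000  # Flash context window quoted above


def estimate_tokens(text: str) -> int:
    """Rough token estimate based on a whitespace word count."""
    words = len(text.split())
    return round(words / WORDS_PER_TOKEN)


def fits_in_context(
    text: str,
    context_tokens: int = FLASH_CONTEXT_TOKENS,
    reserve_for_output: int = 8_000,  # hypothetical headroom for the reply
) -> bool:
    """Check whether the text plus an output reserve fits the context window."""
    return estimate_tokens(text) + reserve_for_output <= context_tokens


contract = "word " * 120_000  # stand-in for a long document
print(estimate_tokens(contract), fits_in_context(contract))
```

Reserving headroom matters because the window covers input and output combined.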
Multimodal requests
The same generateContent endpoint accepts images, audio clips, and video alongside text — all mixed together in the parts array of a single request. You can pass images as Base64-encoded strings inline, or as URLs pointing to files already uploaded to the Gemini File API. For example, sending a screenshot and asking "what's wrong with this UI?" uses exactly the same code structure as a plain text request, just with an extra part containing the image data.
from google import genai
from google.genai import types
client = genai.Client()
# Read the raw image bytes; the SDK Base64-encodes them when building the request.
with open("screenshot.png", "rb") as f:
    image_bytes = f.read()
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=[
        types.Content(
            role="user",
            parts=[
                types.Part(text="What is shown in this image and what could be improved?"),
                # Pass raw bytes; encoding them yourself risks double encoding.
                types.Part.from_bytes(data=image_bytes, mime_type="image/png"),
            ],
        )
    ],
)
print(response.text)
Going deeper
Once the basics work, several additional capabilities unlock more powerful applications — all as parameters on the same generateContent call you already know.
Streaming responses
By default you wait for the entire reply, then receive it at once. For long answers this means staring at a blank screen for seconds. Streaming delivers tokens as they're generated, which produces the typewriter effect you see in chat interfaces. Use client.models.generate_content_stream() in the Python SDK. Each chunk carries the next piece of the reply rather than the accumulated text, so you print or append chunks as they arrive.
from google import genai
client = genai.Client()
for chunk in client.models.generate_content_stream(
    model="gemini-2.5-flash",
    contents="Write a short story about a robot learning to cook.",
):
    print(chunk.text, end="", flush=True)
Function calling (tool use)
Gemini can't browse the web or query a database on its own. Function calling lets you describe tools the model may invoke. When Gemini decides a tool is needed, it returns a structured functionCall part instead of prose; your code runs the function and sends the result back in the next turn. This is the foundation of agentic apps where the model orchestrates real actions.
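The application side of that loop can be sketched without calling the API at all. In the example below, get_weather, the TOOLS registry, and handle_function_call are hypothetical names, and the model output is simulated. The functionCall and functionResponse dictionaries follow the JSON part shapes as understood here, so verify the field names against the official reference before relying on them.

```python
from typing import Any, Callable


def get_weather(city: str) -> dict[str, Any]:
    """Hypothetical local tool; a real one would query a weather service."""
    return {"city": city, "forecast": "sunny", "temp_c": 21}


# Registry mapping the tool names described to the model onto local functions.
TOOLS: dict[str, Callable[..., dict[str, Any]]] = {"get_weather": get_weather}


def handle_function_call(part: dict[str, Any]) -> dict[str, Any]:
    """Run the tool named in a functionCall part and wrap the result for the next turn."""
    call = part["functionCall"]
    tool = TOOLS.get(call["name"])
    if tool is None:
        # Report unknown tools back to the model instead of crashing.
        result: dict[str, Any] = {"error": f"unknown tool: {call['name']}"}
    else:
        result = tool(**call.get("args", {}))
    return {"functionResponse": {"name": call["name"], "response": result}}


# Simulated model output asking the application to run a tool.
model_part = {"functionCall": {"name": "get_weather", "args": {"city": "Oslo"}}}
print(handle_function_call(model_part))
```

The returned part is appended to the conversation and sent back in the next request, after which the model can answer in prose using the tool's result.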
Gemini API vs. Vertex AI — when to switch
The Gemini Developer API (via AI Studio) is the right choice for individual developers, students, and startups building prototypes or early-stage products. The free tier, fast key generation, and minimal setup make it frictionless. Vertex AI is Google's enterprise platform: it runs the same Gemini models but adds SLAs, data residency controls, VPC networking, IAM, audit logging, and integration with the broader Google Cloud ecosystem. The recommended path is to start with the Developer API and migrate to Vertex AI when you need enterprise compliance or scale — the unified google-genai SDK supports both backends by changing one environment variable.
Gemini Developer API
- Free tier, no credit card required
- Key generated in seconds
- No Cloud project setup needed
- Best for prototypes and small projects
- Community-level support
Vertex AI
- No free tier; enterprise billing
- Full Cloud project + IAM setup
- SLA, data residency, VPC support
- Best for production at scale
- Enterprise support + compliance
Error handling and common HTTP errors
Real apps encounter failures. A 429 means you've exceeded your rate limit (requests per minute or tokens per minute). A 400 usually means a malformed request — check the model name and content structure. A 403 means your API key is invalid, expired, or blocked. The Python SDK raises typed exceptions you can catch:
from google import genai
from google.genai import errors  # exception types raised by the google-genai SDK
client = genai.Client()
try:
    response = client.models.generate_content(
        model="gemini-2.5-flash",
        contents="Hello!",
    )
    print(response.text)
except errors.APIError as e:
    # e.code holds the HTTP status code returned by the API.
    if e.code == 429:
        print("Rate limit hit: slow down or upgrade tier.")
    elif e.code == 403:
        print("API key invalid or expired: check GEMINI_API_KEY.")
    elif e.code == 400:
        print(f"Bad request: {e.message}")
    else:
        raise
| HTTP code | Means | What to do |
|---|---|---|
| 400 | Bad request — invalid model or malformed contents | Check model string and contents array structure |
| 403 | API key invalid, expired, or key type rejected | Regenerate your key in AI Studio; ensure it is an auth key |
| 429 | Rate limit exceeded (RPM or TPM) | Add backoff/retry logic; consider upgrading to paid tier |
| 500/503 | Google server error | Retry with exponential backoff; transient issue |
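The backoff advice in the table can be sketched as a small, SDK-independent retry helper. ApiCallError is a hypothetical stand-in for whatever exception your client raises, and the retryable codes (429, 500, 503) are taken from the table above. The attempt count, base delay, and jitter range are illustrative defaults, not recommended values.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")

RETRYABLE_CODES = {429, 500, 503}  # rate limit and transient server errors


class ApiCallError(Exception):
    """Hypothetical error type carrying an HTTP status code."""

    def __init__(self, code: int, message: str = "") -> None:
        super().__init__(f"{code} {message}".strip())
        self.code = code


def call_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying retryable errors with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except ApiCallError as e:
            last_attempt = attempt == max_attempts - 1
            if e.code not in RETRYABLE_CODES or last_attempt:
                raise  # non-retryable error, or retries exhausted
            # Delay doubles each attempt (1s, 2s, 4s, ...) plus up to 50% jitter.
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay / 2))
    raise ValueError("max_attempts must be at least 1")
```

In use, you would wrap the API call in a function that converts the client's exception into ApiCallError, or adapt the except clause to the exception type your client raises. The sleep parameter is injectable so the helper can be tested without waiting.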
FAQ
What is the Gemini API in simple terms?
It is a way for your code to send text (and optionally images, audio, or video) to Google's Gemini language models and receive an AI-generated reply. You make an HTTPS request to Google's servers with your question, a model name, and a secret API key. Google runs the model and returns the answer as JSON. It is how you embed Gemini inside an app, script, or automated workflow instead of using the Google Gemini website.
How do I get a Gemini API key for free?
Go to aistudio.google.com and sign in with a Google account. Click Get API key in the left sidebar, then Create API key. Google generates a key instantly, with no credit card and no manual Google Cloud project setup required (a project is created for you automatically). Store it in an environment variable (GEMINI_API_KEY) and never paste it into source code.
What models are available on the Gemini API free tier?
As of mid-2026, the free tier covers Flash and Flash-Lite models only. Pro models require a paid billing account. The free tier has real rate limits (requests per minute and per day) that are generous enough for development and light production use, but you will need to enable billing to scale up or access Pro-tier capability.
How large is the Gemini context window?
Flash models support up to 1 million tokens of context and Pro models support up to 2 million tokens. That is enough to hold an entire large codebase, a book-length document, or hours of transcript in a single call. Both paid and free-tier requests can use the full context window, though very large requests consume your daily quota faster.
What is the difference between the Gemini API and Vertex AI?
Both run Gemini models, but they target different audiences. The Gemini Developer API (accessed through Google AI Studio) is for individual developers and startups — free tier, instant key setup, no Cloud infrastructure. Vertex AI is Google's enterprise platform and adds SLAs, data residency, VPC networking, IAM, and audit logging. Start with the Developer API and migrate to Vertex AI only when you need enterprise compliance or Google Cloud integration. The google-genai SDK supports both with a one-line config change.
Does the Gemini API support images and multimodal inputs?
Yes. The same generateContent endpoint accepts text, images, audio, and video mixed together in the parts array of a single request. You can pass images inline as Base64-encoded bytes or as URIs pointing to files uploaded via the Gemini File API. Multimodal support is available on Flash models and is included in the free tier.