Overview
Gemini 1.5 Flash-8B was Google's smallest and lowest-cost model in the Gemini 1.5 family. Announced in preview during 2024 and made generally available on October 3, 2024 (as gemini-1.5-flash-8b-001), it was a transformer-decoder model derived from Gemini 1.5 Flash — distilled down to roughly 8 billion parameters while keeping the same multimodal capabilities and a context window exceeding 1 million tokens. Google described it as having the lowest cost per intelligence of any Gemini model at the time, and made it the seed of what later became the Gemini Flash-Lite tier.
Despite its tiny size, Gemini 1.5 Flash-8B accepted text, images, audio, and video as input and could read up to about 1,048,576 tokens of context (returning up to 8,192 output tokens). Google's own technical report showed it reaching roughly 80–90% of full Gemini 1.5 Flash quality on many benchmarks, making it well suited to high-volume, latency-sensitive jobs: chat, transcription, long-context translation, large-scale data labeling, and high-throughput agent serving. It launched at half the price of 1.5 Flash with double the rate limits and was available free through Google AI Studio and the Gemini API.
Gemini 1.5 Flash-8B is no longer a current model. Google deprecated the entire Gemini 1.5 line and shut down gemini-1.5-flash-8b on September 24, 2025, after which API calls return errors. Google steered users toward the cheaper, faster successors in the Flash and Flash-Lite lines (Gemini 2.0 Flash / Flash-Lite and later 2.5 Flash-Lite). This page documents it as a historical model.
| Released | 2024-10-03 |
|---|---|
| License | Proprietary |
| Weights | API only |
| Parameters | ~8B (single-digit billions; exact count undisclosed) |
| Context | 1M |
| Max output | 8K |
| Architecture | Transformer decoder model derived from Gemini 1.5 Flash, inheriting the same core architecture, optimizations, and data-mixture refinements at a smaller (~8B parameter) scale, with native multimodal support and a 1M+ token context window. Google did not publish the full parameter count. |
| Knowledge cutoff | August 2024 |
| Modalities | Text, Vision, Audio, Video |
| Status | Discontinued (shut down September 24, 2025) |
Benchmarks
- MMLU (5-shot)68.1%
- GPQA (0-shot)30.8%
- MATH (4-shot, Minerva prompt)35.9%
- BigBench-Hard (3-shot)69.5%
- Natural2Code (0-shot)67.6%
- MGSM (8-shot)70.5%
- MMMU (val, 4-shot)50.3%
- DocVQA (0-shot)73.6%
- TextVQA (0-shot)66.7%
Scores on a 0–100 scale (25-point gridlines); higher is better. Each benchmark links to its published source.
Pricing
| Input | $0.0375 / 1M tokens (prompts < 128K) per 1M tokens |
|---|---|
| Cached input | $0.01 / 1M tokens (prompts < 128K) per 1M tokens |
| Output | $0.15 / 1M tokens (prompts < 128K) per 1M tokens |
Launch pricing from October 2024; prices doubled for prompts longer than 128K tokens. A free tier was available via Google AI Studio and the Gemini API. The model was shut down on September 24, 2025 and is no longer billable.
Strengths
- Extremely low cost: at launch it was the cheapest Gemini model, at $0.0375 per 1M input tokens and $0.15 per 1M output tokens for prompts under 128K — half the price of Gemini 1.5 Flash
- 1M+ token context window despite an ~8B parameter footprint, with quality holding up to very long sequences in Google's long-context tests
- Native multimodal input across text, images, audio, and video — unusual for a single-digit-billion-parameter model
- High throughput and low latency, with double the rate limits of 1.5 Flash (up to 4,000 requests per minute), aimed at large-scale deployments
- Retained roughly 80–90% of full Gemini 1.5 Flash performance on many benchmarks while being far smaller and cheaper
- Free access through Google AI Studio and the Gemini API free tier
Best for
- High-volume chat, summarization, and classification where cost-per-token dominates
- Audio transcription and long-context language translation
- Large-scale automated data labeling to bootstrap training sets for other models
- High-throughput agent serving and integration into multi-model workflows
- Long-document and code analysis within a single 1M-token window
How to access
| Provider | Model ID |
|---|---|
| Google AI Studio / Gemini API ↗ | gemini-1.5-flash-8b |
Gemini Flash-Lite — every version
The full lineage of the Gemini Flash-Lite line, newest first. Every version has its own page — click any to compare specs, benchmarks and pricing.
| Version | Released | Context | License |
|---|---|---|---|
| Gemini 3.1 Flash-Litecurrent | 2026-03-03 | — | Proprietary |
| Gemini 2.5 Flash-Lite | 2025-06-17 | — | Proprietary |
| Gemini 2.0 Flash-Lite | 2025-02-01 | — | Proprietary |
| Gemini 1.5 Flash-8B | 2024-10-03 | — | Proprietary |
FAQ
When was Gemini 1.5 Flash-8B released, and is it still available?
Google made Gemini 1.5 Flash-8B (gemini-1.5-flash-8b-001) generally available on October 3, 2024. It is no longer available: Google deprecated the Gemini 1.5 line and shut the model down on September 24, 2025, so API calls now return errors. Google recommends migrating to newer Flash and Flash-Lite models such as Gemini 2.0 Flash.
How big is Gemini 1.5 Flash-8B and what could it do?
It was a roughly 8-billion-parameter transformer-decoder model distilled from Gemini 1.5 Flash. Despite its small size it was natively multimodal — accepting text, images, audio, and video — with a context window exceeding 1 million tokens (about 1,048,576 input tokens) and up to 8,192 output tokens. Google did not publish the exact parameter count.
How much did Gemini 1.5 Flash-8B cost?
At launch it was the cheapest Gemini model: $0.0375 per million input tokens, $0.15 per million output tokens, and $0.01 per million cached input tokens for prompts under 128K. Prices doubled for prompts longer than 128K, and a free tier was available through Google AI Studio and the Gemini API. As of its September 2025 shutdown it is no longer billable.
How well did Gemini 1.5 Flash-8B perform compared to Gemini 1.5 Flash?
Google's technical report showed Flash-8B reaching roughly 80–90% of full Gemini 1.5 Flash quality on many benchmarks. For example it scored 68.1% on MMLU (vs 78.9% for Flash), 50.3% on MMMU, 70.5% on MGSM, and 73.6% on DocVQA — a small quality trade-off in exchange for much higher throughput and lower cost.