AI/TLDR

Gemini 1.5 Flash-8B

Google's smallest, cheapest Gemini 1.5 model — an ~8B multimodal workhorse with a 1M-token context (now retired).

Overview

Gemini 1.5 Flash-8B was Google's smallest and lowest-cost model in the Gemini 1.5 family. Announced in preview during 2024 and made generally available on October 3, 2024 (as gemini-1.5-flash-8b-001), it was a transformer-decoder model derived from Gemini 1.5 Flash — distilled down to roughly 8 billion parameters while keeping the same multimodal capabilities and a context window exceeding 1 million tokens. Google described it as having the lowest cost per intelligence of any Gemini model at the time, and made it the seed of what later became the Gemini Flash-Lite tier.

Despite its tiny size, Gemini 1.5 Flash-8B accepted text, images, audio, and video as input and could read up to about 1,048,576 tokens of context (returning up to 8,192 output tokens). Google's own technical report showed it reaching roughly 80–90% of full Gemini 1.5 Flash quality on many benchmarks, making it well suited to high-volume, latency-sensitive jobs: chat, transcription, long-context translation, large-scale data labeling, and high-throughput agent serving. It launched at half the price of 1.5 Flash with double the rate limits and was available free through Google AI Studio and the Gemini API.

Gemini 1.5 Flash-8B is no longer a current model. Google deprecated the entire Gemini 1.5 line and shut down gemini-1.5-flash-8b on September 24, 2025, after which API calls return errors. Google steered users toward the cheaper, faster successors in the Flash and Flash-Lite lines (Gemini 2.0 Flash / Flash-Lite and later 2.5 Flash-Lite). This page documents it as a historical model.

Released2024-10-03
LicenseProprietary
WeightsAPI only
Parameters~8B (single-digit billions; exact count undisclosed)
Context1M
Max output8K
ArchitectureTransformer decoder model derived from Gemini 1.5 Flash, inheriting the same core architecture, optimizations, and data-mixture refinements at a smaller (~8B parameter) scale, with native multimodal support and a 1M+ token context window. Google did not publish the full parameter count.
Knowledge cutoffAugust 2024
ModalitiesText, Vision, Audio, Video
StatusDiscontinued (shut down September 24, 2025)

Benchmarks

  1. MMLU (5-shot)68.1%
  2. GPQA (0-shot)30.8%
  3. MATH (4-shot, Minerva prompt)35.9%
  4. BigBench-Hard (3-shot)69.5%
  5. Natural2Code (0-shot)67.6%
  6. MGSM (8-shot)70.5%
  7. MMMU (val, 4-shot)50.3%
  8. DocVQA (0-shot)73.6%
  9. TextVQA (0-shot)66.7%

Scores on a 0–100 scale (25-point gridlines); higher is better. Each benchmark links to its published source.

Pricing

Input$0.0375 / 1M tokens (prompts < 128K) per 1M tokens
Cached input$0.01 / 1M tokens (prompts < 128K) per 1M tokens
Output$0.15 / 1M tokens (prompts < 128K) per 1M tokens

Launch pricing from October 2024; prices doubled for prompts longer than 128K tokens. A free tier was available via Google AI Studio and the Gemini API. The model was shut down on September 24, 2025 and is no longer billable.

Pricing source ↗

Strengths

  • Extremely low cost: at launch it was the cheapest Gemini model, at $0.0375 per 1M input tokens and $0.15 per 1M output tokens for prompts under 128K — half the price of Gemini 1.5 Flash
  • 1M+ token context window despite an ~8B parameter footprint, with quality holding up to very long sequences in Google's long-context tests
  • Native multimodal input across text, images, audio, and video — unusual for a single-digit-billion-parameter model
  • High throughput and low latency, with double the rate limits of 1.5 Flash (up to 4,000 requests per minute), aimed at large-scale deployments
  • Retained roughly 80–90% of full Gemini 1.5 Flash performance on many benchmarks while being far smaller and cheaper
  • Free access through Google AI Studio and the Gemini API free tier

Best for

  • High-volume chat, summarization, and classification where cost-per-token dominates
  • Audio transcription and long-context language translation
  • Large-scale automated data labeling to bootstrap training sets for other models
  • High-throughput agent serving and integration into multi-model workflows
  • Long-document and code analysis within a single 1M-token window

How to access

ProviderModel ID
Google AI Studio / Gemini API ↗gemini-1.5-flash-8b

Gemini Flash-Lite — every version

The full lineage of the Gemini Flash-Lite line, newest first. Every version has its own page — click any to compare specs, benchmarks and pricing.

VersionReleasedContextLicense
Gemini 3.1 Flash-Litecurrent2026-03-03Proprietary
Gemini 2.5 Flash-Lite2025-06-17Proprietary
Gemini 2.0 Flash-Lite2025-02-01Proprietary
Gemini 1.5 Flash-8B2024-10-03Proprietary

FAQ

When was Gemini 1.5 Flash-8B released, and is it still available?

Google made Gemini 1.5 Flash-8B (gemini-1.5-flash-8b-001) generally available on October 3, 2024. It is no longer available: Google deprecated the Gemini 1.5 line and shut the model down on September 24, 2025, so API calls now return errors. Google recommends migrating to newer Flash and Flash-Lite models such as Gemini 2.0 Flash.

How big is Gemini 1.5 Flash-8B and what could it do?

It was a roughly 8-billion-parameter transformer-decoder model distilled from Gemini 1.5 Flash. Despite its small size it was natively multimodal — accepting text, images, audio, and video — with a context window exceeding 1 million tokens (about 1,048,576 input tokens) and up to 8,192 output tokens. Google did not publish the exact parameter count.

How much did Gemini 1.5 Flash-8B cost?

At launch it was the cheapest Gemini model: $0.0375 per million input tokens, $0.15 per million output tokens, and $0.01 per million cached input tokens for prompts under 128K. Prices doubled for prompts longer than 128K, and a free tier was available through Google AI Studio and the Gemini API. As of its September 2025 shutdown it is no longer billable.

How well did Gemini 1.5 Flash-8B perform compared to Gemini 1.5 Flash?

Google's technical report showed Flash-8B reaching roughly 80–90% of full Gemini 1.5 Flash quality on many benchmarks. For example it scored 68.1% on MMLU (vs 78.9% for Flash), 50.3% on MMMU, 70.5% on MGSM, and 73.6% on DocVQA — a small quality trade-off in exchange for much higher throughput and lower cost.