Google debuts DiffusionGemma, a text model built for speed with parallel generation

Experimental 26B Mixture-of-Experts model uses diffusion to produce blocks of text up to four times faster on dedicated GPUs

Google introduced DiffusionGemma, an experimental open model that uses text diffusion and parallel generation to produce text up to four times faster than typical language models on dedicated GPUs. The 26-billion-parameter Mixture of Experts model activates only 3.8 billion parameters at inference, fits within 18GB when quantized, and achieves more than 1,000 tokens per second on an NVIDIA H100 and over 700 tokens per second on GeForce RTX 5090 hardware. Google released the model under an Apache 2.0 license on Hugging Face and positioned it for speed-critical, interactive local workflows.

Key Points

DiffusionGemma is a 26 billion parameter Mixture of Experts model that activates 3.8 billion parameters during inference and can be quantized to fit within 18GB VRAM.
The model generates 256 tokens in parallel per forward pass and achieves over 1,000 tokens per second on a single NVIDIA H100 GPU and over 700 tokens per second on NVIDIA GeForce RTX 5090 hardware.
Released under an Apache 2.0 license on Hugging Face, DiffusionGemma targets speed-critical, interactive local workflows and is compatible with MLX, vLLM (with Red Hat integration), Hugging Face Transformers, Unsloth, and NVIDIA NeMo.

Google has released DiffusionGemma, an experimental open-source model that departs from the sequential token-by-token approach used by conventional language models and instead generates complete blocks of text in parallel. The company says the model can produce text up to four times faster than typical language models when run on dedicated GPUs.

Architecturally, DiffusionGemma is a 26 billion parameter Mixture of Experts model. During inference the system activates only 3.8 billion parameters, and when quantized it can fit within an 18GB VRAM envelope typical of high-end consumer GPUs. That operational footprint is central to the model's appeal for local and interactive use cases.
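The footprint claim can be sanity-checked with simple arithmetic. The sketch below is a rough estimate only: it counts weights and nothing else, and the bit widths it tries are assumptions, since the article does not say which quantization format Google used. The function name and constants are illustrative.

```python
# Back-of-the-envelope weight footprint for a 26B-parameter model.
# Assumption: only the weights are counted; activations, caches and
# runtime overhead need additional memory on top of this.

TOTAL_PARAMS = 26e9      # total parameters (from the article)
ACTIVE_PARAMS = 3.8e9    # parameters active per inference step (from the article)
VRAM_BUDGET_GB = 18      # quantized footprint cited in the article


def weight_footprint_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in decimal gigabytes."""
    return num_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    # The bit widths below are illustrative guesses, not Google's stated format.
    for bits in (16, 8, 4):
        size = weight_footprint_gb(TOTAL_PARAMS, bits)
        verdict = "within" if size <= VRAM_BUDGET_GB else "over"
        print(f"{bits:>2}-bit weights: {size:5.1f} GB ({verdict} {VRAM_BUDGET_GB} GB)")

    # Share of the model that does work on any given step.
    print(f"active share: {ACTIVE_PARAMS / TOTAL_PARAMS:.1%}")
```

Under these assumptions, 16-bit weights come to 52 GB and 8-bit weights to 26 GB, while 4-bit weights come to 13 GB, which would leave headroom inside an 18GB envelope. In a Mixture of Experts model all experts generally have to be resident in memory even though only a fraction are active on each step, which is why the estimate uses the full 26 billion parameters rather than the 3.8 billion active ones.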

Performance figures published by Google show DiffusionGemma exceeding 1,000 tokens per second on a single NVIDIA H100 GPU and topping 700 tokens per second on NVIDIA GeForce RTX 5090 hardware. The model produces 256 tokens in parallel in each forward pass. Because a whole block is generated at once, the model can use bi-directional attention, in which every token attends to all the others. DiffusionGemma also iteratively refines its own outputs, correcting its generations as it runs.
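To give a feel for why filling a block in parallel takes fewer model calls than writing one token at a time, here is a toy sketch of a generic masked-denoising loop. It is not DiffusionGemma's actual algorithm, which the article does not describe in detail: `toy_forward_pass` is a hypothetical stand-in that reads answers from a known target and attaches random confidences, and the loop only commits tokens without revising earlier ones.

```python
import math
import random

MASK = None  # placeholder for a position that has not been filled in yet


def toy_forward_pass(block, target, rng):
    """Hypothetical stand-in for one model forward pass.

    Returns a (token, confidence) proposal for every position in the block
    at once. A real model would predict tokens from context; this toy simply
    reads them from `target` and assigns a random confidence.
    """
    return [(target[i], rng.random()) for i in range(len(block))]


def generate_block(target, tokens_per_step, seed=0):
    """Fill a fully masked block, committing the most confident proposals each pass."""
    rng = random.Random(seed)
    block = [MASK] * len(target)
    passes = 0
    while MASK in block:
        proposals = toy_forward_pass(block, target, rng)
        passes += 1
        # Rank the still-masked positions by confidence, highest first.
        masked = [i for i, tok in enumerate(block) if tok is MASK]
        masked.sort(key=lambda i: proposals[i][1], reverse=True)
        # Commit the best few; the rest stay masked for the next pass.
        for i in masked[:tokens_per_step]:
            block[i] = proposals[i][0]
    return block, passes


if __name__ == "__main__":
    target = list("diffusion fills blocks in parallel")
    block, passes = generate_block(target, tokens_per_step=8)
    print("".join(block))
    print(f"{passes} parallel passes vs {len(target)} sequential steps")
```

In this toy, the number of passes is the block length divided by the tokens committed per pass, rounded up, rather than one pass per token. How many refinement passes a real diffusion model needs per block, and how it revises tokens it has already placed, depends on the model and is not something this sketch captures.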

Google acknowledges a trade-off: while DiffusionGemma emphasizes generation speed and parallelism, its overall output quality is lower than that of standard Gemma 4 models. The company has positioned the model for researchers and developers focused on speed-critical and interactive local workflows - specifically citing scenarios such as in-line editing, rapid iteration, and the production of non-linear text structures.

DiffusionGemma has been released under an Apache 2.0 license on Hugging Face. Google says the model is compatible with a range of tooling and runtimes, including MLX, vLLM with Red Hat integration, Hugging Face Transformers, Unsloth, and NVIDIA NeMo.

On the hardware front, Google worked with NVIDIA to tune performance across multiple layers of the stack. Optimizations cover consumer-oriented GPUs such as GeForce RTX 5090 and 4090, as well as enterprise-grade Hopper and Blackwell systems running NVFP4 kernels.

For practitioners and organizations evaluating the trade-offs between speed and generation quality, DiffusionGemma presents a distinct option: substantially higher throughput through parallel diffusion-based generation in exchange for lower fidelity compared with Gemma 4.

Risks

Google notes that DiffusionGemma's overall output quality is lower than the company's standard Gemma 4 models - a trade-off between speed and fidelity relevant to applications requiring high-quality text.
The model's specialization for speed-critical, interactive local workflows may limit its suitability for production use cases where generation quality is paramount.
Hardware and software compatibility depends on specific optimizations (for example, NVFP4 kernels on Hopper and Blackwell systems), which could add deployment complexity in some enterprise environments.
