Gemma 4 12B: The Developer Guide
Gemma 4 12B drops separate encoders for a unified decoder-only architecture that handles text, images, and audio natively with shared weights. The model runs on 16 GB VRAM, serves as a local OpenAI-compatible API via LiteRT-LM, and fine-tunes through Hugging Face or Unsloth in a single pass. 38 variants are available on Ollama, including four fresh 12B MLX builds for Apple Silicon.
Source:
Google Developer Blog
· Published June 3, 2026
Google published a full developer guide for Gemma 4 12B on June 3, 2026, covering its new mid-sized multimodal model. The core technical change: separate vision and audio encoders are gone, replaced by a single decoder-only transformer where text, images, and audio share the same weights. No separate co-tuning required, no independent subsystems to maintain, and the whole thing fits in 16 GB of VRAM.
This guide covers everything needed to deploy, fine-tune, and integrate Gemma 4 12B: architecture, local deployment via LiteRT-LM, macOS desktop apps, fine-tuning paths, agentic development with the Gemma Skills Repository, and production options on Google Cloud. Google set this direction at Google I/O 2026 by shifting focus from model availability to operational integration. Gemma 4 12B is a direct response.
What changes with an encoder-free architecture?
Most multimodal models of comparable size combine an LLM with specialized encoders: a 27-layer vision transformer for images and a 12-layer conformer encoder for audio. These components train separately, then get frozen during LLM training. The cost for developers: higher inference latency from two encoding passes, mandatory co-tuning of LLM and encoders during fine-tuning, and two distinct subsystems to manage in production.
Gemma 4 12B removes this two-level architecture. The model uses the Gemma 4 31B Dense decoder, with two lightweight embedders replacing the encoders:
- Vision embedder (35M parameters): raw 48×48 pixel patches are projected directly into the LLM’s representation space via a single matrix multiplication. Spatial position is encoded via factorized coordinate lookup tables, with no additional transformer layers.
- Audio embedder: raw 16 kHz audio signals are sliced into 40ms frames (640 floats each) and projected linearly into the LLM input space, with no conformer encoder.
The practical result: vision, audio, and text share the exact same decoder weights. Adapter-based or full fine-tuning updates the entire multimodal pipeline in a single pass, with a single learning rate, using Hugging Face or Unsloth.
How does Gemma 4 12B handle vision and audio in practice?
For vision, the model uses a default visual token budget of 70 per image. In the demonstration published in the guide, Gemma 4 12B processed a 5-minute clip from the Google I/O 2026 keynote (timestamps 00:15:32-00:20:45) at 1 FPS, analyzing 313 frames. Asked to interpret a complex scene from Gemini Omni showing a generated “selfie”, the model correctly identified the visual metaphor: a multimodal AI’s ability to reimagine existing content and generate new scenarios from it.
For audio, Gemma 4 12B is the first mid-sized Gemma model with native audio input. Documented capabilities include automatic speech recognition (ASR), diarization (speaker identification in an audio stream), and agentic reasoning over audio content. In a local orchestration demo, the model used the OpenCode agent and gemma-skills to autonomously generate a Gradio image processing app, served entirely via llama.cpp.
One practical note on Ollama variants: the four Gemma 4 12B variants available at publication time are MLX-only, meaning text-only on Apple Silicon. Full multimodal support (text, image, and audio) will come with future q4_K_M, q8_0, and bf16 variants of the 12B. For image support right now, gemma4:e4b-it-q4_K_M (9.6 GB) and gemma4:e2b-it-q4_K_M (7.2 GB) are available.
Can Gemma 4 12B run locally?
The hardware requirement: 16 GB of VRAM (discrete GPU) or unified memory (Apple Silicon). The e4b-it-q4_K_M variant weighs 9.6 GB, leaving approximately 6 GB for context and KV cache on a 16 GB GPU. The e4b-it-bf16 variant (full precision) uses the full 16 GB.
LiteRT-LM: local OpenAI-compatible server. Google ships a new litert-lm serve CLI command that exposes Gemma 4 12B as a local API compatible with the OpenAI API spec, with stateless prefix caching to reduce latency on repeated requests sharing a common system context.
litert-lm import --from-huggingface-repo=litert-community/gemma-4-12B-it-litert-lm gemma-4-12B-it.litertlm gemma4-12b
litert-lm serve
Once running, it integrates with Continue, Aider, OpenClaw, Hermes, and OpenCode without additional configuration. The pattern is the same as Ollama or llama.cpp local servers, but LiteRT-LM is Google’s official tool optimized for the .litertlm format and the AI Edge ecosystem.
macOS desktop apps. Two applications run Gemma 4 12B fully offline on Apple Silicon:
- Google AI Edge Gallery (expanded to desktop): sandboxed execution with Python support for scientific computing, accessible to non-developers.
- Google AI Edge Eloquent: conversational voice interface with Voice Edit input support, enabling direct audio interactions with the model.
Other supported frameworks: LM Studio, Ollama, Hugging Face Transformers, llama.cpp, MLX, SGLang, and vLLM.
How do you fine-tune Gemma 4 12B?
The unified architecture has a direct impact on fine-tuning: there’s no decision to make about whether to fine-tune the encoder, the LLM, or both, and no differential learning rates between components. A LoRA adapter or a full fine-tune updates the entire multimodal pipeline in a single pass.
Via Hugging Face Transformers: the model is on Hugging Face with quick-start notebooks from Google. PEFT-compatible for adapter-based training (LoRA, QLoRA). For multimodal use cases (vision and text), the data preparation pipeline is identical to text-only fine-tuning: images are tokenized through the embedder and passed as additional tokens in the input sequence.
Via Unsloth: Unsloth supports Gemma 4 12B for memory-efficient training, reducing memory footprint by 30-60% on LoRA fine-tunes compared to the base Transformers implementation. Well-suited for 16-24 GB VRAM environments where a full fine-tune would not fit.
For teams working on multimodal tasks (code generation from screenshots, document analysis with visuals, audio transcription and structuring), the unified weights also simplify dataset construction: a mixed text, image, and audio dataset runs through a single pipeline without modality separation.
What is the Gemma Skills Repository?
Google simultaneously releases the Gemma Skills Repository, an official library of pre-packaged skills for building agents with Gemma models. The repository provides gemma-skills usable via agentic harnesses like OpenCode.
The example in the guide is direct: Gemma 4 12B used itself to build a local image processing application. The model generated the code for a Gradio app, reviewed it, and served it locally via llama.cpp, with OpenCode acting as coordinator. The entire workflow ran without cloud dependencies.
The Gemma Skills Repository positions Gemma 4 12B beyond text completion, as a foundation for local agentic workflows, at a time when governance and permission control of AI agents have become central concerns in enterprise adoption.
What production deployment options exist on Google Cloud?
For teams moving beyond local inference, Google offers three options in its ecosystem:
- Gemini Enterprise Agent Platform Model Garden: managed access within the Vertex AI ecosystem. Best for organizations already on Google Cloud that want a managed deployment with built-in monitoring.
- Cloud Run: serverless deployment, billed per use. Suitable for variable-traffic APIs or testing without fixed infrastructure costs.
- GKE (Google Kubernetes Engine): for sustained-load deployments with horizontal auto-scaling, suited to teams already managing a Kubernetes infrastructure.
In all three cases, the same model weights are used. There is no cloud-specific version of the weights, which enables hybrid architectures (local dev, cloud prod) or gradual migrations from cloud to local.
Which Gemma 4 variants are available on Ollama?
The Ollama library lists 38 Gemma 4 variants, added progressively over two months. The four Gemma 4 12B MLX variants were published just hours before this article was written. The range breaks into three families: standard models (quantized and bf16, supporting text and image), MLX variants for Apple Silicon (text only), and specialized variants (cloud, coding, MTP).
Gemma 4 12B variants on Ollama:
| Ollama Tag | Size | Context | Inputs | Format |
|---|---|---|---|---|
gemma4:12b-mlx |
10.0 GB | 128K | Text | MLX q4 |
gemma4:12b-mlx-bf16 |
24 GB | 128K | Text | MLX bf16 |
gemma4:12b-mxfp8 |
12 GB | 128K | Text | MLX fp8 |
gemma4:12b-nvfp4 |
10.0 GB | 128K | Text | MLX fp4 |
To get started: ollama run gemma4:12b-mlx
Full table of all 38 Gemma 4 variants:
| Ollama Tag | Size | Context | Inputs |
|---|---|---|---|
gemma4:latest |
9.6 GB | 128K | Text, Image |
gemma4:e2b |
7.2 GB | 128K | Text, Image |
gemma4:e4b |
9.6 GB | 128K | Text, Image |
gemma4:26b |
18 GB | 256K | Text, Image |
gemma4:31b |
20 GB | 256K | Text, Image |
gemma4:e2b-it-q4_K_M |
7.2 GB | 128K | Text, Image |
gemma4:e2b-it-q8_0 |
8.1 GB | 128K | Text, Image |
gemma4:e2b-it-bf16 |
10 GB | 128K | Text, Image |
gemma4:e2b-mlx |
7.1 GB | 128K | Text |
gemma4:e2b-mlx-bf16 |
10 GB | 128K | Text |
gemma4:e2b-mxfp8 |
7.9 GB | 128K | Text |
gemma4:e2b-nvfp4 |
7.1 GB | 128K | Text |
gemma4:e4b-it-q4_K_M |
9.6 GB | 128K | Text, Image |
gemma4:e4b-it-q8_0 |
12 GB | 128K | Text, Image |
gemma4:e4b-it-bf16 |
16 GB | 128K | Text, Image |
gemma4:e4b-mlx |
9.6 GB | 128K | Text |
gemma4:e4b-mlx-bf16 |
16 GB | 128K | Text |
gemma4:e4b-mxfp8 |
11 GB | 128K | Text |
gemma4:e4b-nvfp4 |
9.6 GB | 128K | Text |
gemma4:12b-mlx |
10.0 GB | 128K | Text |
gemma4:12b-mlx-bf16 |
24 GB | 128K | Text |
gemma4:12b-mxfp8 |
12 GB | 128K | Text |
gemma4:12b-nvfp4 |
10.0 GB | 128K | Text |
gemma4:26b-a4b-it-q4_K_M |
18 GB | 256K | Text, Image |
gemma4:26b-a4b-it-q8_0 |
28 GB | 256K | Text, Image |
gemma4:26b-mlx |
17 GB | 256K | Text |
gemma4:26b-mlx-bf16 |
52 GB | 256K | Text |
gemma4:26b-mxfp8 |
27 GB | 256K | Text |
gemma4:26b-nvfp4 |
17 GB | 256K | Text |
gemma4:31b-cloud |
— | 256K | Text, Image |
gemma4:31b-coding-mtp-bf16 |
64 GB | 256K | Text |
gemma4:31b-it-q4_K_M |
20 GB | 256K | Text, Image |
gemma4:31b-it-q8_0 |
34 GB | 256K | Text, Image |
gemma4:31b-it-bf16 |
63 GB | 256K | Text, Image |
gemma4:31b-mlx |
20 GB | 256K | Text |
gemma4:31b-mlx-bf16 |
63 GB | 256K | Text |
gemma4:31b-mxfp8 |
32 GB | 256K | Text |
gemma4:31b-nvfp4 |
20 GB | 256K | Text |
FAQ
What is the difference between Gemma 4 12B and the e2b/e4b models?
The e2b (2 billion parameters) and e4b (4 billion) models are the edge variants of Gemma 4, designed for memory-constrained devices such as mobile or edge computing hardware. Gemma 4 12B is a new dense mid-sized model, more capable on complex tasks, with native audio input and a 128K context window. All share the unified decoder-only architecture.
Can Gemma 4 12B be used multimodally on Ollama right now?
The four 12B variants on Ollama are currently MLX-only, meaning text-only on Apple Silicon. For image support right now, gemma4:e4b-it-q4_K_M (9.6 GB) or gemma4:e2b-it-q4_K_M (7.2 GB) are available with text and image input. Full 12B variants with image and audio support should arrive in the coming weeks.
Is LiteRT-LM a replacement for llama.cpp or Ollama?
No, they are complementary tools. LiteRT-LM is Google’s official tool for the .litertlm format, optimized for Gemma models and the AI Edge ecosystem. llama.cpp and Ollama support Gemma 4 via their own GGUF conversions. All three expose a local OpenAI-compatible endpoint. The choice depends on which tool ecosystem is already in place.
Can Gemma 4 12B replace a cloud model for RAG or agentic workflows?
For RAG over internal documents, yes in many cases. The 128K context window covers most document sizes, the multimodal architecture handles PDFs with images, and LiteRT-LM or Ollama expose an endpoint compatible with LangChain, LlamaIndex, or any RAG framework. For complex agentic workloads with parallel tool calls, larger cloud models remain superior. But Gemma 4 12B covers a wide range of use cases at lower cost and without data exposure.
Read the full developer guide →
Gemma 4 12B asks a practical question for teams building AI pipelines today: how much of the multimodal stack actually needs to run in the cloud? A 9.6 GB Q4 model that handles text, images, and audio with shared weights, serves as a local OpenAI-compatible API, and fine-tunes in a single pass shifts the calculus on data sovereignty and inference costs. With 38 variants on Ollama, a Gemma Skills Repository for agentic workflows, and offline macOS desktop apps, the Gemma 4 family has become a credible foundation for local-first AI, precisely as agent governance and security move to the top of the enterprise AI agenda.
