Google Launches New Gemma 4 12B Model: Handles Visual and Audio Data Without a Separate Encoder

Google has officially released Gemma 4 12B, its latest unified multimodal model. The model has 12 billion parameters, and its most notable feature is that it processes visual and audio data directly, without the separate multimodal encoders such models have traditionally used. To fit consumer hardware, Gemma 4 12B needs only 16GB of VRAM or unified memory, so users can run it locally on a high-end laptop instead of relying on cloud computing resources.
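The 16GB figure is easier to interpret with a quick back-of-the-envelope estimate. The sketch below assumes the "12B" in the name means roughly 12 billion parameters and counts only the weights. The article does not say which precision or quantization the 16GB figure assumes, and the KV cache, activations and multimodal embedding layers all add to the real total.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate storage for the model weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    total_bytes = num_params * bits_per_param / 8
    return total_bytes / 1e9


if __name__ == "__main__":
    params = 12e9   # assumption: "12B" read as roughly 12 billion parameters
    budget_gb = 16  # VRAM / unified-memory figure quoted in the article
    for bits in (16, 8, 4):  # bf16, 8-bit and 4-bit weights
        need = weight_memory_gb(params, bits)
        verdict = "fits" if need < budget_gb else "does not fit"
        # Weights only: KV cache and activations are not counted here.
        print(f"{bits:>2}-bit weights: {need:5.1f} GB -> {verdict} in {budget_gb} GB")
```

At 16 bits per parameter the weights alone come to about 24 GB, so running in 16GB most likely involves 8-bit or lower-precision weights. That is an inference from the arithmetic, not a detail the article states.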

The key design change in Gemma 4 12B is the removal of the encoder components that multimodal models have traditionally relied on. Earlier multimodal models converted images and sounds through separate vision and audio encoders; Gemma 4 12B instead handles visual input with a lightweight embedding layer that needs only a single matrix multiplication, a positional embedding and a normalization step, which greatly reduces computational complexity. Audio signals, meanwhile, are projected directly into the same embedding dimension as text tokens, so no audio encoder is needed either. This encoder-free design cuts the number of computation steps during inference and makes the model more compact.
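The two encoder-free input paths described above can be sketched in a few lines of plain Python. This is an illustration, not Gemma's code: the article gives neither the model width, the patch size, the audio feature size nor the exact normalization, so `D_MODEL`, `PATCH`, `AUDIO_FEATS`, the row-major patch order and the choice of RMSNorm are all assumptions, and random weights stand in for learned ones.

```python
import math
import random

D_MODEL = 8       # hypothetical text-embedding width (real value not given in the article)
PATCH = 4         # hypothetical patch size: PATCH x PATCH pixels per visual token
AUDIO_FEATS = 10  # hypothetical number of features per audio frame


def matvec(weight, vec):
    """Multiply a (rows x cols) matrix, stored as a list of rows, by a length-cols vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]


def rms_norm(vec, eps=1e-6):
    """Scale a vector to (approximately) unit root-mean-square; assumed stand-in for the unnamed normalization."""
    rms = math.sqrt(sum(x * x for x in vec) / len(vec) + eps)
    return [x / rms for x in vec]


def patchify(image, patch):
    """Split an H x W grayscale image (list of rows) into flattened patch x patch tiles, row-major.
    Assumes H and W are multiples of `patch`."""
    h, w = len(image), len(image[0])
    tiles = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            tiles.append([image[r][c]
                          for r in range(top, top + patch)
                          for c in range(left, left + patch)])
    return tiles


def embed_image(image, w_vision, pos_table):
    """Visual path: one matrix multiply, add a positional embedding, then normalize."""
    tokens = []
    for i, patch in enumerate(patchify(image, PATCH)):
        projected = matvec(w_vision, patch)                            # matrix multiplication
        positioned = [p + q for p, q in zip(projected, pos_table[i])]  # positional embedding
        tokens.append(rms_norm(positioned))                            # normalization
    return tokens


def embed_audio(frames, w_audio):
    """Audio path: project each frame straight to the text-token width, no encoder network."""
    return [matvec(w_audio, frame) for frame in frames]


def random_matrix(rows, cols, rng):
    """Random stand-in for a learned weight matrix."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


if __name__ == "__main__":
    rng = random.Random(0)
    image = [[rng.random() for _ in range(8)] for _ in range(8)]  # 8x8 image -> 4 patches
    frames = [[rng.random() for _ in range(AUDIO_FEATS)] for _ in range(3)]

    w_vision = random_matrix(D_MODEL, PATCH * PATCH, rng)
    pos_table = random_matrix(4, D_MODEL, rng)  # one row per patch position
    w_audio = random_matrix(D_MODEL, AUDIO_FEATS, rng)

    visual_tokens = embed_image(image, w_vision, pos_table)
    audio_tokens = embed_audio(frames, w_audio)
    # Both kinds of token now have width D_MODEL, the same width as a text token embedding.
    print(len(visual_tokens), len(visual_tokens[0]))
    print(len(audio_tokens), len(audio_tokens[0]))
```

The point of the sketch is the shape bookkeeping: after one projection, each image patch or audio frame becomes a `D_MODEL`-wide vector, so it can sit in the same input sequence as text tokens with no separate encoder network in between.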

In terms of performance, Gemma 4 12B comes close to Google's larger 26B MoE model and shows strong multi-step reasoning and agentic-workflow capabilities across several benchmarks. The model also ships with Multi-Token Prediction (MTP) drafters, which predict several tokens at once to speed up inference. To date, the Gemma 4 series has passed 150 million total downloads, a sign of the developer community's enthusiasm for these open-source models.
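Drafting only speeds things up if the main model still decides which tokens are kept. The article does not describe how Gemma 4's MTP drafters are wired in, so the sketch below uses the generic greedy draft-and-verify loop from speculative decoding as an assumed stand-in: a drafter proposes `k` tokens in one call, the main model checks them, and the longest agreeing prefix is kept. `toy_target` and `toy_drafter` are hypothetical toy models, not real ones.

```python
from typing import Callable, List, Tuple

Token = int
Target = Callable[[List[Token]], Token]              # main model: greedy next token
Drafter = Callable[[List[Token], int], List[Token]]  # drafter: k guessed tokens in one call


def speculative_step(target: Target, drafter: Drafter, context: List[Token], k: int) -> List[Token]:
    """One draft-and-verify round; returns the tokens accepted this round (always at least one)."""
    draft = drafter(context, k)
    accepted: List[Token] = []
    ctx = list(context)
    for proposed in draft:
        expected = target(ctx)  # real systems score all k positions in one batched forward pass
        if proposed != expected:
            accepted.append(expected)  # first mismatch: keep the main model's token and stop
            return accepted
        accepted.append(proposed)
        ctx.append(proposed)
    accepted.append(target(ctx))  # every draft accepted: the main model adds one bonus token
    return accepted


def generate(target: Target, drafter: Drafter, prompt: List[Token],
             max_new: int, k: int = 4) -> Tuple[List[Token], int]:
    """Greedy generation with drafting; returns (tokens, number of draft-and-verify rounds)."""
    out, rounds = list(prompt), 0
    while len(out) - len(prompt) < max_new:
        out.extend(speculative_step(target, drafter, out, k))
        rounds += 1
    return out[: len(prompt) + max_new], rounds


def greedy(target: Target, prompt: List[Token], max_new: int) -> List[Token]:
    """Reference: plain one-token-at-a-time greedy decoding."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(target(out))
    return out


# Toy models: the "main model" counts modulo 10; the drafter is right except when the true next token is 7.
def toy_target(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10


def toy_drafter(ctx: List[Token], k: int) -> List[Token]:
    guesses, last = [], ctx[-1]
    for _ in range(k):
        last = (last + 1) % 10
        guesses.append(0 if last == 7 else last)
    return guesses


if __name__ == "__main__":
    tokens, rounds = generate(toy_target, toy_drafter, [0], max_new=12)
    print(tokens, rounds)
```

With greedy verification the output is identical to ordinary greedy decoding; in this toy run, 12 new tokens take 3 draft-and-verify rounds. The sketch calls `toy_target` once per position for clarity, whereas a real system checks all drafted positions in a single forward pass, which is where the time is saved.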

Gemma 4 12B is open-sourced under the Apache 2.0 license, with weights available on platforms such as Hugging Face and Kaggle, and it is supported by inference frameworks including LM Studio, Ollama, MLX, SGLang and vLLM. Google's own AI Edge Gallery also supports edge deployment, and developers can run large-scale production deployments through Google Cloud services such as Model Garden, Cloud Run and GKE.

Key Points:
🌟 Gemma 4 12B processes visual and audio data directly, without traditional encoders, and needs only 16GB of VRAM or unified memory.
⚡ A lightweight embedding layer greatly reduces computational complexity, while performance approaches that of Google's larger 26B MoE model.
📈 The Gemma 4 series has passed 150 million cumulative downloads; the model supports multiple inference frameworks as well as edge deployment and is popular with developers.

OpenAI Restricts Release of GPT-5.0: Federal Regulation Intervenes, Access Requires Government-by-Government Approval

At the request of the Trump administration, OpenAI has changed its GPT-5.0 release plan: instead of a public launch, the model will be opened only to a small number of close partners under a government-by-government approval model. If this restricted phase goes smoothly, full deployment will begin within a few weeks.

16GB of Memory, Instant Local Responses! Google's Gemma 4 12B and Its Revolutionary Encoder-Free Architecture Ignite the Open-Source Community

Google has released Gemma 4 12B, a new multimodal model that breaks with the traditional architecture by eliminating the separate encoder component, enabling efficient local deployment and inference on consumer-level hardware. This change significantly reduces the computational complexity of multimodal models, speeds up processing, and marks a new stage for the open-source large-model ecosystem.

A More Human Playing Experience! Free and Open-Source AI Chess Engine Maia 3 Officially Released

The Maia Chess team has released Maia 3, an open-source chess engine trained on 250 million human games. It has an Elo rating of approximately 1800, nearly 300 points higher than the previous version. The engine is free, supports local deployment, and focuses on imitating human decision-making patterns, helping bring AI chess engines to a wider audience.
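To put the reported jump in perspective: under the standard Elo model, a player rated 300 points higher is expected to score about 0.85 per game (counting a win as 1 and a draw as 0.5). The sketch below treats the reported figures as ordinary Elo ratings on a single scale and takes the previous version as roughly 1500, which is only an approximation of "nearly 300 points" lower.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score of player A against player B (win = 1, draw = 0.5, loss = 0)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


if __name__ == "__main__":
    new, old = 1800, 1500  # "about 1800"; 1500 approximates "nearly 300 points" lower
    print(f"expected score of ~{new} vs ~{old}: {elo_expected_score(new, old):.2f}")
```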

Related Recommendations

Model Shrinks, Capabilities Remain: Sina VibeThinker-3B Brings a New Lightweight Approach to Open-Source AI Inference