Google has released its native multimodal embedding model, Gemini Embedding2, which maps text, images, videos, audio, and PDF documents into the same semantic vector space, aiming to simplify complex AI data-processing workflows and strengthen multimodal retrieval and understanding. This marks an important step in Google's move from single-text semantic representation to unified multimodal semantic modeling in embedding technology.


Previously, in July 2025, Google launched the text embedding model gemini-embedding-001, which supports over 100 languages and achieved leading results on the MTEB multilingual leaderboard. The newly released Gemini Embedding2 is still based on the Gemini architecture, but its capabilities have been further expanded: it can process five modalities, namely text, images, videos, audio, and PDF documents, mapping them into a unified vector space. This enables direct semantic comparison between different media content without relying on multiple models or additional processing steps, a capability of significant importance for applications such as semantic search, retrieval-augmented generation (RAG), sentiment analysis, and data clustering.
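Because all modalities land in one vector space, "direct semantic comparison" reduces to a distance computation between vectors. The sketch below uses toy three-dimensional vectors standing in for a text embedding and an image embedding returned by the model (real vectors default to 3072 dimensions); the helper name `cosine_similarity` is illustrative, not from any SDK.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors (1.0 = identical direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for a text embedding and an image embedding
# that the model would place in the same space.
text_vec = [0.1, 0.8, 0.3]
image_vec = [0.2, 0.7, 0.4]
print(round(cosine_similarity(text_vec, image_vec), 2))  # 0.98 — semantically close
```

The same comparison works between any pair of modalities, which is what lets one index serve text-to-image, text-to-video, and text-to-audio retrieval.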

In terms of input capabilities, the new model supports up to 8192 text tokens, four times the 2048-token limit of the previous model; it can process up to six PNG or JPEG images per request, with a maximum video duration of 120 seconds, and up to six pages of PDF documents. Notably, Gemini Embedding2 also supports native audio processing without the need for speech-to-text conversion, avoiding information loss during traditional transcription processes. Google also introduced "interleaved input" technology, allowing developers to mix multiple modalities in a single request, such as combining images with text descriptions, to better capture semantic relationships between different media types.
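The per-request limits above can be checked client-side before a call is made. In this sketch the limit constants come straight from the article, while `validate_request` itself is a hypothetical helper, not part of any official SDK:

```python
# Per-request input limits reported for Gemini Embedding2.
# validate_request is an illustrative helper, not an official API.
MAX_TEXT_TOKENS = 8192      # 4x the previous 2048-token limit
MAX_IMAGES = 6              # PNG or JPEG
MAX_VIDEO_SECONDS = 120
MAX_PDF_PAGES = 6
ALLOWED_IMAGE_TYPES = {"image/png", "image/jpeg"}

def validate_request(text_tokens=0, image_types=(), video_seconds=0, pdf_pages=0):
    """Return a list of limit violations for one interleaved request."""
    errors = []
    if text_tokens > MAX_TEXT_TOKENS:
        errors.append(f"text exceeds {MAX_TEXT_TOKENS} tokens")
    if len(image_types) > MAX_IMAGES:
        errors.append(f"more than {MAX_IMAGES} images")
    for mime in image_types:
        if mime not in ALLOWED_IMAGE_TYPES:
            errors.append(f"unsupported image type: {mime}")
    if video_seconds > MAX_VIDEO_SECONDS:
        errors.append(f"video longer than {MAX_VIDEO_SECONDS}s")
    if pdf_pages > MAX_PDF_PAGES:
        errors.append(f"PDF longer than {MAX_PDF_PAGES} pages")
    return errors

# An interleaved request mixing an image with a text description passes cleanly.
print(validate_request(text_tokens=1200, image_types=["image/png"]))  # []
```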


In terms of architecture, the model continues to use the Matryoshka Representation Learning (MRL) technique, dynamically adjusting vector dimensions through a hierarchical information structure. Its default embedding dimension is 3072, with optional configurations available at 1536 and 768, allowing developers to flexibly balance retrieval quality and storage costs.
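In practice, MRL lets a client shrink a stored vector by keeping only its leading components and re-normalizing, because the most important information is packed into the front of the vector. A minimal sketch with a toy vector (the real supported sizes are 3072, 1536, and 768; `truncate_embedding` is an illustrative helper):

```python
import math

def truncate_embedding(vec, dim):
    """Matryoshka-style truncation: keep the first `dim` components,
    then re-normalize to unit length so cosine similarity still behaves."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

# Toy 8-dim "full" embedding cut down to 4 dims; with the real model this
# would be 3072 -> 1536 or 768, trading retrieval quality for storage cost.
full = [0.4, 0.3, 0.2, 0.1, 0.05, 0.05, 0.02, 0.01]
small = truncate_embedding(full, 4)
print(len(small))                            # 4
print(round(sum(x * x for x in small), 6))   # 1.0 (unit length again)
```

Halving the dimension halves index storage and roughly halves per-query distance computation, which is why the quality/cost trade-off is left to the developer.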

Google's benchmark tests show that Gemini Embedding2 achieves leading performance in text, image, video, and speech tasks. For example, in text-video retrieval tasks, the model scores 68.8, surpassing Amazon Nova Multimodal Embeddings' score of 60.3 and Voyage Multimodal 3.5's score of 55.2; in text-image comparison tasks, its score is 93.4, significantly outperforming Amazon's model with a score of 84.0.

Currently, Gemini Embedding2 is available to developers through Gemini API and Vertex AI, and supports integration with mainstream frameworks and vector databases such as LangChain, LlamaIndex, Haystack, Weaviate, Qdrant, ChromaDB, and Vector Search. Google also provides interactive Colab notebooks and lightweight multimodal semantic search demos to help developers quickly test the model's capabilities.
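The listed vector databases all consume the model's output the same way: embeddings go in at index time and a query embedding comes back out at search time. The brute-force in-memory store below is a stand-in for that integration point, purely illustrative (a real deployment would use Weaviate, Qdrant, ChromaDB, or similar, with an approximate-nearest-neighbor index rather than a linear scan):

```python
import math

class TinyVectorStore:
    """Brute-force in-memory stand-in for a vector database such as
    Weaviate or Qdrant; illustrative only, not a production index."""

    def __init__(self):
        self._items = []  # list of (id, vector) pairs

    def add(self, item_id, vector):
        self._items.append((item_id, vector))

    def query(self, vector, top_k=3):
        """Return the ids of the top_k most similar stored vectors."""
        def cos(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            na = math.sqrt(sum(x * x for x in a))
            nb = math.sqrt(sum(x * x for x in b))
            return dot / (na * nb)
        ranked = sorted(self._items, key=lambda it: cos(vector, it[1]), reverse=True)
        return [item_id for item_id, _ in ranked[:top_k]]

# Toy embeddings standing in for model output (real vectors: 3072 dims).
# Because all modalities share one space, images, videos, and PDFs can
# sit in the same index and be retrieved with a text-query embedding.
store = TinyVectorStore()
store.add("cat photo", [0.9, 0.1, 0.0])
store.add("dog video", [0.1, 0.9, 0.0])
store.add("tax form pdf", [0.0, 0.1, 0.9])
print(store.query([0.8, 0.2, 0.1], top_k=1))  # ['cat photo']
```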


Notably, competition in the multimodal embedding field is intensifying. In late February this year, the AI search engine Perplexity released the open-source embedding models pplx-embed-v1 and pplx-embed-context-v1.