Google has released its native multimodal embedding model, Gemini Embedding2, which maps text, images, videos, audio, and PDF documents into the same semantic vector space, aiming to simplify complex AI data-processing workflows and strengthen multimodal retrieval and understanding. This marks an important step in Google's move from single-text semantic representation to unified multimodal semantic modeling in embedding technology.


Previously, in July 2025, Google launched the text embedding model gemini-embedding-001, which supports over 100 languages and achieved leading results on the MTEB multilingual leaderboard. The newly released Gemini Embedding2 is still based on the Gemini architecture, but its capabilities have been further expanded: it can process five modalities, namely text, images, videos, audio, and PDF documents, mapping them into a unified vector space. This enables direct semantic comparison between different media content without relying on multiple models or additional processing steps, a capability of significant importance for applications such as semantic search, retrieval-augmented generation (RAG), sentiment analysis, and data clustering.
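Because all modalities land in one vector space, "direct semantic comparison" reduces to a distance computation between vectors. The sketch below uses toy three-dimensional vectors standing in for a text embedding and an image embedding returned by the model (real vectors default to 3072 dimensions); the helper name `cosine_similarity` is illustrative, not from any SDK.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors (1.0 = identical direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for a text embedding and an image embedding
# that the model would place in the same space.
text_vec = [0.1, 0.8, 0.3]
image_vec = [0.2, 0.7, 0.4]
print(round(cosine_similarity(text_vec, image_vec), 2))  # 0.98 — semantically close
```

The same comparison works between any pair of modalities, which is what lets one index serve text-to-image, text-to-video, and text-to-audio retrieval.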

In terms of input capabilities, the new model supports up to 8192 text tokens, four times the 2048-token limit of the previous model; it can process up to six PNG or JPEG images per request, with a maximum video duration of 120 seconds, and up to six pages of PDF documents. Notably, Gemini Embedding2 also supports native audio processing without the need for speech-to-text conversion, avoiding information loss during traditional transcription processes. Google also introduced "interleaved input" technology, allowing developers to mix multiple modalities in a single request, such as combining images with text descriptions, to better capture semantic relationships between different media types.
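The per-request limits above can be checked client-side before a call is made. In this sketch the limit constants come straight from the article, while `validate_request` itself is a hypothetical helper, not part of any official SDK:

```python
# Per-request input limits reported for Gemini Embedding2.
# validate_request is an illustrative helper, not an official API.
MAX_TEXT_TOKENS = 8192      # 4x the previous 2048-token limit
MAX_IMAGES = 6              # PNG or JPEG
MAX_VIDEO_SECONDS = 120
MAX_PDF_PAGES = 6
ALLOWED_IMAGE_TYPES = {"image/png", "image/jpeg"}

def validate_request(text_tokens=0, image_types=(), video_seconds=0, pdf_pages=0):
    """Return a list of limit violations for one interleaved request."""
    errors = []
    if text_tokens > MAX_TEXT_TOKENS:
        errors.append(f"text exceeds {MAX_TEXT_TOKENS} tokens")
    if len(image_types) > MAX_IMAGES:
        errors.append(f"more than {MAX_IMAGES} images")
    for mime in image_types:
        if mime not in ALLOWED_IMAGE_TYPES:
            errors.append(f"unsupported image type: {mime}")
    if video_seconds > MAX_VIDEO_SECONDS:
        errors.append(f"video longer than {MAX_VIDEO_SECONDS}s")
    if pdf_pages > MAX_PDF_PAGES:
        errors.append(f"PDF longer than {MAX_PDF_PAGES} pages")
    return errors

# An interleaved request mixing an image with a text description passes cleanly.
print(validate_request(text_tokens=1200, image_types=["image/png"]))  # []
```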


In terms of architecture, the model continues to use the Matryoshka Representation Learning (MRL) technique, dynamically adjusting vector dimensions through a hierarchical information structure. Its default embedding dimension is 3072, with optional configurations available at 1536 and 768, allowing developers to flexibly balance retrieval quality and storage costs.
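In practice, MRL lets a client shrink a stored vector by keeping only its leading components and re-normalizing, because the most important information is packed into the front of the vector. A minimal sketch with a toy vector (the real supported sizes are 3072, 1536, and 768; `truncate_embedding` is an illustrative helper):

```python
import math

def truncate_embedding(vec, dim):
    """Matryoshka-style truncation: keep the first `dim` components,
    then re-normalize to unit length so cosine similarity still behaves."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

# Toy 8-dim "full" embedding cut down to 4 dims; with the real model this
# would be 3072 -> 1536 or 768, trading retrieval quality for storage cost.
full = [0.4, 0.3, 0.2, 0.1, 0.05, 0.05, 0.02, 0.01]
small = truncate_embedding(full, 4)
print(len(small))                            # 4
print(round(sum(x * x for x in small), 6))   # 1.0 (unit length again)
```

Halving the dimension halves index storage and roughly halves per-query distance computation, which is why the quality/cost trade-off is left to the developer.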

Google's benchmark tests show that Gemini Embedding2 achieves leading performance in text, image, video, and speech tasks. For example, in text-video retrieval tasks, the model scores 68.8, surpassing Amazon Nova Multimodal Embeddings' score of 60.3 and Voyage Multimodal 3.5's score of 55.2; in text-image comparison tasks, its score is 93.4, significantly outperforming Amazon's model with a score of 84.0.

Currently, Gemini Embedding2 is available to developers through Gemini API and Vertex AI, and supports integration with mainstream frameworks and vector databases such as LangChain, LlamaIndex, Haystack, Weaviate, Qdrant, ChromaDB, and Vector Search. Google also provides interactive Colab notebooks and lightweight multimodal semantic search demos to help developers quickly test the model's capabilities.
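The listed vector databases all consume the model's output the same way: embeddings go in at index time and a query embedding comes back out at search time. The brute-force in-memory store below is a stand-in for that integration point, purely illustrative (a real deployment would use Weaviate, Qdrant, ChromaDB, or similar, with an approximate-nearest-neighbor index rather than a linear scan):

```python
import math

class TinyVectorStore:
    """Brute-force in-memory stand-in for a vector database such as
    Weaviate or Qdrant; illustrative only, not a production index."""

    def __init__(self):
        self._items = []  # list of (id, vector) pairs

    def add(self, item_id, vector):
        self._items.append((item_id, vector))

    def query(self, vector, top_k=3):
        """Return the ids of the top_k most similar stored vectors."""
        def cos(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            na = math.sqrt(sum(x * x for x in a))
            nb = math.sqrt(sum(x * x for x in b))
            return dot / (na * nb)
        ranked = sorted(self._items, key=lambda it: cos(vector, it[1]), reverse=True)
        return [item_id for item_id, _ in ranked[:top_k]]

# Toy embeddings standing in for model output (real vectors: 3072 dims).
# Because all modalities share one space, images, videos, and PDFs can
# sit in the same index and be retrieved with a text-query embedding.
store = TinyVectorStore()
store.add("cat photo", [0.9, 0.1, 0.0])
store.add("dog video", [0.1, 0.9, 0.0])
store.add("tax form pdf", [0.0, 0.1, 0.9])
print(store.query([0.8, 0.2, 0.1], top_k=1))  # ['cat photo']
```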


Notably, competition in the multimodal embedding field is intensifying. In late February this year, the AI search engine Perplexity released the open-source embedding models pplx-embed-v1 and pplx-embed-context-v1.