Google officially launched Gemini Embedding2 around March 10, 2026, as the company's first fully multimodal embedding model built on the Gemini architecture. The model is available in Public Preview on the Gemini API and Vertex AI, so developers can start calling it right away.
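As a rough sketch of what a first call might look like through the google-genai Python SDK (the model identifier gemini-embedding-2 below is an assumption; use the name published in the Public Preview documentation):

```python
# Minimal sketch: request an embedding through the Gemini API.
# The model id "gemini-embedding-2" is assumed, not confirmed.
from google import genai

client = genai.Client(api_key="YOUR_API_KEY")

result = client.models.embed_content(
    model="gemini-embedding-2",           # assumed model id
    contents="A short sentence to embed",
)
print(len(result.embeddings[0].values))   # dimensionality of the returned vector
```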
Unified Embedding Space, Breaking Modal Barriers
The core innovation of Gemini Embedding2 is that it maps diverse data types, including text, images, video, audio, and documents (such as PDFs), into a single unified embedding space. This makes cross-modal retrieval and classification possible out of the box and, with support for more than 100 languages, lets data from different modalities genuinely "speak the same language."
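A minimal sketch of what cross-modal retrieval in a shared space looks like in practice: embed a text query and an image with the same model, then rank by cosine similarity. Passing an image Part to embed_content is an assumption based on the article's description; the actual preview API may expose multimodal input differently.

```python
# Cross-modal retrieval sketch: one similarity score ranks an image
# against a text query because both vectors live in the same space.
import numpy as np
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")
MODEL = "gemini-embedding-2"  # assumed model id

def embed(content):
    resp = client.models.embed_content(model=MODEL, contents=content)
    return np.array(resp.embeddings[0].values)

query_vec = embed("a dog catching a frisbee on the beach")

with open("photo.jpg", "rb") as f:
    image_part = types.Part.from_bytes(data=f.read(), mime_type="image/jpeg")
image_vec = embed(image_part)  # image input assumed to be supported

score = float(query_vec @ image_vec /
              (np.linalg.norm(query_vec) * np.linalg.norm(image_vec)))
print(f"text-image similarity: {score:.3f}")
```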

Mixed Input Capabilities, Precisely Capturing Semantic Relationships
The model natively supports mixed-modal input, such as images combined with text or video combined with audio. Instead of processing each medium in parallel, it captures the semantic relationships between them, a qualitative step forward in multimedia content understanding.
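A sketch of a mixed-modal request, assuming an image and its accompanying text can be sent as parts of a single piece of content so the model embeds them jointly rather than as two separate vectors. The multi-part input shape mirrors how generate_content accepts mixed content and is not confirmed for the embedding endpoint.

```python
# Mixed-modal input sketch: image + caption embedded as one item.
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")

with open("slide.png", "rb") as f:
    image_part = types.Part.from_bytes(data=f.read(), mime_type="image/png")

mixed_content = types.Content(parts=[
    image_part,
    types.Part.from_text(text="Quarterly revenue chart; the caption explains the Q3 dip"),
])

resp = client.models.embed_content(
    model="gemini-embedding-2",  # assumed model id
    contents=mixed_content,      # multi-part input is an assumption
)
print(len(resp.embeddings[0].values))
```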
Native Audio Processing, No Need for ASR Transcription
Another major breakthrough is direct audio embedding: raw audio files can be fed to the model, which returns high-quality embedding vectors without a prior speech-to-text (ASR) step. This simplifies multimodal data pipelines and noticeably reduces both latency and compute cost.
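A sketch of embedding an audio clip directly, with no transcription stage in the pipeline. Audio input to embed_content is assumed from the article's description of the feature; verify the accepted MIME types and input shape against the Public Preview reference.

```python
# Direct audio embedding sketch: no ASR step before embedding.
from google import genai
from google.genai import types

client = genai.Client(api_key="YOUR_API_KEY")

with open("meeting_clip.mp3", "rb") as f:
    audio_part = types.Part.from_bytes(data=f.read(), mime_type="audio/mp3")

resp = client.models.embed_content(
    model="gemini-embedding-2",  # assumed model id
    contents=audio_part,         # audio input assumed to be supported
)
audio_vec = resp.embeddings[0].values
print(f"audio embedding with {len(audio_vec)} dimensions")
```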
A Wide Range of Application Scenarios, Marking a New Era for RAG
