Google has introduced a new capability for its Gemini 2.5 AI models: "Conversational Image Segmentation," which lets users select and highlight image content directly through natural-language prompts. The technology goes beyond traditional image segmentation, enabling Gemini to understand and respond to more complex, semantic instructions.
Beyond Traditional Methods: Understanding Abstractions and Relationships
Traditional image segmentation usually focuses on identifying fixed categories of objects such as "dogs," "cars," or "chairs." Gemini, by contrast, can apply far more complex language to specific parts of an image. It is capable of handling:
- Relational queries, for example, "a person with an umbrella."
- Logic-based instructions, for example, "all people not sitting."
- Abstract concepts, even ones without clear visual outlines, such as "clutter" or "damage."
In addition, thanks to built-in text recognition, Gemini can identify image elements that require reading text within the picture, such as the label "cashew candy" in a display case. The feature supports multilingual prompts and can return object labels in other languages (such as French) on request.
Wide Applications: From Design to Safety and Insurance
Google states that the technology has broad practical value across multiple fields:
- Image editing: designers no longer need a mouse or selection tools; they can precisely select a target area simply by describing it, such as "select the shadow of the building."
- Workplace safety: Gemini can scan photos or videos and automatically flag violations, for example, "all people without helmets at the construction site."
- Insurance: claims adjusters can issue commands such as "highlight all buildings damaged by the storm" to automatically mark damaged structures in aerial images, significantly reducing manual inspection time.
Developer-Friendly: API Access and Optimization Tips
This capability does not require a separate, standalone model. Developers can access conversational image segmentation directly through the Gemini API, with requests handled by Gemini models that support the feature.
Results are returned as JSON containing the coordinates (`box_2d`), pixel masks (`mask`), and descriptive labels (`label`) of the selected image regions, making them straightforward to consume in downstream code.
For best results, Google recommends using the `gemini-2.5-flash` model and setting the `thinkingBudget` parameter to zero to trigger an immediate response. Developers can run preliminary tests in Google AI Studio or a Python Colab notebook.
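A minimal request sketch using the google-genai Python SDK is shown below; it assumes a `GEMINI_API_KEY` environment variable and a local `sample.jpg`, and the prompt wording is one plausible phrasing rather than a required format:

```python
from google import genai
from google.genai import types

client = genai.Client()  # reads the GEMINI_API_KEY environment variable

with open("sample.jpg", "rb") as f:
    image_bytes = f.read()

prompt = (
    "Give the segmentation masks for all people not wearing helmets. "
    "Output a JSON list of masks where each entry contains the 2D "
    "bounding box in 'box_2d', the segmentation mask in 'mask', and "
    "a text label in 'label'."
)

response = client.models.generate_content(
    model="gemini-2.5-flash",  # model Google recommends for this feature
    contents=[
        types.Part.from_bytes(data=image_bytes, mime_type="image/jpeg"),
        prompt,
    ],
    # A thinking budget of zero triggers an immediate response, per
    # Google's optimization tip.
    config=types.GenerateContentConfig(
        thinking_config=types.ThinkingConfig(thinking_budget=0)
    ),
)

print(response.text)  # JSON payload, e.g. for parse_segmentation(...) above
```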