As artificial intelligence moves toward real-time interaction, JD.com has officially open-sourced one of its core achievements: JoyAI-VL-Interaction, a vision-language model for real-time video interaction. Billed as the world's first fully open-source interactive vision model, it has received in-depth support from vLLM-Omni and marks a shift for AI assistants from traditional "passive response" to an autonomous "watch and speak" mode of observation.

Previously, video processing began only after the user asked a question, which introduced lag. JoyAI-VL-Interaction takes the initiative instead: it continuously observes the video stream and decides when to join the conversation and when to stay silent, making interaction more natural and fluid.
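To make the "speak or stay silent" idea concrete, here is a minimal sketch of a per-frame gate. It is not how JoyAI-VL-Interaction actually decides; the salience scores, the `threshold`, the `cooldown`, and the `decide` function are all hypothetical and only illustrate the general pattern of speaking when something notable happens while avoiding constant chatter.

```python
def decide(saliences, threshold=0.6, cooldown=2):
    """Return 'speak' or 'silent' for each frame.

    saliences: hypothetical per-frame scores in [0, 1] for how notable the frame is.
    threshold: hypothetical minimum score required to speak.
    cooldown:  hypothetical number of frames to stay quiet after speaking.
    """
    decisions = []
    last_spoke = None  # index of the most recent frame that triggered speech
    for i, score in enumerate(saliences):
        # Speaking is allowed only once the cooldown has fully elapsed.
        ready = last_spoke is None or i - last_spoke > cooldown
        if score >= threshold and ready:
            decisions.append("speak")
            last_spoke = i
        else:
            decisions.append("silent")
    return decisions
```

With scores `[0.1, 0.9, 0.8, 0.2, 0.7]` and the defaults, the gate speaks on the second frame, suppresses the third because it falls inside the cooldown, and speaks again on the fifth.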


This real-time responsiveness is crucial for handling dynamic information. Traditional video understanding is often constrained by an "upload first, then analyze" workflow, which struggles to meet the needs of latency-sensitive scenarios such as security monitoring, live-stream commentary, and operation guidance. JoyAI-VL-Interaction processes the video stream as it arrives, keeping its responses in sync with changes in the picture.

A further technical highlight is its "background delegation" mechanism. For heavyweight tasks such as code generation, complex reasoning, or tool calls, the model can offload the work to a background Agent system while the front-end model keeps observing the scene in real time. This parallel workflow of observing while interacting lets the AI assistant keep communicating with the user without interruption while complex logic runs.
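The delegation pattern described above can be sketched with Python's `asyncio`. This is an illustration under stated assumptions, not the project's implementation: `background_agent`, `observe`, `run_demo`, the frame names, and the sleep durations are all hypothetical stand-ins for a slow Agent and a live video loop.

```python
import asyncio


async def background_agent(task: str) -> str:
    """Hypothetical stand-in for a slow background Agent (code generation, tool calls)."""
    await asyncio.sleep(0.05)  # simulate slow work
    return f"done: {task}"


async def observe(frames, heavy_at: int):
    """Front-end loop: keeps observing frames while a delegated task runs."""
    log = []
    pending = None
    for i, frame in enumerate(frames):
        if i == heavy_at:
            # Offload the heavy task without waiting for its result.
            pending = asyncio.create_task(background_agent(f"analyze {frame}"))
            log.append(f"delegated at {frame}")
        log.append(f"observed {frame}")
        # Pick up the background result once it is ready.
        if pending is not None and pending.done():
            log.append(pending.result())
            pending = None
        await asyncio.sleep(0.02)  # simulated frame interval
    if pending is not None:
        # The stream ended first; wait for the outstanding task.
        log.append(await pending)
    return log


def run_demo():
    return asyncio.run(observe(["f0", "f1", "f2", "f3", "f4", "f5"], heavy_at=1))
```

The key design choice is that the loop never awaits the delegated task inline: it only checks whether the task has finished, so observation of later frames proceeds while the background work is still in flight.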

In terms of compatibility and extensibility, the model supports video input sources such as cameras, live streams, and surveillance feeds, and lets developers swap out ASR, TTS, long-term memory modules, or external APIs to suit business needs.
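One common way to make components such as ASR and TTS replaceable is to code against small interfaces. The sketch below uses Python's `typing.Protocol`; the `ASR`, `TTS`, and `Pipeline` names and their methods are hypothetical and do not reflect the project's actual interfaces.

```python
from dataclasses import dataclass
from typing import Protocol


class ASR(Protocol):
    """Hypothetical speech-to-text interface."""
    def transcribe(self, audio: bytes) -> str: ...


class TTS(Protocol):
    """Hypothetical text-to-speech interface."""
    def synthesize(self, text: str) -> bytes: ...


class EchoASR:
    """Toy ASR: treats the audio bytes as UTF-8 text."""
    def transcribe(self, audio: bytes) -> str:
        return audio.decode("utf-8")


class UpperTTS:
    """Toy TTS: returns the text upper-cased as bytes."""
    def synthesize(self, text: str) -> bytes:
        return text.upper().encode("utf-8")


class PlainTTS:
    """Alternative toy TTS: returns the text unchanged as bytes."""
    def synthesize(self, text: str) -> bytes:
        return text.encode("utf-8")


@dataclass
class Pipeline:
    asr: ASR
    tts: TTS

    def respond(self, audio: bytes) -> bytes:
        # The pipeline depends only on the interfaces, not on concrete classes.
        text = self.asr.transcribe(audio)
        return self.tts.synthesize(f"heard: {text}")
```

Swapping `Pipeline(EchoASR(), UpperTTS())` for `Pipeline(EchoASR(), PlainTTS())` changes the output voice without touching the pipeline itself, which is the kind of replaceability the article describes.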