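Recently, the Meta AI team introduced LongVU, a novel spatiotemporal adaptive compression mechanism designed to improve language understanding of long videos. Conventional multimodal large language models (MLLMs) are constrained by context length when processing long videos, and LongVU was created to address this problem.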

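LongVU makes efficient use of the available context length mainly by filtering out duplicate frames and compressing tokens across frames, reducing redundant information in the video while preserving its visual detail.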


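Specifically, the team uses DINOv2 features to discard highly similar, redundant frames. A text-guided cross-modal query is then used to selectively reduce the features of the remaining frames.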
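The two steps described here can be illustrated with a minimal sketch. It is not LongVU's implementation: it assumes each frame has already been reduced to a single feature vector (for example, by a DINOv2 encoder) and that the text query has been embedded in the same space. The function names, the similarity threshold of 0.9, and the `budget` parameter are hypothetical choices made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def drop_redundant_frames(features, threshold=0.9):
    """Return the indices of frames to keep.

    A frame is dropped when its feature vector is at least `threshold`
    similar to the most recently kept frame. The threshold is an
    illustrative assumption, not a value taken from LongVU.
    """
    kept = []
    for i, feat in enumerate(features):
        if not kept or cosine(features[kept[-1]], feat) < threshold:
            kept.append(i)
    return kept


def pick_full_detail_frames(frame_feats, query_feat, budget):
    """Rank frames by similarity to the text query.

    Returns the (sorted) positions of the `budget` most relevant frames.
    In a fuller pipeline these frames would keep all of their tokens,
    while the others would be reduced to fewer tokens.
    """
    ranked = sorted(
        range(len(frame_feats)),
        key=lambda i: cosine(frame_feats[i], query_feat),
        reverse=True,
    )
    return sorted(ranked[:budget])


# Toy 2-D "features": frames 1 and 3 are near-duplicates of frames 0 and 2.
frames = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.02, 1.0], [0.7, 0.7]]
kept = drop_redundant_frames(frames)
query = [0.0, 1.0]  # hypothetical text-query embedding
relevant = pick_full_detail_frames([frames[i] for i in kept], query, budget=1)
print(kept, [kept[i] for i in relevant])
```

Comparing each frame only with the last kept frame keeps the sketch linear in the number of frames. A real system could instead compare frames within a window or against several references.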

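In addition, LongVU compresses spatial tokens based on the temporal dependencies between frames. This compression strategy allows LongVU to process a large number of frames within a limited context length, with almost no loss of visual information.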
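One way to picture spatial token compression driven by temporal dependencies is sketched below. It is only an illustration under stated assumptions. Frames are grouped into a short window, the first frame serves as an anchor that keeps every token, and a token in a later frame is kept only if it differs enough from the anchor token at the same spatial position. The window layout, the function name `compress_window`, and the threshold of 0.8 are hypothetical and not taken from LongVU.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def compress_window(window, threshold=0.8):
    """Drop tokens that barely change relative to the anchor frame.

    `window` is a list of frames; each frame is a list of token vectors,
    one per spatial position. Returns a list of
    (frame_index, position, token) tuples for the tokens that are kept.
    """
    anchor = window[0]
    # The anchor frame keeps all of its tokens.
    kept = [(0, pos, tok) for pos, tok in enumerate(anchor)]
    for f, frame in enumerate(window[1:], start=1):
        for pos, tok in enumerate(frame):
            # Keep a token only if this position changed noticeably.
            if cosine(anchor[pos], tok) < threshold:
                kept.append((f, pos, tok))
    return kept


# Two frames with two spatial positions each: position 0 is static,
# position 1 changes completely in the second frame.
window = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 0.01], [1.0, 0.0]],
]
kept_tokens = compress_window(window)
print([(f, pos) for f, pos, _ in kept_tokens])
```

In this toy window, three of the four tokens survive. For mostly static footage the saving grows with the window length, because unchanged regions are stored only once.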

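Across a range of video understanding benchmarks, LongVU outperforms existing methods, particularly on tasks that require understanding videos up to an hour long, such as VideoMME and MLVU. Even when paired with a lightweight LLM, LongVU still delivers strong performance at a smaller model size.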

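In LongVU's demo examples, users can obtain a detailed description of a video by asking a simple question. For example, when a user asks "What is happening in this video?", LongVU accurately describes the scenes, the characters, and how they interact. Whether the question is a simple query about objects in the video or a request for a complex plot description, LongVU responds quickly and accurately.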

Official example:

User prompt: Please provide a detailed description of the video.

LongVU's response: The video begins with a scene set in a futuristic cityscape at night, where a humanoid robot stands in the foreground, its back to the camera, against a backdrop of illuminated buildings and a helicopter in flight. The setting is characterized by a blend of modern and cyberpunk elements, with neon signs and a wet street reflecting the city lights. As the video progresses, the robot remains stationary, with the helicopter moving closer, eventually hovering above the street, indicating a possible interaction or event unfolding between the robot and the helicopter.

The scene then shifts to an interior setting, likely a workshop or garage, filled with various electronic equipment and machinery. The robot is now in the foreground, facing the camera, with a person in the background, suggesting a narrative progression. The environment is cluttered, with the lighting casting a blue hue over the scene, emphasizing the technological theme.

As the video continues, the camera angle changes to show the robot from behind, with the person in the background now more visible, indicating movement. The workshop setting remains consistent, with the clutter and electronic equipment unchanged. The lighting and color scheme continue to highlight the technological ambiance.

The perspective then shifts to a side view of the robot, revealing more of its structure and the cluttered workshop environment. The person in the background is now closer to the robot, suggesting interaction or movement towards it. The lighting and color scheme remain consistent, maintaining the technological atmosphere.

Finally, the camera zooms in on the robot, focusing on its upper body and the intricate details of its internal components. The cluttered workshop environment becomes less visible, with the lighting casting a blue hue over the scene, emphasizing the technological theme. The camera's close-up shot highlights the robot's design and the complexity of its internal mechanisms, suggesting a narrative focus on the robot's internal workings or its role within the technological setting.

The rapid growth of long-form video content calls for more efficient processing, and LongVU opens up new possibilities for multimodal understanding.

Project page: https://vision-cair.github.io/LongVU/

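Key takeaways:

1. 📽️ LongVU is a new spatiotemporal adaptive compression mechanism designed to improve language understanding of long videos.

2. 🔍 The technique uses DINOv2 features to discard redundant frames and a cross-modal query to selectively compress frame features.

3. 🚀 LongVU performs strongly across video understanding benchmarks and outperforms other methods on long-video understanding tasks in particular.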