VSTAT: A Benchmark for Visual State Tracking in Video for Multimodal Large Models

Saining Xie (@sainingxie)

2026-06-03 11:20 · 12 days ago

AI Summary

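The research team introduces VSTAT, a benchmark for evaluating how well multimodal large language models (MLLMs) track dynamic state in video. The tasks look simple, such as counting cups, recognizing typed text, and counting page turns. Humans complete them easily, but current MLLMs perform poorly. The benchmark aims to advance visual state tracking as a frontier research direction and to address the core challenge of building and updating an internal world state from incomplete, noisy visual observations.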

how does the brain build and track an internal state of the world from (possibly incomplete and noisy) visual observations? i believe visual state tracking will be the grand challenge for vision in the coming years, and i hope this benchmark can be a useful starting line. enjoy!

Sihyun Yu: Can MLLMs actually track what's happening in a video? Introducing VSTAT 🎯, our new benchmark for visual state tracking. The tasks are simple: count cups, read ...

Multimodal, Video, Evaluation/Benchmarks

View the original post on X
