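A research team has released VSTAT, a benchmark for evaluating how well multimodal large language models (MLLMs) track dynamic state in video. The tasks look simple, such as counting cups, recognizing typed text, and counting page turns, and humans complete them easily, yet current MLLMs perform poorly. The benchmark aims to advance visual state tracking as a frontier research direction, addressing the core challenge of building and updating an internal world state from incomplete, noisy visual observations.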
how does the brain build and track an internal state of the world from (possibly incomplete and noisy) visual observations? i believe visual state tracking will be the grand challenge for vision in the coming years, and i hope this benchmark can be a useful starting line. enjoy!