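SenseTime recently open-sourced SenseNova U1, whose core innovation lies in its architecture. The model abandons the traditional separate visual encoder and variational auto-encoder structure and uses a single shared representation space to process images and text natively, greatly reducing the information loss caused by conversions between modules. This design lets the model generate image and text content together coherently, giving it a clear advantage in dense visual content that demands high consistency, such as infographics, posters, and comics. On performance, its infographic generation is about twice as fast as Qwen-Image-2.0/Seedream-4.5 at comparable quality.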
Chinese AI lab SenseTime just open-sourced SenseNova U1, a unified multimodal model that can understand, reason over, and generate images and text within a single model.
The interesting part is the architecture: it removes the usual visual encoder and variational auto-encoder setup and handles image and language inside one shared representation space, instead of passing data between separate modules.
That means fewer handoffs between modules and less information loss, so the model can generate coherent text and images together in one flow.
That is why it is strong at dense visual content like infographics, guides, posters, comics, and step-by-step image-text workflows.
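To make the "shared representation space" idea concrete, here is a toy sketch. It is not SenseNova U1's code and makes no claim about how the model is actually built. It assumes, purely for illustration, that raw pixel patches go through one linear projection (standing in for "no separate visual encoder") and that text tokens come from a lookup table of the same width, so both end up in one sequence. Every name and size here (`patchify`, `to_shared_sequence`, `D`, `PATCH`, `VOCAB`) is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16       # shared embedding width (arbitrary toy value)
PATCH = 4    # patch side length in pixels (arbitrary toy value)
VOCAB = 100  # toy vocabulary size

# Toy parameters: one linear map for raw pixel patches, one lookup table for text.
patch_proj = rng.normal(size=(PATCH * PATCH * 3, D))
text_table = rng.normal(size=(VOCAB, D))


def patchify(image, patch=PATCH):
    """Split an (H, W, 3) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    grid = image.reshape(h // patch, patch, w // patch, patch, c)
    return grid.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * c)


def to_shared_sequence(image, token_ids):
    """Embed pixels and text into the same D-dim space and join them into one sequence."""
    image_tokens = patchify(image) @ patch_proj  # (num_patches, D)
    text_tokens = text_table[token_ids]          # (num_text_tokens, D)
    return np.concatenate([text_tokens, image_tokens], axis=0)


image = rng.random((8, 8, 3))
seq = to_shared_sequence(image, np.array([1, 5, 7]))
print(seq.shape)  # (7, 16): 3 text tokens + 4 image patches, all of width 16
```

The point of the sketch is only that once both modalities live in one sequence of same-width vectors, a single network can read and write them without a handoff between separate modules.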
For infographic generation specifically, it is also around 2x faster than Qwen-Image-2.0 / Seedream-4.5 while staying in roughly the same quality band, based on the client benchmark chart. 1/n