用于学习语义丰富视觉表征的文本条件JEPA · AI HOT

Apple Machine Learning Research（RSS）

精选69

用于学习语义丰富视觉表征的文本条件JEPA

2026-05-07 08:00·39天前

精选理由

Apple 这篇 TC-JEPA 把文本融入自监督视觉预训练，用稀疏交叉注意力减少预测不确定性，对多模态表征学习是个不错的思路补充，做视觉模型的值得一看。

AI 摘要

研究人员提出文本条件联合嵌入预测架构（TC-JEPA），通过引入图像描述文本作为条件信息来降低掩码特征预测中的视觉不确定性。该方法采用细粒度文本调节器，对输入文本标记计算稀疏交叉注意力，从而调制预测的图像补丁特征。与基于掩码特征预测的I-JEPA相比，TC-JEPA能够学习到语义更丰富的视觉表征，解决了原有方法因视觉不确定性导致的语义学习不足问题。

原文 · 未翻译

Text-Conditional JEPA for Learning Semantically Rich Visual Representations

Text-Conditional JEPA for Learning Semantically Rich Visual Representations

AuthorsChen Huang, Xianhang Li, Vimal Thilak, Etai Littwin, Josh Susskind

View publication

Copy Bibtex

Image-based Joint-Embedding Predictive Architecture (I-JEPA) offers a promising approach to visual self-supervised learning through masked feature prediction. However with the inherent visual uncertainty at masked positions, feature prediction remains challenging and may fail to learn semantic representations. In this work, we propose Text-Conditional JEPA (TC-JEPA) that uses image captions to reduce the prediction uncertainty. Specifically, we modulate the predicted patch features using a fine-grained text conditioner that computes sparse cross-attention over input text tokens. With such conditioning, patch features become predictable as a function of text, thus are more semantically meaningful. We show TC-JEPA improves downstream performance and training stability, with promising scaling properties. TC-JEPA also offers a new vision-language pretraining paradigm based on feature prediction only, outperforming contrastive methods on diverse tasks, especially those requiring fine-grained visual understanding and reasoning.

Related readings and updates.

Rethinking JEPA: Compute-Efficient Video SSL with Frozen Teachers

Rethinking JEPA: Compute-Efficient Video SSL with Frozen Teachers

October 8, 2025research area Computer Vision, research area Methods and Algorithmsconference ICLR

Video Joint Embedding Predictive Architectures (V-JEPA) learn generalizable off-the-shelf video representation by predicting masked regions in latent space with an exponential moving average (EMA)-updated teacher. While EMA prevents representation collapse, it complicates scalable model selection and couples teacher and student architectures. We revisit masked-latent prediction and show that a frozen teacher suffices. Concretely, we (i) train a…

How JEPA Avoids Noisy Features: The Implicit Bias of Deep Linear Self Distillation Networks

How JEPA Avoids Noisy Features: The Implicit Bias of Deep Linear Self Distillation Networks

November 20, 2024research area Computer Vision, research area Methods and Algorithmsconference NeurIPS

Two competing paradigms exist for self-supervised learning of data representations. Joint Embedding Predictive Architecture (JEPA) is a class of architectures in which semantically similar inputs are encoded into representations that are predictive of each other. A recent successful approach that falls under the JEPA framework is self-distillation, where an online encoder is trained to predict the output of the target encoder, sometimes using a…

Discover opportunities in Machine Learning.

Our research in machine learning breaks new ground every day.

Work with us

多模态数据/训练论文/研究

Apple Machine Learning Research（RSS）

精选69

用于学习语义丰富视觉表征的文本条件JEPA

2026-05-07 08:00·39天前

精选理由

Apple 这篇 TC-JEPA 把文本融入自监督视觉预训练，用稀疏交叉注意力减少预测不确定性，对多模态表征学习是个不错的思路补充，做视觉模型的值得一看。

AI 摘要

研究人员提出文本条件联合嵌入预测架构（TC-JEPA），通过引入图像描述文本作为条件信息来降低掩码特征预测中的视觉不确定性。该方法采用细粒度文本调节器，对输入文本标记计算稀疏交叉注意力，从而调制预测的图像补丁特征。与基于掩码特征预测的I-JEPA相比，TC-JEPA能够学习到语义更丰富的视觉表征，解决了原有方法因视觉不确定性导致的语义学习不足问题。

原文 · 保持原样，未翻译

Text-Conditional JEPA for Learning Semantically Rich Visual Representations

Text-Conditional JEPA for Learning Semantically Rich Visual Representations

AuthorsChen Huang, Xianhang Li, Vimal Thilak, Etai Littwin, Josh Susskind

View publication

Copy Bibtex

Image-based Joint-Embedding Predictive Architecture (I-JEPA) offers a promising approach to visual self-supervised learning through masked feature prediction. However with the inherent visual uncertainty at masked positions, feature prediction remains challenging and may fail to learn semantic representations. In this work, we propose Text-Conditional JEPA (TC-JEPA) that uses image captions to reduce the prediction uncertainty. Specifically, we modulate the predicted patch features using a fine-grained text conditioner that computes sparse cross-attention over input text tokens. With such conditioning, patch features become predictable as a function of text, thus are more semantically meaningful. We show TC-JEPA improves downstream performance and training stability, with promising scaling properties. TC-JEPA also offers a new vision-language pretraining paradigm based on feature prediction only, outperforming contrastive methods on diverse tasks, especially those requiring fine-grained visual understanding and reasoning.

Related readings and updates.

Rethinking JEPA: Compute-Efficient Video SSL with Frozen Teachers

Rethinking JEPA: Compute-Efficient Video SSL with Frozen Teachers

October 8, 2025research area Computer Vision, research area Methods and Algorithmsconference ICLR

Video Joint Embedding Predictive Architectures (V-JEPA) learn generalizable off-the-shelf video representation by predicting masked regions in latent space with an exponential moving average (EMA)-updated teacher. While EMA prevents representation collapse, it complicates scalable model selection and couples teacher and student architectures. We revisit masked-latent prediction and show that a frozen teacher suffices. Concretely, we (i) train a…