TLDR: LLM-JEPA introduces a novel training objective for Large Language Models (LLMs) that adapts the successful Joint Embedding Predictive Architectures (JEPAs) from vision to language. By combining traditional generative loss with an embedding-space JEPA objective, LLM-JEPA significantly outperforms standard LLM training in finetuning and pretraining across various models and datasets, while also demonstrating robustness to overfitting and inducing structured representations.
Large Language Models (LLMs) have become central to many AI applications, but they are trained primarily with input-space objectives: reconstructing or generating the next token. Computer vision has taken a different route. There, embedding-space training objectives such as Joint Embedding Predictive Architectures (JEPAs) have outperformed their input-space counterparts. This gap has led researchers to ask whether language models could benefit from vision-inspired training techniques.
A new research paper introduces LLM-JEPA, which brings JEPA-style objectives to Large Language Models. The approach applies to both finetuning and pretraining, and aims to improve representation quality without sacrificing generative ability.
The core idea behind LLM-JEPA is to combine the standard LLM generative loss (which predicts the next token) with an additional JEPA objective. This JEPA component works by ensuring that different ‘views’ of the same underlying knowledge can be predicted from each other in the embedding space. For instance, in tasks involving both natural language and code, the text description and the corresponding code can be treated as two distinct views of the same concept. By learning to predict one view’s embedding from another, LLM-JEPA encourages the model to learn more abstract and robust representations.
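The combined objective described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the paper's implementation: the function names, the `jepa_weight` parameter, and the choice of cosine distance as the embedding-space metric are all assumptions made for this example, and the embeddings are toy lists rather than real model outputs.

```python
import math

def cross_entropy(logits, target_index):
    """Next-token loss for one position: -log softmax(logits)[target_index]."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target_index]

def cosine_distance(u, v):
    """1 - cosine similarity; 0 when the two embeddings point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def llm_jepa_loss(token_logits, token_targets,
                  predicted_view_emb, target_view_emb,
                  jepa_weight=1.0):
    """Generative loss plus a weighted embedding-space prediction loss.

    token_logits / token_targets: per-position logits and gold token ids.
    predicted_view_emb: embedding of one view (e.g. code) as predicted from
        the other view (e.g. text). How it is produced is left out here.
    target_view_emb: the actual embedding of that view.
    jepa_weight: hypothetical balancing coefficient (an assumption).
    """
    generative = sum(
        cross_entropy(logits, target)
        for logits, target in zip(token_logits, token_targets)
    ) / len(token_targets)
    jepa = cosine_distance(predicted_view_emb, target_view_emb)
    return generative + jepa_weight * jepa

# Toy usage: two token positions, 3-word vocabulary, 2-d embeddings.
loss = llm_jepa_loss(
    token_logits=[[2.0, 0.5, 0.1], [0.2, 1.5, 0.3]],
    token_targets=[0, 1],
    predicted_view_emb=[0.9, 0.1],
    target_view_emb=[1.0, 0.0],
    jepa_weight=0.5,
)
print(f"combined loss: {loss:.4f}")
```

Setting `jepa_weight` to 0 recovers the plain next-token objective, which makes it easy to see the JEPA term as an add-on to standard training rather than a replacement for it.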
The researchers empirically validated LLM-JEPA across a wide range of models, including families like Llama3, OpenELM, Gemma2, and Olmo, and numerous datasets such as NL-RX, GSM8K, Spider, and RottenTomatoes. The findings consistently show that LLM-JEPA significantly outperforms standard LLM training objectives. Beyond improved accuracy, the method also demonstrates remarkable robustness to overfitting, a common challenge in deep learning.
For example, in finetuning experiments, LLM-JEPA led to substantial accuracy gains across various models and datasets. In pretraining scenarios, it also improved the quality of learned representations, which then translated to better performance in downstream finetuning tasks. The paper also highlights that LLM-JEPA induces a more structured representation space, suggesting that it helps the model learn more meaningful and organized embeddings for text and code.
While LLM-JEPA offers significant advancements, the authors acknowledge a current limitation: the training process incurs a 3-fold increase in compute cost due to the need for multiple forward passes to obtain representations of different views. Future work aims to mitigate this by exploring methods to evaluate the LLM-JEPA loss within a single forward pass.
This research is a first step in adapting vision-based self-supervised learning techniques to language models, with the aim of building more capable and robust AI systems. The full research paper is titled "LLM-JEPA: Large Language Models Meet Joint Embedding Predictive Architectures".


