Vision Language Models (VL-JEPA), Simply Explained

AI, But Simple Issue #93


Hello from the AI, but simple team! If you enjoy our content (with 10+ custom visuals), consider supporting us so we can keep doing what we do.

Running this newsletter isn’t free, so we rely on reader support to cover operational expenses. Thanks again for reading!


As we move forward with generative AI architectures, the industry is shifting from text-only interfaces toward multimodal systems capable of “seeing” the physical world.

At the core of this development are Vision Language Models (VLMs), combining computer vision (CV) and natural language processing (NLP) to interpret both images/videos and text simultaneously.

These VLMs are typically transformer-based, built on the same autoregressive generation mechanism that powers LLM chatbots like ChatGPT, Claude, and Gemini.

The main difference is that VLMs take in visual input alongside a text prompt and output a response one token at a time, a direct consequence of the autoregressive decoding used in modern transformers.
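As a toy illustration (every name, shape, and function here is invented for this sketch, not taken from any real VLM), the autoregressive loop behind that one-token-at-a-time decoding looks roughly like this:

```python
import numpy as np

def toy_vlm_step(image_features, token_ids):
    """Stand-in for one transformer forward pass: returns logits over a
    tiny vocabulary. A real VLM would attend over image patches and all
    previously generated text tokens."""
    vocab_size = 5
    # Deterministic pseudo-logits derived from the conditioning context.
    seed = int(abs(image_features.sum()) * 100) + sum(token_ids)
    return np.random.default_rng(seed).normal(size=vocab_size)

def generate(image_features, max_new_tokens=4, eos_id=0):
    """Greedy decoding: one full forward pass per emitted token."""
    tokens = [1]  # hypothetical <bos> token id
    for _ in range(max_new_tokens):
        logits = toy_vlm_step(image_features, tokens)
        next_id = int(np.argmax(logits))  # pick exactly one next token
        tokens.append(next_id)
        if next_id == eos_id:  # stop at end-of-sequence
            break
    return tokens
```

The point of the sketch is the loop structure: each new token requires a fresh forward pass conditioned on everything generated so far, which is where much of a VLM’s inference cost comes from.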

VLMs are quite powerful for handling these multimodal inputs, but they are expensive, and a surprising amount of that cost goes toward modeling and computing things that may not matter for the final output. In short, current VLMs are far from as optimized as they could be.

This article will break down VL-JEPA (Chen et al.), a new vision-language model architecture from Meta FAIR that takes a different approach than standard VLMs.

Instead of generating token-by-token, VL-JEPA predicts. This prediction does not occur in the token space where generation happens; rather, it occurs in the continuous embedding space itself, and the model only selectively decodes those embeddings into tokens/words.
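To make the contrast concrete, here is a minimal, hypothetical sketch of prediction in embedding space. The vocabulary, shapes, and nearest-neighbor decoding below are our own illustration under stated assumptions, not the paper’s actual method:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy embedding dimension

# Hypothetical vocabulary with unit-norm token embeddings.
vocab = ["cat", "dog", "car", "tree"]
token_table = rng.normal(size=(len(vocab), d))
token_table /= np.linalg.norm(token_table, axis=1, keepdims=True)

def predict_embedding(context_embeddings):
    """Stand-in predictor: outputs one continuous vector directly,
    rather than a probability distribution over tokens."""
    return context_embeddings.mean(axis=0)  # placeholder for a learned network

def decode_if_needed(embedding):
    """Only when words are actually required, map the predicted vector
    to its nearest token embedding (cosine similarity)."""
    sims = token_table @ (embedding / np.linalg.norm(embedding))
    return vocab[int(np.argmax(sims))]
```

The design point: training and inference can stay entirely in the continuous space, and a decoding step like `decode_if_needed` is invoked only when a textual answer must actually be produced.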

Put together, VL-JEPA paves the way for a model that trains faster, uses fewer parameters, and handles some tasks that standard VLMs couldn’t accomplish.
