Authors: Jian Hu
First published on: 2024/12/1
Multimodal large language models (MLLMs) are designed to process and integrate information from multiple modalities, such as text, images, and audio. Their architectures typically fuse these modalities through techniques such as cross-attention or token concatenation.
For visual large models, the first step is to convert images or videos into a format that the large language model (LLM) can consume. Typically, this involves dividing the images into patches, which are encoded by a Vision Transformer (ViT). The encoded patches are then passed through a projector MLP (multi-layer perceptron) that maps them into the LLM's embedding space. Finally, these visual tokens are fed into the LLM together with the text tokens. We will walk through this process step by step using Qwen2-VL, Llama3.2 Vision, and InternVL-2.5 as examples.
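To make the flow concrete, here is a minimal, hypothetical PyTorch sketch of the patch-encode-project-concatenate pipeline; the module names, hidden sizes, and the two-layer MLP projector are illustrative assumptions rather than the actual implementation of any of these models:

```python
import torch
import torch.nn as nn

class VisionToLLMBridge(nn.Module):
    """Illustrative sketch: ViT features -> MLP projector -> LLM embedding space."""

    def __init__(self, vit: nn.Module, vit_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.vit = vit                      # encodes image patches into vision features
        self.projector = nn.Sequential(     # maps ViT features into the LLM embedding space
            nn.Linear(vit_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patches: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        vision_feats = self.vit(patches)              # (num_patches, vit_dim)
        visual_tokens = self.projector(vision_feats)  # (num_patches, llm_dim)
        # The visual tokens are concatenated with the text token embeddings,
        # and the combined sequence is fed into the LLM.
        return torch.cat([visual_tokens, text_embeds], dim=0)
```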
In the image processor, Qwen2-VL processes the input image. Overall, Qwen2-VL does two things at this step:
1. Resize the image so that its height and width are aligned to the patch grid, while keeping the total pixel count within a minimum/maximum budget (see the sketch below).
2. Split the resized image into fixed-size patches and flatten them into the layout expected by the ViT.
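A simplified sketch of the resizing idea follows; the factor of 28 (patch_size * merge_size) and the default pixel bounds are assumptions taken from Qwen2-VL's released configuration, so treat the exact numbers as illustrative:

```python
import math

def smart_resize(height: int, width: int, factor: int = 28,
                 min_pixels: int = 56 * 56, max_pixels: int = 14 * 14 * 4 * 1280):
    """Round H and W to multiples of `factor`, keeping the pixel count in range."""
    h_bar = round(height / factor) * factor
    w_bar = round(width / factor) * factor
    if h_bar * w_bar > max_pixels:
        # Too many pixels: scale down, flooring to a multiple of `factor`
        beta = math.sqrt((height * width) / max_pixels)
        h_bar = math.floor(height / beta / factor) * factor
        w_bar = math.floor(width / beta / factor) * factor
    elif h_bar * w_bar < min_pixels:
        # Too few pixels: scale up, ceiling to a multiple of `factor`
        beta = math.sqrt(min_pixels / (height * width))
        h_bar = math.ceil(height * beta / factor) * factor
        w_bar = math.ceil(width * beta / factor) * factor
    return h_bar, w_bar

print(smart_resize(1080, 1920))  # a 1080p frame is snapped to factor-aligned dimensions
```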
Some models, such as Llama3.2 Vision and InternVL-2.5, instead first find the closest aspect ratio from a set of candidate tile grids and then resize the image to match it.
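As a rough sketch of how that selection can work (modeled on InternVL's open-source preprocessing; the candidate tile grids and the 448-pixel tile size are assumptions taken from its released code, not a verbatim copy):

```python
def find_closest_aspect_ratio(width: int, height: int, max_tiles: int = 12):
    """Pick the tile grid (cols, rows) whose aspect ratio best matches the input image."""
    aspect_ratio = width / height
    # Enumerate candidate grids: every (cols, rows) with cols * rows <= max_tiles
    candidates = sorted({(c, r) for c in range(1, max_tiles + 1)
                         for r in range(1, max_tiles + 1) if c * r <= max_tiles})
    best, best_diff = (1, 1), float("inf")
    for cols, rows in candidates:
        diff = abs(aspect_ratio - cols / rows)
        if diff < best_diff:
            best, best_diff = (cols, rows), diff
    return best

cols, rows = find_closest_aspect_ratio(1280, 720)
# The image would then be resized to (cols * 448, rows * 448) and cut into cols * rows tiles.
print(cols, rows)
```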
What if the content we want to process is not an image but a video? What additional processing is needed? Qwen2-VL segments the patches along the temporal dimension as well; temporal_patch_size is used to capture the video's temporal features:
```python
self.temporal_patch_size = 2  # every two consecutive frames form one temporal patch
channel * self.temporal_patch_size * self.patch_size * self.patch_size  # dimension of one flattened patch
```
For a video, the number of patches sent to the ViT can be described as:
grid_t * grid_h * grid_w
Therefore, in the image processor, you can see:
```python
# This is key to understanding what the image processor is doing
flatten_patches = patches.reshape(
    grid_t * grid_h * grid_w, channel * self.temporal_patch_size * self.patch_size * self.patch_size
)
```
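As a concrete, simplified illustration of this reshape, here is a small NumPy example; patch_size = 14 and temporal_patch_size = 2 follow Qwen2-VL's defaults, and the extra merge_size grouping performed by the real processor is omitted for brevity:

```python
import numpy as np

patch_size, temporal_patch_size, channel = 14, 2, 3
frames, height, width = 4, 280, 420                 # a resized 4-frame clip

grid_t = frames // temporal_patch_size              # 2
grid_h = height // patch_size                       # 20
grid_w = width // patch_size                        # 30

video = np.random.rand(frames, channel, height, width).astype(np.float32)

# Split (T, C, H, W) into temporal/spatial patch blocks, then flatten each block
patches = video.reshape(
    grid_t, temporal_patch_size, channel,
    grid_h, patch_size,
    grid_w, patch_size,
)
patches = patches.transpose(0, 3, 5, 2, 1, 4, 6)    # (grid_t, grid_h, grid_w, C, tps, ps, ps)
flatten_patches = patches.reshape(
    grid_t * grid_h * grid_w,
    channel * temporal_patch_size * patch_size * patch_size,
)
print(flatten_patches.shape)                        # (1200, 1176)
```

Each row of flatten_patches is one flattened patch vector that is handed to the ViT.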
Then, Qwen2-VL "stacks" each image into two frames, so a single image becomes a small "video" of two identical frames; this allows images and videos to share the same patch segmentation method:
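A minimal sketch of this stacking step (the np.tile call and shapes are illustrative; the actual processor performs the equivalent duplication inside its preprocessing routine):

```python
import numpy as np

temporal_patch_size = 2
image = np.random.rand(1, 3, 280, 420).astype(np.float32)  # a single image: (T=1, C, H, W)

# Duplicate the single frame so the "video" has temporal_patch_size frames,
# after which the video patch-splitting logic above applies unchanged.
if image.shape[0] == 1:
    image = np.tile(image, (temporal_patch_size, 1, 1, 1))

print(image.shape)  # (2, 3, 280, 420)
```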