Authors: Jian Hu
First published on: 2024/12/1
Multimodal large language models (MLLMs) are designed to process and integrate information from multiple modalities, such as text, images, and audio. Their architectures typically fuse these modalities through techniques such as cross-attention or token concatenation.
For visual large models, the first step is to convert images or videos into a format that the large language model (LLM) can consume. Typically, this involves dividing the images into patches, which are encoded by a Vision Transformer (ViT). The encoded patches are then passed through a projector MLP (multi-layer perceptron) that maps them into the LLM's embedding space. Finally, these visual tokens are fed into the LLM together with the text tokens. We will walk through this process step by step using Qwen2-VL, Llama3.2 Vision, and InternVL-2.5 as examples.
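To make the flow concrete, here is a minimal, hypothetical PyTorch sketch of the patch-encode-project-concatenate pipeline; the module names, hidden sizes, and the two-layer MLP projector are illustrative assumptions rather than the actual implementation of any of these models:

```python
import torch
import torch.nn as nn

class VisionToLLMBridge(nn.Module):
    """Illustrative sketch: ViT features -> MLP projector -> LLM embedding space."""

    def __init__(self, vit: nn.Module, vit_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.vit = vit                      # encodes image patches into vision features
        self.projector = nn.Sequential(     # maps ViT features into the LLM embedding space
            nn.Linear(vit_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patches: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        vision_feats = self.vit(patches)              # (num_patches, vit_dim)
        visual_tokens = self.projector(vision_feats)  # (num_patches, llm_dim)
        # The visual tokens are concatenated with the text token embeddings,
        # and the combined sequence is fed into the LLM.
        return torch.cat([visual_tokens, text_embeds], dim=0)
```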
In the image processor, Qwen2-VL processes the input image. Overall, Qwen2-VL does two things at this step:
1. Resize the image so that its height and width are aligned to the patch grid, while keeping the total pixel count within a minimum/maximum budget (see the sketch below).
2. Split the resized image into fixed-size patches and flatten them into the layout expected by the ViT.
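A simplified sketch of the resizing idea follows; the factor of 28 (patch_size * merge_size) and the default pixel bounds are assumptions taken from Qwen2-VL's released configuration, so treat the exact numbers as illustrative:

```python
import math

def smart_resize(height: int, width: int, factor: int = 28,
                 min_pixels: int = 56 * 56, max_pixels: int = 14 * 14 * 4 * 1280):
    """Round H and W to multiples of `factor`, keeping the pixel count in range."""
    h_bar = round(height / factor) * factor
    w_bar = round(width / factor) * factor
    if h_bar * w_bar > max_pixels:
        # Too many pixels: scale down, flooring to a multiple of `factor`
        beta = math.sqrt((height * width) / max_pixels)
        h_bar = math.floor(height / beta / factor) * factor
        w_bar = math.floor(width / beta / factor) * factor
    elif h_bar * w_bar < min_pixels:
        # Too few pixels: scale up, ceiling to a multiple of `factor`
        beta = math.sqrt(min_pixels / (height * width))
        h_bar = math.ceil(height * beta / factor) * factor
        w_bar = math.ceil(width * beta / factor) * factor
    return h_bar, w_bar

print(smart_resize(1080, 1920))  # a 1080p frame is snapped to factor-aligned dimensions
```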
Some models, such as Llama3.2 Vision and InternVL-2.5, instead first find the closest aspect ratio from a set of candidate tile grids and then resize the image to match it.
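As a rough sketch of how that selection can work (modeled on InternVL's open-source preprocessing; the candidate tile grids and the 448-pixel tile size are assumptions taken from its released code, not a verbatim copy):

```python
def find_closest_aspect_ratio(width: int, height: int, max_tiles: int = 12):
    """Pick the tile grid (cols, rows) whose aspect ratio best matches the input image."""
    aspect_ratio = width / height
    # Enumerate candidate grids: every (cols, rows) with cols * rows <= max_tiles
    candidates = sorted({(c, r) for c in range(1, max_tiles + 1)
                         for r in range(1, max_tiles + 1) if c * r <= max_tiles})
    best, best_diff = (1, 1), float("inf")
    for cols, rows in candidates:
        diff = abs(aspect_ratio - cols / rows)
        if diff < best_diff:
            best, best_diff = (cols, rows), diff
    return best

cols, rows = find_closest_aspect_ratio(1280, 720)
# The image would then be resized to (cols * 448, rows * 448) and cut into cols * rows tiles.
print(cols, rows)
```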
What if the content we want to process is not an image but a video? What additional processing is needed? Qwen2-VL segments the patches along the temporal dimension as well; temporal_patch_size is used to capture the video's temporal features:
```python
self.temporal_patch_size = 2  # every two consecutive frames form one temporal patch
channel * self.temporal_patch_size * self.patch_size * self.patch_size  # dimension of one flattened patch
```
For a video, the number of patches sent to the ViT can be described as:
grid_t * grid_h * grid_w
Therefore, in the image processor, you can see:
```python
# This is key to understanding what the image processor is doing
flatten_patches = patches.reshape(
    grid_t * grid_h * grid_w, channel * self.temporal_patch_size * self.patch_size * self.patch_size
)
```
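As a concrete, simplified illustration of this reshape, here is a small NumPy example; patch_size = 14 and temporal_patch_size = 2 follow Qwen2-VL's defaults, and the extra merge_size grouping performed by the real processor is omitted for brevity:

```python
import numpy as np

patch_size, temporal_patch_size, channel = 14, 2, 3
frames, height, width = 4, 280, 420                 # a resized 4-frame clip

grid_t = frames // temporal_patch_size              # 2
grid_h = height // patch_size                       # 20
grid_w = width // patch_size                        # 30

video = np.random.rand(frames, channel, height, width).astype(np.float32)

# Split (T, C, H, W) into temporal/spatial patch blocks, then flatten each block
patches = video.reshape(
    grid_t, temporal_patch_size, channel,
    grid_h, patch_size,
    grid_w, patch_size,
)
patches = patches.transpose(0, 3, 5, 2, 1, 4, 6)    # (grid_t, grid_h, grid_w, C, tps, ps, ps)
flatten_patches = patches.reshape(
    grid_t * grid_h * grid_w,
    channel * temporal_patch_size * patch_size * patch_size,
)
print(flatten_patches.shape)                        # (1200, 1176)
```

Each row of flatten_patches is one flattened patch vector that is handed to the ViT.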
Then, Qwen2-VL "stacks" each image into two frames, so a single image becomes a small "video" of two identical frames; this allows images and videos to share the same patch segmentation method:
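A minimal sketch of this stacking step (the np.tile call and shapes are illustrative; the actual processor performs the equivalent duplication inside its preprocessing routine):

```python
import numpy as np

temporal_patch_size = 2
image = np.random.rand(1, 3, 280, 420).astype(np.float32)  # a single image: (T=1, C, H, W)

# Duplicate the single frame so the "video" has temporal_patch_size frames,
# after which the video patch-splitting logic above applies unchanged.
if image.shape[0] == 1:
    image = np.tile(image, (temporal_patch_size, 1, 1, 1))

print(image.shape)  # (2, 3, 280, 420)
```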