Multimodal

Vision Language Models (VLM)

Bridging the gap between sight and language: How models like GPT-4V and LLaVA understand and reason about images.

Seeing is Believing

Vision Language Models (VLMs) combine a vision encoder (like ViT) with a language model (LLM) to perform tasks that require understanding both modalities. They can analyze photos, read charts, solve math problems from screenshots, and caption complex scenes.

Architecture: The Projector

The magic of a VLM lies in how it connects image data to text data. This is done via a Multimodal Projector.

  • Vision Encoder: (e.g., CLIP-ViT) Breaks the image into patches and converts them into embeddings.
  • Projector (Adapter): A simple neural layer (Linear or MLP) that translates these "image embeddings" into "token embeddings" that the LLM understands (a minimal sketch follows this list).
  • Language Model: The LLM treats these translated image tokens just like text words, allowing it to "read" the image.
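
To make the projector concrete, here is a minimal PyTorch sketch. The dimensions and the two-layer MLP design are assumptions modeled loosely on LLaVA-1.5, not code taken from any particular library:

import torch
import torch.nn as nn

# Illustrative dimensions (assumptions, not tied to a specific checkpoint)
VISION_DIM = 1024   # e.g. CLIP ViT-L/14 patch embedding size
LLM_DIM = 4096      # e.g. hidden size of a 7B language model

class MultimodalProjector(nn.Module):
    """Maps vision-encoder patch embeddings into the LLM's token embedding space."""
    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        # LLaVA-1.5 uses a two-layer MLP with GELU; the original LLaVA used a single Linear
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_embeds: torch.Tensor) -> torch.Tensor:
        # (batch, num_patches, vision_dim) -> (batch, num_patches, llm_dim)
        return self.proj(patch_embeds)

# 576 patches corresponds to a 24x24 grid from a 336px ViT-L/14 image
image_tokens = MultimodalProjector(VISION_DIM, LLM_DIM)(torch.randn(1, 576, VISION_DIM))
print(image_tokens.shape)  # torch.Size([1, 576, 4096])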

LLaVA: Large Language and Vision Assistant

LLaVA is one of the most widely used open-source VLM architectures. It connects a CLIP vision encoder to a Llama-family LLM (originally Vicuna) through a simple projection layer, and is trained on visual instruction-following data.
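
For intuition, a visual instruction sample pairs an image with a multi-turn conversation. The sketch below mirrors the general structure of LLaVA's released instruction data; the exact field names and values are illustrative, not a guaranteed schema:

# Schematic LLaVA-style visual instruction sample (structure is illustrative)
sample = {
    "id": "000000123456",
    "image": "coco/train2017/000000123456.jpg",
    "conversations": [
        {"from": "human", "value": "<image>\nWhat is unusual about this scene?"},
        {"from": "gpt", "value": "A man is ironing clothes on a board attached to a moving taxi."},
    ],
}
# During training, the "<image>" placeholder is replaced by the projected image tokens,
# and the loss is computed only on the assistant ("gpt") turns.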

Implementation: Running a VLM

Using the transformers library to perform Visual Question Answering (VQA) with LLaVA:

from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration
import torch
from PIL import Image
import requests

# 1. Load Model
model_id = "llava-hf/llava-v1.6-mistral-7b-hf"
processor = LlavaNextProcessor.from_pretrained(model_id)
model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id, 
    torch_dtype=torch.float16, 
    low_cpu_mem_usage=True
) 
model.to("cuda")

# 2. Load Image
url = "https://github.com/haotian-liu/LLaVA/blob/main/images/view.jpg?raw=true"
image = Image.open(requests.get(url, stream=True).raw)

# 3. Prompt the Model
# The <image> placeholder marks where the image tokens are inserted into the prompt
prompt = "[INST] <image>\nWhat is shown in this image and what time of day is it? [/INST]"
inputs = processor(images=image, text=prompt, return_tensors="pt").to("cuda")

# 4. Generate Response
output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))
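
Recent versions of transformers can also build the prompt for you via the processor's chat template, which avoids hand-writing the [INST] ... [/INST] wrapper and the <image> placeholder. The snippet below is a sketch; availability and exact behavior depend on your transformers version:

# Optional: let the processor construct the prompt (recent transformers versions)
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What is shown in this image and what time of day is it?"},
        ],
    }
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors="pt").to("cuda")
output = model.generate(**inputs, max_new_tokens=100)

# Decode only the newly generated tokens, skipping the echoed prompt
print(processor.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))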

Capabilities & Limitations

OCR & Reasoning

Can read text in images and perform math or coding tasks based on it.
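
As a concrete example, the same pipeline loaded above can be pointed at a document photo or screenshot. The file name below is a placeholder; substitute any image that contains text:

# Hypothetical OCR-style question over a document image (placeholder file name)
doc_image = Image.open("receipt.jpg")
ocr_prompt = "[INST] <image>\nRead the total amount printed on this receipt and answer with just the number. [/INST]"
inputs = processor(images=doc_image, text=ocr_prompt, return_tensors="pt").to("cuda")
output = model.generate(**inputs, max_new_tokens=50)
print(processor.decode(output[0], skip_special_tokens=True))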

Spatial Hallucination

VLMs often struggle with precise counting and with exact spatial relations (e.g., left vs. right) or pixel coordinates of objects.