Vision Language Models (VLMs)
Bridging the gap between sight and language: How models like GPT-4V and LLaVA understand and reason about images.
Seeing is Believing
Vision Language Models (VLMs) combine a vision encoder (like ViT) with a language model (LLM) to perform tasks that require understanding both modalities. They can analyze photos, read charts, solve math problems from screenshots, and caption complex scenes.
Architecture: The Projector
The magic of a VLM lies in how it connects image data to text data. This is done via a Multimodal Projector.
- Vision Encoder: (e.g., CLIP-ViT) Breaks the image into patches and converts them into embeddings.
- Projector (Adapter): A simple neural layer (Linear or MLP) that translates these "image embeddings" into "token embeddings" that the LLM understands (see the sketch after this list).
- Language Model: The LLM treats these translated image tokens just like text words, allowing it to "read" the image.
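Below is a minimal sketch of the middle piece, the projector. The dimensions (1024-dim patch features from CLIP ViT-L/14, 4096-dim token embeddings for a 7B Llama/Vicuna) and the two-layer MLP design are assumptions modeled on LLaVA-1.5, not the exact library code:
import torch
import torch.nn as nn

class MultimodalProjector(nn.Module):
    # Illustrative LLaVA-style projector: maps vision-encoder patch embeddings
    # into the LLM's token-embedding space. Dimensions are assumptions.
    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        # LLaVA-1.5+ uses a two-layer MLP with GELU; LLaVA-1.0 used a single linear layer.
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_embeddings):
        # patch_embeddings: (batch, num_patches, vision_dim) from the vision encoder
        # returns:          (batch, num_patches, llm_dim) "image tokens" for the LLM
        return self.mlp(patch_embeddings)

projector = MultimodalProjector()
fake_patches = torch.randn(1, 576, 1024)  # e.g. 24x24 patches from a 336px image
image_tokens = projector(fake_patches)
print(image_tokens.shape)  # torch.Size([1, 576, 4096])
The projected image tokens are simply concatenated with the text token embeddings and fed to the LLM as one sequence, which is why the LLM can "read" the image without any architectural changes.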
LLaVA: Large Language and Vision Assistant
LLaVA is one of the most popular open-source VLM architectures. It connects a CLIP vision encoder to a Llama/Vicuna language model through a simple projection layer, trained on visual instruction-following data.
Implementation: Running a VLM
Using the transformers library to perform Visual Question Answering (VQA) with LLaVA:
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration
import torch
from PIL import Image
import requests
# 1. Load Model
model_id = "llava-hf/llava-v1.6-mistral-7b-hf"
processor = LlavaNextProcessor.from_pretrained(model_id)
model = LlavaNextForConditionalGeneration.from_pretrained(
model_id,
torch_dtype=torch.float16,
low_cpu_mem_usage=True
)
model.to("cuda")
# 2. Load Image
url = "https://github.com/haotian-liu/LLaVA/blob/main/images/view.jpg?raw=true"
image = Image.open(requests.get(url, stream=True).raw)
# 3. Prompt the Model
# The <image> placeholder marks where the image patches are inserted; [INST] ... [/INST] is the Mistral instruction format
prompt = "[INST] <image>\nWhat is shown in this image and what time of day is it? [/INST]"
inputs = processor(images=image, text=prompt, return_tensors="pt").to("cuda")
# 4. Generate Response
output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))
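If your transformers version is recent enough that the processor ships a chat template (the llava-hf checkpoints do), you can let it build the prompt instead of hand-writing the [INST] / <image> format. This sketch reuses the model, processor, and image loaded above:
# Alternative prompting via the processor's chat template (requires a recent transformers release)
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What is shown in this image and what time of day is it?"},
        ],
    },
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors="pt").to("cuda")
output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))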
Capabilities & Limitations
- OCR & Reasoning: VLMs can read text embedded in images and perform math or coding tasks based on it, e.g. solving an equation shown in a screenshot (see the example below).
- Spatial Hallucination: VLMs often struggle with precise counting and with exact spatial relationships between objects, such as which item is to the left or right of another.
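To see the OCR-and-reasoning capability in action, the model and processor loaded earlier can be pointed at a screenshot or document photo. The URL below is a placeholder and the prompt wording is just an example; results depend on image resolution and how legible the text is:
# Reuses `model` and `processor` from the implementation section above.
from PIL import Image
import requests

screenshot_url = "https://example.com/receipt.png"  # placeholder: substitute any image containing text
screenshot = Image.open(requests.get(screenshot_url, stream=True).raw)

prompt = "[INST] <image>\nRead the text in this image and add up all the prices you find. [/INST]"
inputs = processor(images=screenshot, text=prompt, return_tensors="pt").to("cuda")

output = model.generate(**inputs, max_new_tokens=200)
print(processor.decode(output[0], skip_special_tokens=True))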