Segment Anything Model (SAM)
The "GPT of Computer Vision." Meta's foundation model that can segment any object in an image, that is, identify and cut out its pixels, zero-shot.
Promptable Segmentation
Before SAM, training a segmentation model required thousands of manually outlined images for specific objects (e.g., a "dog segmenter"). SAM changed the game by responding to user prompts:
- Point: Click on an object.
- Box: Draw a box around an area.
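- Text: Type "segment the wheels". (The SAM paper explored text prompts only as a proof of concept; the publicly released model accepts points, boxes, and masks, not text.)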
Because it was trained on the massive SA-1B dataset (11 million images, over 1 billion masks), it has learned a general concept of "objects," allowing it to segment things it has never seen before (Zero-Shot).
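To make the point and box prompts above concrete, here is a minimal sketch of how they can be packed into arrays. The layout follows what `SamPredictor.predict` in the official `segment_anything` package accepts (an N x 2 array of pixel coordinates, an N-length array of labels with 1 for foreground and 0 for background, and a box in XYXY order). The helper names `make_point_prompt` and `make_box_prompt` are hypothetical and exist only for illustration.

```python
import numpy as np

def make_point_prompt(points, labels):
    """Pack clicks into (N, 2) coordinates and (N,) labels.

    points: iterable of (x, y) pixel coordinates.
    labels: iterable of 1 (foreground) or 0 (background), one per point.
    """
    coords = np.asarray(points, dtype=np.float32).reshape(-1, 2)
    labs = np.asarray(labels, dtype=np.int64).reshape(-1)
    if len(coords) != len(labs):
        raise ValueError("need exactly one label per point")
    if not np.isin(labs, (0, 1)).all():
        raise ValueError("labels must be 0 (background) or 1 (foreground)")
    return coords, labs

def make_box_prompt(x0, y0, x1, y1):
    """Return a box in XYXY pixel order, tolerating swapped corners."""
    return np.array(
        [min(x0, x1), min(y0, y1), max(x0, x1), max(y0, y1)],
        dtype=np.float32,
    )

# One foreground click on the object, one background click to exclude a region.
coords, labs = make_point_prompt([(500, 375), (520, 100)], [1, 0])
box = make_box_prompt(700, 600, 425, 200)
print(coords.shape, labs.tolist(), box.tolist())
# (2, 2) [1, 0] [425.0, 200.0, 700.0, 600.0]
```

With a loaded predictor, these arrays would be passed as the `point_coords`, `point_labels`, and `box` arguments of `predict`.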
SAM 2: Video Segmentation
Released in July 2024, SAM 2 unified image and video segmentation. You prompt an object on one frame and the model tracks it across the rest of the video. It keeps a memory of earlier frames and predictions, which lets it recover an object after occlusion (when the object is temporarily hidden behind another).
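The prompt-then-propagate workflow might look roughly like the sketch below. It assumes `predictor` is a video predictor from the `sam2` package (for example, one built with `build_sam2_video_predictor`) and that `video_dir` is a directory of extracted frames. The wrapper `track_object` is a hypothetical helper, and argument details may differ between `sam2` releases, so treat this as an outline rather than a reference.

```python
def track_object(predictor, video_dir, points, labels, obj_id=1, start_frame=0):
    """Prompt one object on one frame, then propagate it through the video.

    Returns {frame_index: {object_id: binary_mask}}.
    """
    # Load the frames and set up the per-video tracking state.
    state = predictor.init_state(video_path=video_dir)

    # Attach the click prompt(s) to a single frame.
    predictor.add_new_points_or_box(
        inference_state=state,
        frame_idx=start_frame,
        obj_id=obj_id,
        points=points,
        labels=labels,
    )

    masks_per_frame = {}
    # Each step yields mask logits for every tracked object on one frame.
    for frame_idx, obj_ids, mask_logits in predictor.propagate_in_video(state):
        masks_per_frame[frame_idx] = {
            # Logits above zero count as "inside the object".
            oid: mask_logits[i] > 0.0
            for i, oid in enumerate(obj_ids)
        }
    return masks_per_frame
```

With the real predictor the masks come back as tensors, so calling `.cpu().numpy()` on each one is a likely next step before saving or plotting.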
Architecture Overview
Image Encoder (Heavy)
A heavy ViT backbone runs once per image to create a rich embedding. This is the slow part.
Prompt Encoder & Mask Decoder (Light)
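These run in milliseconds; the SAM paper reports roughly 50 ms per prompt in a web browser on CPU. Once the image is encoded, the same embedding can be reused for any number of prompts, which is what makes interactive, in-browser prompting practical.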
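The encode-once, prompt-many split can be illustrated with a toy sketch. The class below is a stand-in, not SAM's real networks: the "encoder" is just a channel average and the "decoder" a brightness comparison, both assumptions made purely to show where the caching happens.

```python
import numpy as np

class CachedSegmenter:
    """Toy illustration of the heavy-encoder / light-decoder split."""

    def __init__(self):
        self.encoder_calls = 0
        self.embedding = None

    def set_image(self, image):
        # Stand-in for the heavy ViT encoder: runs once per image.
        self.encoder_calls += 1
        self.embedding = image.astype(np.float32).mean(axis=2)  # (H, W)

    def predict(self, x, y, tolerance=10.0):
        # Stand-in for the light decoder: only reads the cached embedding.
        if self.embedding is None:
            raise RuntimeError("call set_image() before predict()")
        return np.abs(self.embedding - self.embedding[y, x]) <= tolerance

image = np.zeros((64, 64, 3), dtype=np.uint8)
image[16:48, 16:48] = 200  # a bright square on a dark background

seg = CachedSegmenter()
seg.set_image(image)  # slow step, done once
masks = [seg.predict(x, y) for x, y in [(20, 20), (40, 40), (5, 5)]]

print(seg.encoder_calls)              # 1
print([int(m.sum()) for m in masks])  # [1024, 1024, 3072]
```

Three prompts were answered with a single encoder call, which is the same pattern as calling `set_image` once and `predict` many times on the real predictor.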
Python Usage
from segment_anything import sam_model_registry, SamPredictor
import cv2
import numpy as np
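# Load the ViT-H checkpoint (download it from the official repository first)
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
sam.to(device="cuda")  # use "cpu" if no GPU is available
predictor = SamPredictor(sam)
# Read the image; OpenCV loads BGR, SAM expects RGB
image = cv2.imread('dog.jpg')
if image is None:
    raise FileNotFoundError("could not read dog.jpg")
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
# Encode the image (slow step, done once)
predictor.set_image(image)
# Predict masks from a point prompt (fast step)
# Point at x=500, y=375; label 1 = foreground, 0 = background
input_point = np.array([[500, 375]])
input_label = np.array([1])
masks, scores, logits = predictor.predict(
    point_coords=input_point,
    point_labels=input_label,
    multimask_output=True,  # return several candidate masks with scores
)
best = np.argmax(scores)
print(f"Best mask score: {scores[best]:.3f}")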
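Once `predict` has returned its candidate masks and scores, a typical next step is to keep the highest-scoring mask and use it to cut the object out of the image. The sketch below does this with plain NumPy on synthetic arrays, so it runs without a checkpoint; `cut_out_best` is a hypothetical helper, and the array shapes (masks as K x H x W, scores as K) are assumed to match what the predictor returns.

```python
import numpy as np

def cut_out_best(image, masks, scores):
    """Keep the best-scoring mask, black out everything else, report a box."""
    best = int(np.argmax(scores))
    mask = masks[best].astype(bool)

    # Broadcast the (H, W) mask over the colour channels.
    cutout = np.where(mask[..., None], image, 0).astype(image.dtype)

    # Tight bounding box around the mask in XYXY order (None if mask is empty).
    ys, xs = np.nonzero(mask)
    bbox = (
        (int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max()))
        if xs.size
        else None
    )
    return best, cutout, bbox

# Synthetic stand-ins for the predictor's outputs.
image = np.full((8, 8, 3), 255, dtype=np.uint8)
masks = np.zeros((3, 8, 8), dtype=bool)
masks[1, 2:5, 3:7] = True
scores = np.array([0.5, 0.9, 0.7])

best, cutout, bbox = cut_out_best(image, masks, scores)
print(best, bbox)  # 1 (3, 2, 6, 4)
```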
Medical Adaptation (MedSAM)
SAM is a general-purpose model, and it can struggle on medical images such as CT and MRI scans, which look very different from the natural photos it was trained on. MedSAM is a version of SAM fine-tuned specifically for medical segmentation, showing that SAM can serve as a foundation for domain-specific tools.
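One practical gap between the two domains is the input format: the predictor above is fed an 8-bit RGB image, while a CT or MRI slice is usually a single-channel array with an arbitrary intensity range. A minimal sketch of one common way to bridge that gap is shown below. The percentile clipping values and the helper name `slice_to_rgb` are assumptions for illustration, not MedSAM's exact pipeline.

```python
import numpy as np

def slice_to_rgb(slice_2d, lower_pct=0.5, upper_pct=99.5):
    """Convert a single-channel scan slice to an (H, W, 3) uint8 image."""
    data = np.asarray(slice_2d, dtype=np.float32)

    # Clip extreme intensities so a few outliers do not flatten the contrast.
    lo, hi = np.percentile(data, [lower_pct, upper_pct])
    clipped = np.clip(data, lo, hi)

    # Rescale to 0..255 (a constant slice maps to all zeros).
    scaled = (clipped - lo) / max(float(hi - lo), 1e-8) * 255.0
    gray = np.rint(scaled).astype(np.uint8)

    # Repeat the grey channel three times to mimic an RGB image.
    return np.repeat(gray[..., None], 3, axis=2)

rng = np.random.default_rng(0)
ct_slice = rng.integers(-1000, 2000, size=(32, 32))  # synthetic intensities
rgb = slice_to_rgb(ct_slice)
print(rgb.shape, rgb.dtype)  # (32, 32, 3) uint8
```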