Masked Language Models (MLM)
The "Encoder-Only" powerhouses behind search engines, sentiment analysis, and semantic understanding.
Encoder vs. Decoder
While GPT (Generative Pre-trained Transformer) gets all the hype for writing text, BERT (Bidirectional Encoder Representations from Transformers) revolutionized how computers read and understand text.
| Feature | Encoder-Only (BERT) | Decoder-Only (GPT) |
|---|---|---|
| Direction | Bidirectional (sees future & past) | Unidirectional (left-to-right) |
| Best For | Classification, NER, Search | Text Generation, Chat |
| Pre-training | Masked Language Modeling (MLM) | Causal Language Modeling (CLM) |
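The "Direction" row comes down to the attention mask each architecture uses. Here is a minimal sketch (assuming PyTorch) contrasting the two: an encoder lets every position attend to every other position, while a decoder uses a lower-triangular mask so position i only sees positions ≤ i.

```python
import torch

seq_len = 4

# Encoder-style (BERT): every token attends to all tokens, past and future.
bidirectional_mask = torch.ones(seq_len, seq_len)

# Decoder-style (GPT): lower-triangular mask; token i attends only to tokens <= i.
causal_mask = torch.tril(torch.ones(seq_len, seq_len))

print(bidirectional_mask)
print(causal_mask)
```

In the causal mask, row i has exactly i + 1 ones, which is why a decoder can never "peek" at future words during training.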
The Masking Game
MLMs are trained by hiding words in a sentence and making the model guess them from the surrounding context. This forces the model to learn the relationship between every word and every other word.

Original: "The chef cooked a delicious meal."
Masked: "The [MASK] cooked a delicious [MASK]."
Target: "chef", "meal"
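The masking step above can be sketched in a few lines of Python. Note this is a simplification: BERT's actual recipe masks about 15% of tokens and, of those, replaces 80% with `[MASK]`, 10% with a random token, and leaves 10% unchanged. The `mask_tokens` helper below is hypothetical and shows only the basic mask-and-record-targets step.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Randomly hide tokens; return the masked sequence and the hidden targets."""
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            masked.append(mask_token)   # hide the token from the model
            targets[i] = tok            # remember what it must predict
        else:
            masked.append(tok)
    return masked, targets

tokens = "the chef cooked a delicious meal".split()
masked, targets = mask_tokens(tokens)
print(masked)
print(targets)
```

During pre-training, the model's loss is computed only at the masked positions: it must reconstruct each entry in `targets` from the unmasked context.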
Evolution of MLMs
BERT (2018)
Google's breakthrough. Added "Next Sentence Prediction" (NSP) task.
RoBERTa (2019)
Meta's optimized version. Removed NSP, trained longer on more data. Often better than BERT.
DeBERTa (2020)
Microsoft's upgrade, which uses a disentangled attention mechanism (separating content and position representations). It topped many NLU benchmarks such as SuperGLUE.
Implementation: Feature Extraction
Using a BERT-style encoder to get embeddings (numerical representations) of text for similarity search:
from transformers import AutoTokenizer, AutoModel
import torch
# Load a popular sentence-transformer model
model_id = "sentence-transformers/all-MiniLM-L6-v2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)
sentences = ["This is an example sentence", "Each sentence is converted"]
# Tokenize
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
# Forward pass
with torch.no_grad():
outputs = model(**inputs)
# Mean-pool the token embeddings, ignoring padding
# (the pooling recommended for this model; [CLS] pooling works less well here)
mask = inputs["attention_mask"].unsqueeze(-1).float()
embeddings = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
print(f"Embedding shape: {embeddings.shape}")
# Output: torch.Size([2, 384])
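Once you have sentence embeddings, similarity search reduces to comparing vectors, typically with cosine similarity. A minimal sketch, using random 384-dimensional vectors as stand-ins for real model output:

```python
import torch
import torch.nn.functional as F

# Dummy 384-dim embeddings standing in for real sentence embeddings
torch.manual_seed(0)
emb_a = torch.randn(384)
emb_b = torch.randn(384)

# Cosine similarity: 1.0 = same direction, ~0 = unrelated, -1.0 = opposite
sim = F.cosine_similarity(emb_a.unsqueeze(0), emb_b.unsqueeze(0)).item()
print(f"Cosine similarity: {sim:.3f}")
```

In a real search system you would precompute embeddings for your corpus, then rank documents by cosine similarity to the query's embedding.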