NLP Fundamentals

Masked Language Models (MLM)

The "Encoder-Only" powerhouses behind search engines, sentiment analysis, and semantic understanding.

Encoder vs. Decoder

While GPT (Generative Pre-trained Transformer) gets all the hype for writing text, BERT (Bidirectional Encoder Representations from Transformers) revolutionized how computers read and understand text.

Feature       | Encoder-Only (BERT)                 | Decoder-Only (GPT)
Direction     | Bidirectional (sees future & past)  | Unidirectional (left-to-right)
Best For      | Classification, NER, Search         | Text Generation, Chat
Pre-training  | Masked Language Modeling (MLM)      | Causal Language Modeling (CLM)
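
The practical difference shows up in how you use each model. Here is a minimal sketch using the Hugging Face pipeline API; the model IDs are illustrative choices, not requirements:

from transformers import pipeline

# Encoder-only: read the whole sentence at once and classify it
classifier = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")
print(classifier("This movie was surprisingly good!"))
# e.g. [{'label': 'POSITIVE', 'score': 0.999...}]

# Decoder-only: generate a continuation token by token, left to right
generator = pipeline("text-generation", model="gpt2")
print(generator("The movie was", max_new_tokens=10)[0]["generated_text"])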

The Masking Game

MLMs are trained by hiding words in a sentence and asking the model to guess them from the surrounding context. To do this well, the model must learn how every word relates to every other word in the sentence.

Original: "The chef cooked a delicious meal."
Masked:   "The [MASK] cooked a delicious [MASK]."
Target:   "chef", "meal"
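
You can watch this objective in action with a fill-mask pipeline; a minimal sketch, assuming the standard bert-base-uncased checkpoint:

from transformers import pipeline

# BERT fills in [MASK] using context from both sides of the blank
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

for pred in fill_mask("The [MASK] cooked a delicious meal.", top_k=3):
    print(f"{pred['token_str']:>10}  score={pred['score']:.3f}")
# Plausible completions include "chef" and "cook"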

Evolution of MLMs

BERT (2018)

Google's breakthrough. Added a "Next Sentence Prediction" (NSP) task alongside MLM.

RoBERTa (2019)

Meta's optimized version. Removed NSP, trained longer on more data. Often better than BERT.

DeBERTa (2020)

Microsoft's upgrade. Uses a disentangled attention mechanism and has held state-of-the-art results on many NLU benchmarks.

Implementation: Feature Extraction

Using a BERT-style encoder to turn text into embeddings (numerical vector representations) for similarity search:

from transformers import AutoTokenizer, AutoModel
import torch

# Load a popular sentence-transformer model
model_id = "sentence-transformers/all-MiniLM-L6-v2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

sentences = ["This is an example sentence", "Each sentence is converted"]

# Tokenize
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

# Forward pass
with torch.no_grad():
    outputs = model(**inputs)

# This model was trained with mean pooling, so average the token embeddings
# (weighted by the attention mask) rather than taking the [CLS] token
token_embeddings = outputs.last_hidden_state                 # [batch, seq_len, hidden]
mask = inputs["attention_mask"].unsqueeze(-1).float()        # [batch, seq_len, 1]
embeddings = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)

print(f"Embedding shape: {embeddings.shape}")
# Output: torch.Size([2, 384])
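
With the embeddings in hand, semantic similarity is just a vector comparison; for example, with cosine similarity:

import torch.nn.functional as F

# Higher cosine similarity = more semantically similar sentences
similarity = F.cosine_similarity(embeddings[0], embeddings[1], dim=0)
print(f"Cosine similarity: {similarity.item():.3f}")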