RAG (Retrieval-Augmented Generation)
Learn how RAG combines large language models with external knowledge retrieval to produce more accurate, grounded, and up-to-date responses. It is one of the most widely used patterns in modern AI applications.
What is RAG?
Retrieval-Augmented Generation (RAG) is an AI framework that enhances large language model (LLM) outputs by retrieving relevant information from external knowledge sources before generating a response. Instead of relying solely on the model's training data, RAG dynamically fetches the most relevant documents, passages, or data points and feeds them as context to the LLM.
RAG was introduced in 2020 by researchers at Facebook AI Research (now Meta AI) and academic collaborators, and has since become one of the most widely adopted patterns for building production AI applications. It reduces LLM hallucination by grounding responses in verifiable, up-to-date information, although it does not eliminate it: the answer is only as good as the retrieved context.
Think of RAG as giving an AI a library card — instead of answering from memory alone, it first looks up the relevant books, reads the pertinent sections, and then formulates an informed answer.
📄 Document Ingestion
Load PDFs, web pages, databases, and any text source into a searchable knowledge base.
🔢 Embedding & Chunking
Split documents into chunks and convert them into vector embeddings for semantic search.
🔍 Retrieval
When a user asks a question, find the most semantically relevant chunks using vector similarity.
🤖 Generation
Pass the retrieved chunks as context to the LLM, which generates an accurate, grounded answer.
How RAG Works — Step by Step
The Indexing Phase (Offline)
- Load Documents: Ingest your knowledge base — PDFs, web pages, databases, Notion docs, etc.
- Chunk Documents: Split large documents into smaller, semantically meaningful chunks (typically 200-1000 tokens).
- Generate Embeddings: Use an embedding model (e.g., OpenAI text-embedding-3-small) to convert each chunk into a dense vector.
- Store in Vector DB: Save the vectors + original text in a vector database like Pinecone, ChromaDB, or Weaviate.
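The indexing steps above can be sketched in plain Python. This is only an illustration under stated assumptions: `toy_embed` is a hypothetical hashed bag-of-words stand-in for a real embedding model, chunk sizes are counted in words rather than tokens, and a Python list plays the role of the vector database. The names `chunk_words`, `toy_embed`, and `build_index` are made up for this sketch.

```python
import hashlib
import math


def chunk_words(text: str, size: int = 50, overlap: int = 10) -> list[str]:
    """Split text into overlapping windows of `size` words (a stand-in for token-based chunking)."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def toy_embed(text: str, dim: int = 64) -> list[float]:
    """Hypothetical embedding: hash each word into one of `dim` buckets, then L2-normalise."""
    vec = [0.0] * dim
    for word in text.lower().split():
        bucket = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def build_index(documents: dict[str, str], size: int = 50, overlap: int = 10) -> list[dict]:
    """Chunk every document, embed each chunk, and keep vector and text side by side."""
    index = []
    for source, text in documents.items():
        for i, chunk in enumerate(chunk_words(text, size, overlap)):
            index.append(
                {"source": source, "chunk_id": i, "text": chunk, "vector": toy_embed(chunk)}
            )
    return index
```

Calling `build_index({"faq.txt": some_text})` returns one record per chunk. In a real pipeline the embedding call and the list would be replaced by an embedding model and a vector database, but the shape of the stored record (vector plus original text plus metadata) stays much the same.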
The Query Phase (Online)
- Embed the Query: Convert the user's question into a vector using the same embedding model.
- Retrieve Top-K: Search the vector database for the K most similar chunks (typically K=3-10).
- Augment Prompt: Insert the retrieved chunks into the LLM prompt as context.
- Generate Answer: The LLM reads the context and generates a grounded response.
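As a rough sketch of the query phase, the snippet below embeds a question, ranks chunks by cosine similarity, and assembles the augmented prompt. It assumes a hypothetical word-count "embedding" (`toy_embed`) in place of a real model and stops short of calling an LLM. The sample chunks and the prompt wording are illustrative, not taken from any particular library.

```python
import math
import re
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Hypothetical stand-in for an embedding model: sparse lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_top_k(query: str, chunks: list[str], k: int = 3) -> list[str]:
    """Embed the query with the same function as the chunks and return the k closest chunks."""
    q = toy_embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, toy_embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(question: str, context_chunks: list[str]) -> str:
    """Insert the retrieved chunks into the prompt as numbered context."""
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(context_chunks))
    return (
        "Answer using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


chunks = [
    "Shipping takes 3 to 5 business days.",
    "The refund policy allows returns within 30 days.",
    "Our support team is available on weekdays.",
]
question = "What is the refund policy?"
top = retrieve_top_k(question, chunks, k=2)
prompt = build_prompt(question, top)
print(prompt)  # this string is what would be sent to the LLM
```

Numbering the context entries is a design choice that makes it easier to ask the model to cite which chunk supported its answer.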
Code Examples
Python — Basic RAG with LangChain
# pip install langchain langchain-openai langchain-community langchain-text-splitters chromadb pypdf
# Requires the OPENAI_API_KEY environment variable to be set.
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Chroma
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader
# 1. Load and chunk documents
loader = PyPDFLoader("knowledge_base.pdf")
docs = loader.load()
# chunk_size and chunk_overlap are measured in characters here, not tokens
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(docs)
# 2. Create vector store
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(chunks, embeddings)
# 3. Retrieve the chunks most similar to the question
question = "What is the refund policy?"
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
relevant_docs = retriever.invoke(question)
# 4. Generate an answer grounded in the retrieved context
llm = ChatOpenAI(model="gpt-4o")
context = "\n\n".join(doc.page_content for doc in relevant_docs)
response = llm.invoke(
    f"Answer based on the context below:\n{context}\n\nQuestion: {question}"
)
print(response.content)
JavaScript — RAG with LlamaIndex
// npm install llamaindex
import { VectorStoreIndex, SimpleDirectoryReader } from "llamaindex";
// 1. Load documents from a directory
const documents = await new SimpleDirectoryReader()
.loadData("./data");
// 2. Create index (auto-chunks + embeds)
const index = await VectorStoreIndex.fromDocuments(documents);
// 3. Query with RAG
const queryEngine = index.asQueryEngine();
const response = await queryEngine.query(
"What are the key findings in the report?"
);
console.log(response.toString());
RAG Strategies & Best Practices
Naive RAG
Simple retrieve-then-generate. Good for prototyping but can retrieve irrelevant chunks.
Advanced RAG
Adds re-ranking, query rewriting, and hybrid search (keyword + vector), which typically improves retrieval accuracy over naive RAG at the cost of extra latency and complexity.
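One common way to combine keyword and vector results in hybrid search is reciprocal rank fusion (RRF). The source does not say which fusion method to use, so treat this as one possible sketch: it assumes you already have two ranked lists of document IDs (the IDs below are hypothetical) and uses the conventional smoothing constant of 60.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists: each document scores sum(1 / (k + rank)) over the lists it appears in."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical results from a keyword search and from a vector search.
keyword_ranking = ["doc_b", "doc_a", "doc_d"]
vector_ranking = ["doc_a", "doc_c", "doc_b"]

fused = reciprocal_rank_fusion([keyword_ranking, vector_ranking])
print(fused)
```

Because RRF only looks at ranks, it avoids having to put keyword scores and cosine similarities on a common scale. A re-ranking model could then be applied to the top of the fused list.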
Modular RAG
Separates routing, retrieval, and generation into composable modules for production systems.
Graph RAG
Uses knowledge graphs alongside vectors to capture entity relationships for complex queries.
Chunking Strategies
- Fixed-size chunking: Simple, splits every N tokens. Fast but can break mid-sentence.
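- Recursive chunking: Splits on paragraph → sentence → word boundaries, falling back to a finer boundary only when a piece is still too large. A common default (for example, LangChain's RecursiveCharacterTextSplitter).
- Semantic chunking: Groups sentences by embedding similarity. Often yields more coherent chunks, but is slower because every sentence must be embedded first.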
- Document-aware chunking: Respects headings, tables, and document structure.
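To make the difference between the first two strategies concrete, here is a minimal sketch of fixed-size and recursive chunking. It is not LangChain's implementation: sizes are counted in words and characters rather than tokens, the separator list is an assumption, and the function names are hypothetical.

```python
def fixed_size_chunks(words: list[str], size: int) -> list[str]:
    """Fixed-size chunking: cut every `size` words, regardless of sentence boundaries."""
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def recursive_chunks(
    text: str, max_chars: int, separators: tuple[str, ...] = ("\n\n", ". ", " ")
) -> list[str]:
    """Recursive chunking: try paragraph, then sentence, then word boundaries until pieces fit."""
    text = text.strip()
    if not text:
        return []
    if len(text) <= max_chars:
        return [text]
    if not separators:
        # No boundary left to split on: fall back to a hard cut.
        cuts = [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
        return [c for c in cuts if c.strip()]

    sep, finer = separators[0], separators[1:]
    parts = text.split(sep)
    # Keep the separator attached to the piece it ends, so punctuation is not lost.
    pieces = [p + sep for p in parts[:-1]] + [parts[-1]]

    chunks: list[str] = []
    current = ""
    for piece in pieces:
        if len(current) + len(piece) <= max_chars:
            current += piece  # merge small neighbours into one chunk
            continue
        if current.strip():
            chunks.append(current.strip())
        current = ""
        if len(piece) <= max_chars:
            current = piece
        else:
            # Still too large: retry this piece with the next, finer separator.
            chunks.extend(recursive_chunks(piece, max_chars, finer))
    if current.strip():
        chunks.append(current.strip())
    return chunks


text = "RAG retrieves context. It then generates an answer.\n\nChunking matters a lot."
chunks = recursive_chunks(text, max_chars=30)
print(chunks)
```

With a 30-character limit, the first paragraph is too long and gets split at its sentence boundary, while the second paragraph fits and stays whole. Fixed-size chunking applied to the same text would cut wherever the word count runs out.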
Use Cases & Applications
Customer Support Bots
RAG over help docs, FAQs, and ticket history to answer customer questions accurately.
Legal Research
Search case law, contracts, and regulations to assist lawyers with precedent and compliance.
Code Documentation
RAG over codebases and docs to help developers understand and navigate large projects.
Healthcare
Query medical literature and clinical guidelines for evidence-based decision support.
Popular RAG Tools & Frameworks
| Tool | Type | Best For | Price |
|---|---|---|---|
| LangChain | Framework | Full RAG pipelines | Free / OSS |
| LlamaIndex | Framework | Data indexing & query | Free / OSS |
| Pinecone | Vector DB | Production vector search | Free tier + paid |
| ChromaDB | Vector DB | Local development | Free / OSS |
| Weaviate | Vector DB | Hybrid search | Free / OSS + cloud |