AI Concepts

RAG (Retrieval-Augmented Generation)

Learn how RAG combines large language models with external knowledge retrieval to produce accurate, grounded, and up-to-date responses. It is one of the most widely used patterns in modern AI applications.

What is RAG?

Retrieval-Augmented Generation (RAG) is an AI framework that enhances large language model (LLM) outputs by retrieving relevant information from external knowledge sources before generating a response. Instead of relying solely on the model's training data, RAG dynamically fetches the most relevant documents, passages, or data points and feeds them as context to the LLM.

RAG was introduced in 2020 by researchers at Facebook AI Research (now Meta AI) and their academic collaborators, and has since become one of the most widely adopted patterns for building production AI applications. It reduces LLM hallucination by grounding responses in retrieved, verifiable sources, although answer quality still depends on what the retriever finds.

Think of RAG as giving an AI a library card — instead of answering from memory alone, it first looks up the relevant books, reads the pertinent sections, and then formulates an informed answer.

📄 Document Ingestion

Load PDFs, web pages, databases, and any text source into a searchable knowledge base.

🔢 Embedding & Chunking

Split documents into chunks and convert them into vector embeddings for semantic search.

🔍 Retrieval

When a user asks a question, find the most semantically relevant chunks using vector similarity.

🤖 Generation

Pass the retrieved chunks as context to the LLM, which generates an answer grounded in that context.

How RAG Works — Step by Step

The Indexing Phase (Offline)

1. Load Documents: Ingest your knowledge base (PDFs, web pages, databases, Notion docs, etc.).
2. Chunk Documents: Split large documents into smaller, semantically meaningful chunks (typically 200-1000 tokens).
3. Generate Embeddings: Use an embedding model (e.g., OpenAI text-embedding-3-small) to convert each chunk into a dense vector.
4. Store in Vector DB: Save the vectors together with the original text in a vector database like Pinecone, ChromaDB, or Weaviate.
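
The indexing steps above can be sketched in plain Python without any external service. This is a minimal illustration, not a production pipeline: toy_embed is a hypothetical stand-in that hashes words into buckets (a real system would call an embedding model), chunk sizes are counted in words rather than tokens, and the "vector database" is just an in-memory list. All function and class names are assumptions made for this example.

```python
import hashlib
import math
from dataclasses import dataclass

DIM = 64  # toy embedding size; real embedding models use far more dimensions


def chunk_words(text: str, size: int = 40, overlap: int = 10) -> list[str]:
    """Split text into overlapping windows of `size` words."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def toy_embed(text: str) -> list[float]:
    """Hypothetical stand-in for an embedding model: hashed bag of words, L2-normalised."""
    vec = [0.0] * DIM
    for word in text.lower().split():
        bucket = int(hashlib.md5(word.encode()).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


@dataclass
class IndexedChunk:
    text: str            # original chunk text, kept so it can be shown to the LLM
    vector: list[float]  # embedding used for similarity search


def build_index(documents: list[str]) -> list[IndexedChunk]:
    """Chunk every document, embed each chunk, and store text and vector together."""
    return [
        IndexedChunk(chunk, toy_embed(chunk))
        for doc in documents
        for chunk in chunk_words(doc)
    ]


sample_doc = " ".join(f"word{i}" for i in range(100))
index = build_index([sample_doc])
print(len(index), "chunks indexed")
```

Storing the chunk text next to its vector matters because the query phase searches over vectors but hands the matching text to the LLM.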

The Query Phase (Online)

1. Embed the Query: Convert the user's question into a vector using the same embedding model.
2. Retrieve Top-K: Search the vector database for the K most similar chunks (typically K=3-10).
3. Augment Prompt: Insert the retrieved chunks into the LLM prompt as context.
4. Generate Answer: The LLM reads the context and generates a grounded response.
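
The retrieval and prompt-augmentation steps can be illustrated with a small sketch. It assumes cosine similarity as the similarity measure (a common choice, though vector databases also offer others) and uses hand-written 3-dimensional vectors in place of real embeddings. The names cosine, retrieve_top_k, and build_prompt are hypothetical helpers for this example, and the final LLM call is left out.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_top_k(query_vec, index, k=3):
    """index is a list of (text, vector) pairs; return the k most similar texts."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(question: str, chunks: list[str]) -> str:
    """Insert the retrieved chunks into the prompt as numbered context passages."""
    context = "\n\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(chunks))
    return (
        "Answer using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


# Illustrative data: made-up vectors stand in for real chunk embeddings.
index = [
    ("Refunds are issued within 30 days of purchase.", [0.9, 0.1, 0.0]),
    ("Shipping takes 3 to 5 business days.", [0.1, 0.9, 0.0]),
    ("Support is available by email.", [0.0, 0.2, 0.9]),
]
query_vec = [0.8, 0.2, 0.1]  # pretend embedding of the question below

top_chunks = retrieve_top_k(query_vec, index, k=2)
prompt = build_prompt("What is the refund policy?", top_chunks)
print(prompt)
```

Numbering the context passages is a design choice that makes it easier to ask the model to cite which passage supports its answer.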

Code Examples

Python — Basic RAG with LangChain

# pip install langchain langchain-openai langchain-community langchain-text-splitters chromadb pypdf
# Requires the OPENAI_API_KEY environment variable to be set.
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Chroma
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

question = "What is the refund policy?"

# 1. Load and chunk documents
# Note: chunk_size and chunk_overlap are measured in characters here, not tokens.
loader = PyPDFLoader("knowledge_base.pdf")
docs = loader.load()
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(docs)

# 2. Create vector store (embeds every chunk and stores it in Chroma)
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(chunks, embeddings)

# 3. Retrieve the 5 chunks most similar to the question
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
relevant_docs = retriever.invoke(question)

# 4. Generate answer from the retrieved context
llm = ChatOpenAI(model="gpt-4o")
context = "\n\n".join(doc.page_content for doc in relevant_docs)
response = llm.invoke(
    f"Answer based on the context below:\n{context}\n\nQuestion: {question}"
)
print(response.content)

JavaScript — RAG with LlamaIndex

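// npm install llamaindex
// Requires the OPENAI_API_KEY environment variable; run as an ES module for top-level await.
// Package layout varies between llamaindex releases; check the docs for your installed
// version if an import fails.
import { VectorStoreIndex, SimpleDirectoryReader } from "llamaindex";

// 1. Load documents from a directory
const documents = await new SimpleDirectoryReader()
  .loadData({ directoryPath: "./data" });

// 2. Create index (auto-chunks + embeds)
const index = await VectorStoreIndex.fromDocuments(documents);

// 3. Query with RAG (query() takes an options object, not a bare string)
const queryEngine = index.asQueryEngine();
const response = await queryEngine.query({
  query: "What are the key findings in the report?",
});
console.log(response.toString());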

RAG Strategies & Best Practices

Naive RAG

Simple retrieve-then-generate. Good for prototyping but can retrieve irrelevant chunks.

Advanced RAG

Adds re-ranking, query rewriting, and hybrid search (keyword + vector), which typically improves retrieval quality over naive RAG.
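
One way to combine keyword and vector results in hybrid search is reciprocal rank fusion (RRF), sketched below. RRF is chosen here for illustration and is not the only fusion method; the document IDs and the two result lists are made up, and k=60 is a conventional default rather than a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists: each document scores 1 / (k + rank) per list it appears in."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical results for the same query from two retrievers.
keyword_hits = ["doc_b", "doc_a", "doc_d"]  # e.g. from a keyword (BM25-style) search
vector_hits = ["doc_a", "doc_c", "doc_b"]   # e.g. from a vector similarity search

fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)  # ['doc_a', 'doc_b', 'doc_c', 'doc_d']
```

Because RRF uses only ranks, it avoids having to put keyword scores and vector similarity scores on a common scale.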

Modular RAG

Separates routing, retrieval, and generation into composable modules for production systems.

Graph RAG

Uses knowledge graphs alongside vectors to capture entity relationships for complex queries.
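
As a rough sketch of the graph side of this idea, the example below expands a set of seed entities through a toy knowledge graph and turns the reachable facts into context sentences. Everything here is hypothetical: the triples are invented, graph_context is an illustrative helper, and a real Graph RAG system would also extract entities from the query and combine these facts with vector-retrieved chunks.

```python
# Toy knowledge graph as (subject, relation, object) triples. All data is invented.
TRIPLES = [
    ("Acme", "acquired", "Widgets Inc"),
    ("Widgets Inc", "manufactures", "Model X"),
    ("Model X", "covered_by", "2-year warranty"),
]


def graph_context(seed_entities, triples, hops=2):
    """Collect facts within `hops` steps of the seed entities (breadth-first expansion)."""
    seen = set(seed_entities)
    frontier = set(seed_entities)
    facts = []
    for _ in range(hops):
        next_frontier = set()
        for entity in sorted(frontier):  # sorted for deterministic output
            for triple in triples:
                subject, _, obj = triple
                if entity not in (subject, obj):
                    continue
                if triple not in facts:
                    facts.append(triple)
                for node in (subject, obj):
                    if node not in seen:
                        seen.add(node)
                        next_frontier.add(node)
        frontier = next_frontier
    # Render each triple as a plain sentence that can be placed in the LLM prompt.
    return [f"{s} {r.replace('_', ' ')} {o}" for s, r, o in facts]


context = graph_context(["Acme"], TRIPLES, hops=2)
print(context)  # ['Acme acquired Widgets Inc', 'Widgets Inc manufactures Model X']
```

A question such as "Who makes the products Acme now owns?" needs both facts, which sit in different places and might not be retrieved together by vector similarity alone.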

Chunking Strategies

Fixed-size chunking: Splits every N tokens. Simple and fast, but can break text mid-sentence.
Recursive chunking: Splits on paragraph, then sentence, then word boundaries until each chunk fits the size limit. A common default in RAG frameworks.
Semantic chunking: Groups adjacent sentences by embedding similarity. Often yields more coherent chunks, at the cost of extra embedding calls during indexing.
Document-aware chunking: Respects headings, tables, and document structure.
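
The difference between the first two strategies can be seen in a short sketch. It is a simplified illustration under stated assumptions: sizes are counted in characters rather than tokens, there is no overlap, the separator list is an arbitrary choice, and the separator between two chunks is dropped. Library splitters handle more cases than recursive_chunks does here.

```python
def fixed_size_chunks(text: str, size: int) -> list[str]:
    """Cut every `size` characters, regardless of sentence or word boundaries."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def recursive_chunks(text: str, size: int, separators=("\n\n", ". ", " ")) -> list[str]:
    """Split on the coarsest separator first; recurse with finer ones for oversized pieces."""
    if len(text) <= size:
        return [text] if text.strip() else []
    if not separators:
        return fixed_size_chunks(text, size)  # nothing left to split on
    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = f"{current}{sep}{piece}" if current else piece
        if len(candidate) <= size:
            current = candidate  # keep merging small pieces into one chunk
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= size:
            current = piece
        else:
            chunks.extend(recursive_chunks(piece, size, finer))
    if current:
        chunks.append(current)
    return chunks


sample = "RAG retrieves context. It then generates an answer.\n\nChunking matters."
print(fixed_size_chunks(sample, 30))  # cuts in the middle of words
print(recursive_chunks(sample, 30))   # cuts at paragraph and sentence boundaries
```

On the sample text, the fixed-size version splits inside words, while the recursive version keeps each sentence intact because every sentence fits within the 30-character limit.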

Use Cases & Applications

Customer Support Bots

RAG over help docs, FAQs, and ticket history to answer customer questions accurately.

Legal Research

Search case law, contracts, and regulations to assist lawyers with precedent and compliance.

Code Documentation

RAG over codebases and docs to help developers understand and navigate large projects.

Healthcare

Query medical literature and clinical guidelines for evidence-based decision support.

Popular RAG Tools & Frameworks

Tool	Type	Best For	Price
LangChain	Framework	Full RAG pipelines	Free / OSS
LlamaIndex	Framework	Data indexing & query	Free / OSS
Pinecone	Vector DB	Production vector search	Free tier + paid
ChromaDB	Vector DB	Local development	Free / OSS
Weaviate	Vector DB	Hybrid search	Free / OSS + cloud