Give your LLM access to external knowledge — the most important pattern in production AI.
Large language models have three critical limitations that RAG addresses:
Model training data has a cutoff date. Claude Opus's knowledge ends in early 2025. It can't answer questions about events that happened after that.
When models don't know something, they confidently make up plausible-sounding answers. Give them real facts, and hallucinations drop dramatically.
Models can't read your company's internal docs, codebase, or customer data. They only work with what you give them in the prompt.
And even when the data is available, you can't fit an entire knowledge base into the system prompt. RAG dynamically retrieves only the relevant documents at query time, saving tokens and cost.
Instead of hoping the model knows something, you retrieve relevant documents, inject them into the prompt, then ask the model to answer based on those facts. This is the most important pattern in production AI systems.
RAG follows a clear four-step pipeline. Understanding each step is key to building reliable systems.
Step 1: Query
The user asks a question. This query is passed to a retrieval system. The query itself isn't sent to Claude yet — it's used to find relevant documents.
Step 2: Retrieve
The retrieval system (typically a vector database like Pinecone, Weaviate, or Chroma) finds documents similar to the query. Most commonly, this uses semantic search — converting the query to numbers (embeddings) and finding nearby documents.
Step 3: Augment
The retrieved documents are injected into the prompt. A typical augmentation looks like:
You are a helpful customer support agent.
<context>
The following documents are relevant to the customer's question:
---
From our Knowledge Base (Product FAQ):
Q: How do I reset my password?
A: Click "Forgot Password" on the login page, enter your email,
and follow the verification link.
---
From our Support Docs:
Resetting passwords typically takes 5-10 minutes. If you don't receive
the email within 15 minutes, check your spam folder.
</context>
Based on the documents above, answer the customer's question.
If the documents don't contain the answer, say so.
User: I forgot my password. How do I reset it?
Step 4: Generate
Claude reads the augmented prompt (with context) and generates a response grounded in the retrieved facts. Because the model has the facts in front of it, hallucination becomes far less likely: it either answers from the docs or declines when they don't cover the question.
Without RAG, asking Claude a question is like asking someone to answer from memory alone. With RAG, it's like asking them a question while they're reading the relevant documents. They can cite sources and be confident their answer is accurate.
The magic of RAG happens in the retrieval step. How does the system know which documents are relevant? The answer: embeddings.
What Are Embeddings?
An embedding is a numerical representation of text. A model (like OpenAI's text-embedding-3-small) converts text into a vector — a list of numbers.
"How do I reset my password?"
↓ (embedding model)
[0.123, -0.456, 0.789, -0.321, ..., 0.654] ← 384 or 1536 dimensions
Semantic Similarity
The key insight: similar texts produce similar embeddings. You can measure similarity by computing the distance between vectors. The closer two vectors, the more similar the texts.
1. Convert the query to an embedding.
2. Find the nearest-neighbor documents in vector space.
3. Return the top-K results.
Specialized databases (Pinecone, Weaviate, Chroma, Milvus) store embeddings and enable fast nearest-neighbor search over millions of documents.
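As a concrete illustration of those three steps, here is a dependency-free sketch of cosine-similarity search. The 3-dimensional vectors below are invented for the example; real embeddings have hundreds or thousands of dimensions and come from an embedding model:

```python
import math

def cosine_similarity(a, b):
    # Angle-based similarity: 1.0 means same direction, near 0 means unrelated
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, docs, k=2):
    # docs: {doc_id: vector}; return the k ids closest to the query
    ranked = sorted(docs, key=lambda d: cosine_similarity(query_vec, docs[d]),
                    reverse=True)
    return ranked[:k]

# Toy vectors standing in for real embeddings
docs = {
    "password_faq": [0.9, 0.1, 0.2],
    "billing_doc":  [0.2, 0.8, 0.1],
    "pizza_recipe": [0.05, 0.3, 0.9],
}
query = [0.8, 0.2, 0.3]  # pretend embedding of "how do I reset my password?"
print(top_k(query, docs, k=2))
```

A vector database performs the same ranking, but with approximate nearest-neighbor indexes so it stays fast over millions of documents instead of three.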
Think of a library card catalog from the pre-digital era. Each book had an index card with subject tags. To find books on "machine learning," you'd look up that subject tag and get a list of relevant books. Embeddings play a similar role: a numerical "subject tag" that lets you find related documents instantly, except the tag is learned from meaning, so texts can match even when they share no exact words.
Not all embedding models are equally good. Using a weak embedding model means poor retrieval, which means you'll pass irrelevant documents to Claude. Always benchmark your retrieval quality. OpenAI's text-embedding-3 models and Voyage AI's embeddings (the provider Anthropic partners with) are solid choices.
Raw documents are usually too long to embed whole: embedding models have input limits, and retrieval works best at a finer granularity than entire documents. You need to break them into chunks, and chunking strategy dramatically affects retrieval quality.
Fixed-size: Split every N tokens/words. Pros: Simple, predictable. Cons: Ignores semantic boundaries, may split sentences.
Semantic: Split based on topic changes and paragraph breaks. Pros: Preserves meaning. Cons: More complex, variable chunk sizes.
Recursive: Try semantic boundaries first, fall back to fixed-size if needed. Pros: Best of both worlds. Cons: More complex to implement.
Document-aware: Respect document structure (chapters, sections, code functions). Pros: Optimal for the domain. Cons: Requires domain knowledge.
Best Practices
```python
def chunk_text(text, chunk_size=512, overlap=50):
    # Simple fixed-size chunking with overlap
    tokens = text.split()
    chunks = []
    for i in range(0, len(tokens), chunk_size - overlap):
        chunk = ' '.join(tokens[i:i + chunk_size])
        chunks.append(chunk)
    return chunks

# Usage:
doc = "Your long document here..."
chunks = chunk_text(doc, chunk_size=512, overlap=50)
```
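The same idea can respect document structure instead of a raw token count. Below is a minimal sketch of paragraph-aware chunking, assuming paragraphs are separated by blank lines and using whitespace-split words as a rough token proxy:

```python
def chunk_by_paragraphs(text, max_words=200):
    # Document-aware chunking: split on blank lines, then pack whole
    # paragraphs into chunks without exceeding a word budget
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current, count = [], [], 0
    for para in paragraphs:
        words = len(para.split())
        if current and count + words > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += words
    if current:
        chunks.append("\n\n".join(current))
    return chunks

doc = ("Intro paragraph about the product.\n\n"
       "A second paragraph describing authentication.\n\n"
       "A third paragraph covering rate limits.")
print(chunk_by_paragraphs(doc, max_words=10))
```

Because paragraphs are never split mid-sentence, each chunk stays self-contained, at the cost of variable chunk sizes.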
How do you inject retrieved context into a prompt? There's a proven pattern that works well.
```python
import anthropic

client = anthropic.Anthropic()

# 1. Retrieve relevant documents (from vector DB)
query = "How do I configure API authentication?"
retrieved_docs = retrieve_from_vector_db(query)  # Returns top K docs

# 2. Format context
context = "\n\n---\n\n".join(retrieved_docs)

# 3. Build augmented prompt
SYSTEM_PROMPT = """You are a technical support assistant.
Use the provided documentation to answer questions.
If the docs don't contain the answer, say so clearly.
Never make up information."""

user_message = f"""<context>
{context}
</context>

User Question: {query}"""

# 4. Call Claude with augmented prompt
response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=1024,
    system=SYSTEM_PROMPT,
    messages=[{"role": "user", "content": user_message}],
)

print(response.content[0].text)
```
Using <context> tags makes it clear to the model where the retrieved docs are.
Claude (and other models) parses XML structure especially well. Consider also tagging individual documents:
<document source="FAQ.md">...</document>
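One way to do that tagging is a small formatting helper. The `docs` shape below (dicts with `"source"` and `"text"` keys) is a hypothetical convention, not a library API; adapt it to whatever your retrieval layer returns:

```python
def format_context(docs):
    # docs: list of {"source": ..., "text": ...} dicts (hypothetical shape)
    # Wrap each retrieved doc in a <document> tag carrying its source
    tagged = [
        f'<document source="{d["source"]}">\n{d["text"]}\n</document>'
        for d in docs
    ]
    return "<context>\n" + "\n\n".join(tagged) + "\n</context>"

docs = [
    {"source": "FAQ.md", "text": "Click 'Forgot Password' on the login page."},
    {"source": "support.md", "text": "Reset emails arrive within 15 minutes."},
]
print(format_context(docs))
```

Per-document source tags also make it easy to ask Claude to cite which document its answer came from.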
1. Irrelevant Retrieval
You retrieve documents that don't actually answer the question. The vector database finds topically related documents but not the specific answer.
Fixes: Better embedding model, better chunking, increase top-K results, hybrid search (vector + keyword).
2. Lost in the Middle
When you retrieve many documents (10+), Claude sometimes misses important information in the middle and focuses on the beginning and end.
Fixes: Rerank documents by relevance before injecting, keep only the top 3-5, or use a more recent model (newer models handle long context better).
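Production rerankers are usually dedicated models (cross-encoders such as Cohere's Rerank or the open bge-reranker family), but the idea can be illustrated with a simple keyword-overlap score, a rough stand-in for a real reranker:

```python
import re

def terms(text):
    # Lowercased word set, punctuation stripped
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def rerank(query, docs, keep=3):
    # Score each doc by how many query terms it shares, keep the best
    q = terms(query)
    return sorted(docs, key=lambda d: len(q & terms(d)), reverse=True)[:keep]

docs = [
    "Billing is handled monthly via invoice.",
    "To reset your password, click 'Forgot Password' on the login page.",
    "Webhooks are configured under Settings.",
    "Password requirements: 12 characters minimum.",
]
print(rerank("how do I reset my password", docs, keep=2))
```

Keeping only the reranked top few documents both avoids the lost-in-the-middle effect and cuts token cost.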
3. Context Overflow
You retrieve so much context that you exceed the model's context window or make the request very expensive.
Fixes: Tighter chunk sizes, lower top-K, estimate context size before the API call, or pre-filter and summarize chunks with a cheaper model before the final call.
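Estimating context size before the call can be done with a rough rule of thumb (about 4 characters per token for English text, a common heuristic rather than an exact tokenizer) plus a budget-trimming helper:

```python
def estimate_tokens(text):
    # Very rough heuristic: ~4 characters per token for English text
    return max(1, len(text) // 4)

def fit_to_budget(docs, budget=2000):
    # Assumes docs are already sorted most-relevant first;
    # keep adding documents until the budget would be exceeded
    kept, used = [], 0
    for doc in docs:
        cost = estimate_tokens(doc)
        if used + cost > budget:
            break
        kept.append(doc)
        used += cost
    return kept

sorted_docs = ["a" * 400, "b" * 400, "c" * 400]  # each roughly 100 tokens
print(len(fit_to_budget(sorted_docs, budget=250)))
```

For exact counts you can use your provider's token-counting endpoint or tokenizer library instead of the heuristic.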
4. Conflicting Information
You retrieve documents with conflicting information. Claude has to choose which to trust, and it might not choose the right one.
Fixes: Include source metadata (document date, version), tell Claude which sources are authoritative, validate inconsistencies in your knowledge base.
Let's build a complete RAG system using Chroma (an open-source vector database) and the Anthropic API.
```python
#!/usr/bin/env python3
import anthropic
import chromadb

# Initialize clients
client = anthropic.Anthropic()
chroma_client = chromadb.Client()

# Create a collection (like a table)
collection = chroma_client.create_collection(name="company_docs")

# Add documents to the vector store
documents = [
    "Our API uses OAuth 2.0 for authentication. "
    "Request an access token with your client ID and secret.",
    "Rate limits: 100 requests per minute for free tier, "
    "1000 per minute for premium.",
    "To set up webhooks, go to Settings > Integrations. "
    "All webhook payloads include a signature for verification.",
]

# Chroma automatically embeds documents
collection.add(
    documents=documents,
    ids=["doc1", "doc2", "doc3"],
)

# User query
query = "How do I authenticate with your API?"

# Retrieve relevant documents
results = collection.query(
    query_texts=[query],
    n_results=2,  # Top 2 results
)

# Format retrieved docs
context = "\n".join(results["documents"][0])

# Build augmented prompt
system_prompt = """You are a helpful API documentation assistant.
Answer questions using the provided documentation.
If the docs don't address the question, say so."""

user_message = f"""<context>
{context}
</context>

Question: {query}"""

# Call Claude
response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=500,
    system=system_prompt,
    messages=[{"role": "user", "content": user_message}],
)

print("Answer:", response.content[0].text)
```
Install dependencies: pip install anthropic chromadb
Set your API key: export ANTHROPIC_API_KEY="..."
Run the script. It will retrieve relevant docs and generate an answer grounded in your knowledge base.
In production, replace Chroma with Pinecone or Weaviate for scalability. Add document metadata (source, date). Implement reranking to filter low-relevance results. Monitor retrieval quality with a test set.
1. What are the three main limitations of LLMs that RAG addresses?
2. What is the second step in the RAG pipeline?
3. What is an embedding?
4. What is "lost in the middle"?
RAG (Retrieval-Augmented Generation) is the most important pattern in production AI. It solves three critical LLM limitations: stale knowledge, hallucinations, and lack of private data access. The pipeline is simple: Query → Retrieve → Augment → Generate. Retrieval uses embeddings and vector search to find relevant documents. Chunking strategy affects retrieval quality significantly. Common pitfalls include irrelevant retrieval, lost-in-the-middle effects, context overflow, and conflicting information — all have concrete fixes.
Next up → Topic 8: Tool Use & Function Calling
Extend your LLM beyond text — let it call APIs, run code, and take action in the real world.