Give your LLM access to external knowledge — the most important pattern in production AI.
Large language models have three critical limitations that RAG addresses:
Model training data has a cutoff date. Claude Opus's knowledge ends in early 2025. It can't answer questions about events that happened after that.
When models don't know something, they confidently make up plausible-sounding answers. Give them real facts, and hallucinations drop dramatically.
Models can't read your company's internal docs, codebase, or customer data. They only work with what you give them in the prompt.
And even when the data is available, you can't fit an entire knowledge base into the system prompt. RAG dynamically retrieves only the relevant documents at query time, saving tokens and cost.
Instead of hoping the model knows something, you retrieve relevant documents, inject them into the prompt, then ask the model to answer based on those facts. This is the most important pattern in production AI systems.
RAG follows a clear four-step pipeline. Understanding each step is key to building reliable systems.
Step 1: Query
The user asks a question. This query is passed to a retrieval system. The query itself isn't sent to Claude yet — it's used to find relevant documents.
Step 2: Retrieve
The retrieval system (typically a vector database like Pinecone, Weaviate, or Chroma) finds documents similar to the query. Most commonly, this uses semantic search — converting the query to numbers (embeddings) and finding nearby documents.
Step 3: Augment
The retrieved documents are injected into the prompt. A typical augmentation looks like:
You are a helpful customer support agent.
<context>
The following documents are relevant to the customer's question:
---
From our Knowledge Base (Product FAQ):
Q: How do I reset my password?
A: Click "Forgot Password" on the login page, enter your email,
and follow the verification link.
---
From our Support Docs:
Resetting passwords typically takes 5-10 minutes. If you don't receive
the email within 15 minutes, check your spam folder.
</context>
Based on the documents above, answer the customer's question.
If the documents don't contain the answer, say so.
User: I forgot my password. How do I reset it?
Step 4: Generate
Claude reads the augmented prompt (with context) and generates a response grounded in the retrieved facts. Because the model has the facts in front of it, hallucination becomes far less likely: it either answers from the docs or declines when they don't cover the question.
Without RAG, asking Claude a question is like asking someone to answer from memory alone. With RAG, it's like asking them a question while they're reading the relevant documents. They can cite sources and be confident their answer is accurate.
The magic of RAG happens in the retrieval step. How does the system know which documents are relevant? The answer: embeddings.
What Are Embeddings?
An embedding is a numerical representation of text. A model (like OpenAI's text-embedding-3-small) converts text into a vector — a list of numbers.
"How do I reset my password?"
↓ (embedding model)
[0.123, -0.456, 0.789, -0.321, ..., 0.654] ← 384 or 1536 dimensions
Semantic Similarity
The key insight: similar texts produce similar embeddings. You can measure similarity by computing the distance between vectors. The closer two vectors, the more similar the texts.
1. Convert the query to an embedding.
2. Find the nearest-neighbor documents in vector space.
3. Return the top-K results.
Specialized databases (Pinecone, Weaviate, Chroma, Milvus) store embeddings and enable fast nearest-neighbor search over millions of documents.
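As a concrete illustration of those three steps, here is a dependency-free sketch of cosine-similarity search. The 3-dimensional vectors below are invented for the example; real embeddings have hundreds or thousands of dimensions and come from an embedding model:

```python
import math

def cosine_similarity(a, b):
    # Angle-based similarity: 1.0 means same direction, near 0 means unrelated
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, docs, k=2):
    # docs: {doc_id: vector}; return the k ids closest to the query
    ranked = sorted(docs, key=lambda d: cosine_similarity(query_vec, docs[d]),
                    reverse=True)
    return ranked[:k]

# Toy vectors standing in for real embeddings
docs = {
    "password_faq": [0.9, 0.1, 0.2],
    "billing_doc":  [0.2, 0.8, 0.1],
    "pizza_recipe": [0.05, 0.3, 0.9],
}
query = [0.8, 0.2, 0.3]  # pretend embedding of "how do I reset my password?"
print(top_k(query, docs, k=2))
```

A vector database performs the same ranking, but with approximate nearest-neighbor indexes so it stays fast over millions of documents instead of three.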
Think of a library card catalog from the pre-digital era. Each book had an index card with subject tags. To find books on "machine learning," you'd look up that subject tag and get a list of relevant books. Embeddings play a similar role: a numerical "subject tag" that lets you find related documents instantly, except the tag is learned from meaning, so texts can match even when they share no exact words.
Not all embedding models are equally good. Using a weak embedding model means poor retrieval, which means you'll pass irrelevant documents to Claude. Always benchmark your retrieval quality. OpenAI's text-embedding-3 models and Voyage AI's embeddings (the provider Anthropic partners with) are solid choices.
Raw documents are usually too long to embed whole: embedding models have input limits, and retrieval works best at a finer granularity than entire documents. You need to break them into chunks, and chunking strategy dramatically affects retrieval quality.
Fixed-size: Split every N tokens/words. Pros: Simple, predictable. Cons: Ignores semantic boundaries, may split sentences.
Semantic: Split based on topic changes and paragraph breaks. Pros: Preserves meaning. Cons: More complex, variable chunk sizes.
Recursive: Try semantic boundaries first, fall back to fixed-size if needed. Pros: Best of both worlds. Cons: More complex to implement.
Document-aware: Respect document structure (chapters, sections, code functions). Pros: Optimal for the domain. Cons: Requires domain knowledge.
Best Practices
```python
def chunk_text(text, chunk_size=512, overlap=50):
    # Simple fixed-size chunking with overlap
    tokens = text.split()
    chunks = []
    for i in range(0, len(tokens), chunk_size - overlap):
        chunk = ' '.join(tokens[i:i + chunk_size])
        chunks.append(chunk)
    return chunks

# Usage:
doc = "Your long document here..."
chunks = chunk_text(doc, chunk_size=512, overlap=50)
```
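The same idea can respect document structure instead of a raw token count. Below is a minimal sketch of paragraph-aware chunking, assuming paragraphs are separated by blank lines and using whitespace-split words as a rough token proxy:

```python
def chunk_by_paragraphs(text, max_words=200):
    # Document-aware chunking: split on blank lines, then pack whole
    # paragraphs into chunks without exceeding a word budget
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current, count = [], [], 0
    for para in paragraphs:
        words = len(para.split())
        if current and count + words > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += words
    if current:
        chunks.append("\n\n".join(current))
    return chunks

doc = ("Intro paragraph about the product.\n\n"
       "A second paragraph describing authentication.\n\n"
       "A third paragraph covering rate limits.")
print(chunk_by_paragraphs(doc, max_words=10))
```

Because paragraphs are never split mid-sentence, each chunk stays self-contained, at the cost of variable chunk sizes.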
How do you inject retrieved context into a prompt? There's a proven pattern that works well.
```python
import anthropic

client = anthropic.Anthropic()

# 1. Retrieve relevant documents (from vector DB)
query = "How do I configure API authentication?"
retrieved_docs = retrieve_from_vector_db(query)  # Returns top K docs

# 2. Format context
context = "\n\n---\n\n".join(retrieved_docs)

# 3. Build augmented prompt
SYSTEM_PROMPT = """You are a technical support assistant.
Use the provided documentation to answer questions.
If the docs don't contain the answer, say so clearly.
Never make up information."""

user_message = f"""<context>
{context}
</context>

User Question: {query}"""

# 4. Call Claude with augmented prompt
response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=1024,
    system=SYSTEM_PROMPT,
    messages=[{"role": "user", "content": user_message}],
)

print(response.content[0].text)
```
Using <context> tags makes it clear to the model where the retrieved docs are.
Claude (and other models) parses XML structure especially well. Consider also tagging individual documents:
<document source="FAQ.md">...</document>
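One way to do that tagging is a small formatting helper. The `docs` shape below (dicts with `"source"` and `"text"` keys) is a hypothetical convention, not a library API; adapt it to whatever your retrieval layer returns:

```python
def format_context(docs):
    # docs: list of {"source": ..., "text": ...} dicts (hypothetical shape)
    # Wrap each retrieved doc in a <document> tag carrying its source
    tagged = [
        f'<document source="{d["source"]}">\n{d["text"]}\n</document>'
        for d in docs
    ]
    return "<context>\n" + "\n\n".join(tagged) + "\n</context>"

docs = [
    {"source": "FAQ.md", "text": "Click 'Forgot Password' on the login page."},
    {"source": "support.md", "text": "Reset emails arrive within 15 minutes."},
]
print(format_context(docs))
```

Per-document source tags also make it easy to ask Claude to cite which document its answer came from.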
1. Irrelevant Retrieval
You retrieve documents that don't actually answer the question. The vector database finds topically related documents but not the specific answer.
Fixes: Better embedding model, better chunking, increase top-K results, hybrid search (vector + keyword).
2. Lost in the Middle
When you retrieve many documents (10+), Claude sometimes misses important information in the middle and focuses on the beginning and end.
Fixes: Rerank documents by relevance before injecting, keep only the top 3-5, or use a more recent model (newer models handle long context better).
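Production rerankers are usually dedicated models (cross-encoders such as Cohere's Rerank or the open bge-reranker family), but the idea can be illustrated with a simple keyword-overlap score, a rough stand-in for a real reranker:

```python
import re

def terms(text):
    # Lowercased word set, punctuation stripped
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def rerank(query, docs, keep=3):
    # Score each doc by how many query terms it shares, keep the best
    q = terms(query)
    return sorted(docs, key=lambda d: len(q & terms(d)), reverse=True)[:keep]

docs = [
    "Billing is handled monthly via invoice.",
    "To reset your password, click 'Forgot Password' on the login page.",
    "Webhooks are configured under Settings.",
    "Password requirements: 12 characters minimum.",
]
print(rerank("how do I reset my password", docs, keep=2))
```

Keeping only the reranked top few documents both avoids the lost-in-the-middle effect and cuts token cost.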
3. Context Overflow
You retrieve so much context that you exceed the model's context window or make the request very expensive.
Fixes: Tighter chunk sizes, lower top-K, estimate context size before the API call, or pre-filter and summarize chunks with a cheaper model before the final call.
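Estimating context size before the call can be done with a rough rule of thumb (about 4 characters per token for English text, a common heuristic rather than an exact tokenizer) plus a budget-trimming helper:

```python
def estimate_tokens(text):
    # Very rough heuristic: ~4 characters per token for English text
    return max(1, len(text) // 4)

def fit_to_budget(docs, budget=2000):
    # Assumes docs are already sorted most-relevant first;
    # keep adding documents until the budget would be exceeded
    kept, used = [], 0
    for doc in docs:
        cost = estimate_tokens(doc)
        if used + cost > budget:
            break
        kept.append(doc)
        used += cost
    return kept

sorted_docs = ["a" * 400, "b" * 400, "c" * 400]  # each roughly 100 tokens
print(len(fit_to_budget(sorted_docs, budget=250)))
```

For exact counts you can use your provider's token-counting endpoint or tokenizer library instead of the heuristic.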
4. Conflicting Information
You retrieve documents with conflicting information. Claude has to choose which to trust, and it might not choose the right one.
Fixes: Include source metadata (document date, version), tell Claude which sources are authoritative, validate inconsistencies in your knowledge base.
Let's build a complete RAG system using Chroma (an open-source vector database) and the Anthropic API.
```python
#!/usr/bin/env python3
import anthropic
import chromadb

# Initialize clients
client = anthropic.Anthropic()
chroma_client = chromadb.Client()

# Create a collection (like a table)
collection = chroma_client.create_collection(name="company_docs")

# Add documents to the vector store
documents = [
    "Our API uses OAuth 2.0 for authentication. "
    "Request an access token with your client ID and secret.",
    "Rate limits: 100 requests per minute for free tier, "
    "1000 per minute for premium.",
    "To set up webhooks, go to Settings > Integrations. "
    "All webhook payloads include a signature for verification.",
]

# Chroma automatically embeds documents
collection.add(
    documents=documents,
    ids=["doc1", "doc2", "doc3"],
)

# User query
query = "How do I authenticate with your API?"

# Retrieve relevant documents
results = collection.query(
    query_texts=[query],
    n_results=2,  # Top 2 results
)

# Format retrieved docs
context = "\n".join(results["documents"][0])

# Build augmented prompt
system_prompt = """You are a helpful API documentation assistant.
Answer questions using the provided documentation.
If the docs don't address the question, say so."""

user_message = f"""<context>
{context}
</context>

Question: {query}"""

# Call Claude
response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=500,
    system=system_prompt,
    messages=[{"role": "user", "content": user_message}],
)

print("Answer:", response.content[0].text)
```
Install dependencies: pip install anthropic chromadb
Set your API key: export ANTHROPIC_API_KEY="..."
Run the script. It will retrieve relevant docs and generate an answer grounded in your knowledge base.
In production, replace Chroma with Pinecone or Weaviate for scalability. Add document metadata (source, date). Implement reranking to filter low-relevance results. Monitor retrieval quality with a test set.
1. What are the three main limitations of LLMs that RAG addresses?
2. What is the second step in the RAG pipeline?
3. What is an embedding?
4. What is "lost in the middle"?
RAG (Retrieval-Augmented Generation) is the most important pattern in production AI. It solves three critical LLM limitations: stale knowledge, hallucinations, and lack of private data access. The pipeline is simple: Query → Retrieve → Augment → Generate. Retrieval uses embeddings and vector search to find relevant documents. Chunking strategy affects retrieval quality significantly. Common pitfalls include irrelevant retrieval, lost-in-the-middle effects, context overflow, and conflicting information — all have concrete fixes.
Next up → Topic 8: Tool Use & Function Calling
Extend your LLM beyond text — let it call APIs, run code, and take action in the real world.