Give your AI agents persistent memory — from conversation buffers to vector stores and hierarchical memory systems.
Every time you call an LLM, it starts completely fresh. It has no memory of anything that happened before. This is a fundamental architectural constraint — and the biggest challenge when building agents that need to maintain continuity.
The core distinctions run along two axes: how long memory persists, and what kind of information it holds.

By duration:

- **Short-term**: the current conversation, i.e. the messages in the active context window. Lost when the session ends.
- **Long-term**: persisted across sessions in databases, files, or vector stores. Survives restarts.

By content:

- **Episodic**: specific events and experiences, like "the user reported a bug on Tuesday" or "we deployed v2.1 last week."
- **Semantic**: general knowledge and facts, like "the user prefers Python" or "the project uses PostgreSQL."
Think of an LLM without memory like a person with amnesia — brilliant in the moment, but can't remember yesterday's conversation. Every interaction, you have to re-introduce yourself and re-explain the context. Memory systems are how we cure that amnesia.
There are four core memory patterns (based on Lilian Weng's framework). Each makes a different tradeoff between completeness, token efficiency, and relevance.
- **Buffer memory**: store the full conversation history. Simple and complete, but grows unbounded and eventually overflows the context window.
- **Window memory**: keep only the last N turns. Caps token usage but loses older context; the agent forgets early parts of the conversation.
- **Summary memory**: use the LLM itself to summarize older messages. Preserves key information in fewer tokens, but lossy by nature.
- **Vector memory**: embed messages as vectors and retrieve only the relevant ones via similarity search. Scales to massive histories.
Buffer Memory — the simplest approach. Store everything:
```python
class BufferMemory:
    def __init__(self):
        self.messages = []

    def add(self, role: str, content: str):
        self.messages.append({"role": role, "content": content})

    def get_context(self) -> list[dict]:
        return self.messages.copy()
```
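Window memory is a small change on top of the buffer: keep only the tail. A minimal sketch (the class name and default size are illustrative):

```python
class WindowMemory:
    """Keep only the last N messages (a sliding window over the buffer)."""

    def __init__(self, max_messages: int = 10):  # 10 messages = 5 turns
        self.max_messages = max_messages
        self.messages = []

    def add(self, role: str, content: str):
        self.messages.append({"role": role, "content": content})
        # Drop the oldest messages once the window overflows
        self.messages = self.messages[-self.max_messages:]

    def get_context(self) -> list[dict]:
        return self.messages.copy()
```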
Summary Memory — compress older messages using the LLM:
```python
class SummaryMemory:
    def __init__(self, client):
        self.client = client
        self.summary = ""
        self.recent = []  # Last 3 turns

    def add(self, role: str, content: str):
        self.recent.append({"role": role, "content": content})
        if len(self.recent) > 6:  # 3 turns = 6 messages
            self._compress()

    def _compress(self):
        old = self.recent[:4]
        self.recent = self.recent[4:]
        # Ask LLM to update summary
        resp = self.client.messages.create(
            model="claude-haiku-4-5-20251001",
            max_tokens=300,
            messages=[{
                "role": "user",
                "content": f"Current summary:\n{self.summary}\n\n"
                           f"New messages:\n{old}\n\n"
                           "Update the summary concisely."
            }]
        )
        self.summary = resp.content[0].text
```
Making the most of a limited context window is critical. Here are the main strategies, from simplest to most sophisticated:
| Strategy | How It Works | Tradeoff |
|---|---|---|
| Truncation | Drop the oldest messages when the window fills up | Simple but loses early context completely |
| Sliding Window | Keep only the last N tokens of conversation | Predictable cost, but no long-term recall |
| Summarization | Compress history into a rolling summary | Retains key info, but lossy — details get dropped |
| RAG | Retrieve only relevant context from a vector store | Scales well, but retrieval quality is critical |
| Hierarchical | Combine summary + recent messages + retrieved context | Best quality, but most complex to implement |
A well-designed context window allocates a fixed share of tokens to each component. Here's one representative layout (the exact numbers depend on your model and workload):
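The sketch below expresses that layout as a token budget for a hypothetical 8K window, reusing the per-component numbers suggested in the lessons later in this section; the system-prompt figure is an assumption chosen so the input stays at 75% of the window:

```python
# Illustrative token budget for an 8K context window (assumed numbers)
CONTEXT_BUDGET = {
    "system_prompt":     2_500,  # instructions, persona, tool definitions (assumption)
    "summary":             500,  # rolling summary of older turns
    "retrieved_context": 1_000,  # vector-store hits relevant to the query
    "recent_messages":   2_000,  # verbatim recent turns
    "response_reserve":  2_000,  # room for the model's output (25% of the window)
}
assert sum(CONTEXT_BUDGET.values()) == 8_000  # input uses 6_000 tokens = 75%
```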
Never fill the entire context window — always leave room for the model's response. A good rule: use no more than 75% for input. If you stuff the window full, the model either truncates its response or produces degraded output.
Vector memory uses embeddings to convert text into numerical vectors, then retrieves relevant memories via similarity search. The flow is: text → embedding model → vector → store in database → query by similarity.
This approach mirrors how human memory works — you don't recall every conversation you've ever had. Instead, the current context triggers retrieval of relevant memories.
```python
import chromadb

client = chromadb.Client()
collection = client.create_collection("memories")

# Store a memory
collection.add(
    documents=["User prefers Python over JavaScript"],
    ids=["mem_001"],
    metadatas=[{"type": "preference", "date": "2025-01-15"}],
)

# Retrieve relevant memories
results = collection.query(
    query_texts=["What programming language should I recommend?"],
    n_results=3,
)

# Inject into prompt
memories = "\n".join(results["documents"][0])
system = f"User context:\n{memories}\n\nBe helpful."
```
A well-tuned vector store can search across thousands of past interactions and surface exactly the context the agent needs.
Production memory systems typically combine multiple memory types into a hierarchical architecture. This gives you the best of all worlds — recent context is complete, older context is summarized, and everything else is retrievable via semantic search.
```python
class HierarchicalMemory:
    """Combines buffer + summary + vector memory."""

    def __init__(self, client, vector_store):
        self.client = client
        self.vector_store = vector_store
        self.recent = []     # Last 5 turns (buffer)
        self.summary = ""    # Rolling summary
        self._msg_count = 0  # Monotonic ID counter (avoids collisions after compression)

    def add(self, role: str, content: str):
        self.recent.append({"role": role, "content": content})
        self._msg_count += 1
        # Also store in vector DB for long-term retrieval
        self.vector_store.add(
            documents=[content],
            ids=[f"msg_{self._msg_count}"],
            metadatas=[{"role": role, "type": "conversation"}],
        )
        if len(self.recent) > 10:
            self._compress()

    def get_context(self, current_query: str) -> str:
        # 1. Retrieve relevant past memories
        retrieved = self.vector_store.query(
            query_texts=[current_query],
            n_results=5,
        )
        relevant = "\n".join(retrieved["documents"][0])
        # 2. Assemble hierarchical context
        return (
            f"## Conversation Summary\n{self.summary}\n\n"
            f"## Relevant Past Context\n{relevant}\n\n"
            f"## Recent Messages\n{self.recent}"
        )

    def _compress(self):
        old = self.recent[:6]
        self.recent = self.recent[6:]
        resp = self.client.messages.create(
            model="claude-haiku-4-5-20251001",
            max_tokens=300,
            messages=[{
                "role": "user",
                "content": f"Current summary:\n{self.summary}\n\n"
                           f"New messages:\n{old}\n\n"
                           "Update the summary concisely."
            }]
        )
        self.summary = resp.content[0].text
```
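A usage sketch, assuming `anthropic_client` is an `anthropic.Anthropic()` instance and `collection` is the ChromaDB collection from the earlier example:

```python
memory = HierarchicalMemory(anthropic_client, collection)
memory.add("user", "We deployed v2.1 last week and saw a latency regression.")
memory.add("assistant", "Noted. I'll keep an eye on the latency dashboards.")

# Summary + retrieved context + recent turns, assembled into one string
print(memory.get_context("What happened after the last deploy?"))
```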
Entity memory: track facts about specific entities (users, projects, files) and update structured records over time.
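A minimal sketch of the idea; the class and method names are assumptions, not a specific library's API:

```python
from collections import defaultdict

class EntityMemory:
    """Structured records per entity, updated as new facts arrive."""

    def __init__(self):
        self.entities = defaultdict(dict)  # entity name -> {attribute: value}

    def update(self, entity: str, attribute: str, value: str):
        # Newer observations overwrite older values for the same attribute
        self.entities[entity][attribute] = value

    def get(self, entity: str) -> dict:
        return dict(self.entities[entity])

entities = EntityMemory()
entities.update("user", "preferred_language", "Python")
entities.update("project", "database", "PostgreSQL")
print(entities.get("user"))  # {'preferred_language': 'Python'}
```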
Reflection: the agent periodically reflects on its memories, synthesizing higher-level insights from raw experiences.
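One way to implement a reflection step, reusing this section's Anthropic client pattern (the prompt wording is illustrative):

```python
def reflect(client, raw_memories: list[str]) -> str:
    """Ask the LLM to distill higher-level insights from raw memories."""
    resp = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=300,
        messages=[{
            "role": "user",
            "content": "Recent observations:\n"
                       + "\n".join(f"- {m}" for m in raw_memories)
                       + "\n\nSynthesize 1-3 concise, higher-level insights.",
        }],
    )
    # The caller can store the result back as a new, higher-level memory
    return resp.content[0].text
```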
Forgetting: decay old or low-importance memories. Not everything is worth remembering; controlled forgetting prevents noise.
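A common approach is a recency-weighted importance score; the exponential-decay formula and threshold below are assumptions that illustrate the idea:

```python
import math
import time

def decay_score(importance: float, created_at: float,
                half_life_days: float = 7.0) -> float:
    """Recency-weighted importance that halves every `half_life_days`."""
    age_days = (time.time() - created_at) / 86_400
    return importance * math.exp(-math.log(2) * age_days / half_life_days)

def prune(memories: list[dict], threshold: float = 0.1) -> list[dict]:
    # Each memory: {"text": str, "importance": 0-1, "created_at": epoch seconds}
    return [m for m in memories
            if decay_score(m["importance"], m["created_at"]) >= threshold]
```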
Building memory systems is where theory meets production. These are the lessons that matter most:
Simple buffer memory works for short conversations. Don't over-engineer — add vector memory only when you actually need to handle long histories or cross-session recall.
Keep short-term, long-term, and working memory as distinct systems. Mixing them creates tangled code that's hard to debug and tune.
Allocate fixed token budgets for each memory component — e.g., 500 tokens for summary, 1000 for retrieved context, 2000 for recent messages.
Bad retrieval is worse than no memory — irrelevant context confuses the model. Test that your vector search actually returns useful results.
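A quick sanity check, reusing the ChromaDB collection from the vector memory example (the test cases are illustrative):

```python
# Hand-written (query, expected memory) pairs to spot-check retrieval
test_cases = [
    ("What programming language does the user like?",
     "User prefers Python over JavaScript"),
]

for query, expected in test_cases:
    hits = collection.query(query_texts=[query], n_results=3)["documents"][0]
    assert any(expected in hit for hit in hits), f"Retrieval miss for: {query!r}"
```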
1. Why can't LLMs remember previous conversations by default?
2. When would you choose summary memory over buffer memory?
3. What's the main advantage of vector memory?
Here's what you've learned:
LLMs are stateless — memory must be explicitly managed. The four core memory types are buffer (full history), window (last N turns), summary (compressed history), and vector (embedding-based retrieval). Production systems use hierarchical memory that combines buffer, summary, and vector memory. Always set token budgets for each memory component and leave at least 25% of the context window for the model's response.
Next up → Topic 18: Agent Frameworks
You'll learn about LangChain, LangGraph, CrewAI, and other frameworks that provide memory, tool use, and orchestration out of the box.