Give your AI agents persistent memory — from conversation buffers to vector stores and hierarchical memory systems.
Every time you call an LLM, it starts completely fresh. It has no memory of anything that happened before. This is a fundamental architectural constraint — and the biggest challenge when building agents that need to maintain continuity.
The core distinctions run along two axes: how long memory persists, and what kind of information it holds.

By duration:

- **Short-term**: the current conversation, i.e. the messages in the active context window. Lost when the session ends.
- **Long-term**: persisted across sessions in databases, files, or vector stores. Survives restarts.

By content:

- **Episodic**: specific events and experiences, like "the user reported a bug on Tuesday" or "we deployed v2.1 last week."
- **Semantic**: general knowledge and facts, like "the user prefers Python" or "the project uses PostgreSQL."
Think of an LLM without memory like a person with amnesia — brilliant in the moment, but can't remember yesterday's conversation. Every interaction, you have to re-introduce yourself and re-explain the context. Memory systems are how we cure that amnesia.
There are four core memory patterns (based on Lilian Weng's framework). Each makes a different tradeoff between completeness, token efficiency, and relevance.
- **Buffer memory**: store the full conversation history. Simple and complete, but grows unbounded and eventually overflows the context window.
- **Window memory**: keep only the last N turns. Caps token usage but loses older context; the agent forgets early parts of the conversation.
- **Summary memory**: use the LLM itself to summarize older messages. Preserves key information in fewer tokens, but lossy by nature.
- **Vector memory**: embed messages as vectors and retrieve only the relevant ones via similarity search. Scales to massive histories.
Buffer Memory — the simplest approach. Store everything:
```python
class BufferMemory:
    def __init__(self):
        self.messages = []

    def add(self, role: str, content: str):
        self.messages.append({"role": role, "content": content})

    def get_context(self) -> list[dict]:
        return self.messages.copy()
```
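Window memory is a small change on top of the buffer: keep only the tail. A minimal sketch (the class name and default size are illustrative):

```python
class WindowMemory:
    """Keep only the last N messages (a sliding window over the buffer)."""

    def __init__(self, max_messages: int = 10):  # 10 messages = 5 turns
        self.max_messages = max_messages
        self.messages = []

    def add(self, role: str, content: str):
        self.messages.append({"role": role, "content": content})
        # Drop the oldest messages once the window overflows
        self.messages = self.messages[-self.max_messages:]

    def get_context(self) -> list[dict]:
        return self.messages.copy()
```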
Summary Memory — compress older messages using the LLM:
```python
class SummaryMemory:
    def __init__(self, client):
        self.client = client
        self.summary = ""
        self.recent = []  # Last 3 turns

    def add(self, role: str, content: str):
        self.recent.append({"role": role, "content": content})
        if len(self.recent) > 6:  # 3 turns = 6 messages
            self._compress()

    def _compress(self):
        old = self.recent[:4]
        self.recent = self.recent[4:]
        # Ask LLM to update summary
        resp = self.client.messages.create(
            model="claude-haiku-4-5-20251001",
            max_tokens=300,
            messages=[{
                "role": "user",
                "content": f"Current summary:\n{self.summary}\n\n"
                           f"New messages:\n{old}\n\n"
                           "Update the summary concisely."
            }]
        )
        self.summary = resp.content[0].text
```
Making the most of a limited context window is critical. Here are the main strategies, from simplest to most sophisticated:
| Strategy | How It Works | Tradeoff |
|---|---|---|
| Truncation | Drop the oldest messages when the window fills up | Simple but loses early context completely |
| Sliding Window | Keep only the last N tokens of conversation | Predictable cost, but no long-term recall |
| Summarization | Compress history into a rolling summary | Retains key info, but lossy — details get dropped |
| RAG | Retrieve only relevant context from a vector store | Scales well, but retrieval quality is critical |
| Hierarchical | Combine summary + recent messages + retrieved context | Best quality, but most complex to implement |
A well-designed context window allocates a fixed share of tokens to each component. Here's one representative layout (the exact numbers depend on your model and workload):
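The sketch below expresses that layout as a token budget for a hypothetical 8K window, reusing the per-component numbers suggested in the lessons later in this section; the system-prompt figure is an assumption chosen so the input stays at 75% of the window:

```python
# Illustrative token budget for an 8K context window (assumed numbers)
CONTEXT_BUDGET = {
    "system_prompt":     2_500,  # instructions, persona, tool definitions (assumption)
    "summary":             500,  # rolling summary of older turns
    "retrieved_context": 1_000,  # vector-store hits relevant to the query
    "recent_messages":   2_000,  # verbatim recent turns
    "response_reserve":  2_000,  # room for the model's output (25% of the window)
}
assert sum(CONTEXT_BUDGET.values()) == 8_000  # input uses 6_000 tokens = 75%
```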
Never fill the entire context window — always leave room for the model's response. A good rule: use no more than 75% for input. If you stuff the window full, the model either truncates its response or produces degraded output.
Vector memory uses embeddings to convert text into numerical vectors, then retrieves relevant memories via similarity search. The flow is: text → embedding model → vector → store in database → query by similarity.
This approach mirrors how human memory works — you don't recall every conversation you've ever had. Instead, the current context triggers retrieval of relevant memories.
```python
import chromadb

client = chromadb.Client()
collection = client.create_collection("memories")

# Store a memory
collection.add(
    documents=["User prefers Python over JavaScript"],
    ids=["mem_001"],
    metadatas=[{"type": "preference", "date": "2025-01-15"}],
)

# Retrieve relevant memories
results = collection.query(
    query_texts=["What programming language should I recommend?"],
    n_results=3,
)

# Inject into prompt
memories = "\n".join(results["documents"][0])
system = f"User context:\n{memories}\n\nBe helpful."
```
A well-tuned vector store can search across thousands of past interactions and surface exactly the context the agent needs.
Production memory systems typically combine multiple memory types into a hierarchical architecture. This gives you the best of all worlds — recent context is complete, older context is summarized, and everything else is retrievable via semantic search.
```python
class HierarchicalMemory:
    """Combines buffer + summary + vector memory."""

    def __init__(self, client, vector_store):
        self.client = client
        self.vector_store = vector_store
        self.recent = []     # Last 5 turns (buffer)
        self.summary = ""    # Rolling summary
        self._msg_count = 0  # Monotonic ID counter (avoids collisions after compression)

    def add(self, role: str, content: str):
        self.recent.append({"role": role, "content": content})
        self._msg_count += 1
        # Also store in vector DB for long-term retrieval
        self.vector_store.add(
            documents=[content],
            ids=[f"msg_{self._msg_count}"],
            metadatas=[{"role": role, "type": "conversation"}],
        )
        if len(self.recent) > 10:
            self._compress()

    def get_context(self, current_query: str) -> str:
        # 1. Retrieve relevant past memories
        retrieved = self.vector_store.query(
            query_texts=[current_query],
            n_results=5,
        )
        relevant = "\n".join(retrieved["documents"][0])
        # 2. Assemble hierarchical context
        return (
            f"## Conversation Summary\n{self.summary}\n\n"
            f"## Relevant Past Context\n{relevant}\n\n"
            f"## Recent Messages\n{self.recent}"
        )

    def _compress(self):
        old = self.recent[:6]
        self.recent = self.recent[6:]
        resp = self.client.messages.create(
            model="claude-haiku-4-5-20251001",
            max_tokens=300,
            messages=[{
                "role": "user",
                "content": f"Current summary:\n{self.summary}\n\n"
                           f"New messages:\n{old}\n\n"
                           "Update the summary concisely."
            }]
        )
        self.summary = resp.content[0].text
```
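A usage sketch, assuming `anthropic_client` is an `anthropic.Anthropic()` instance and `collection` is the ChromaDB collection from the earlier example:

```python
memory = HierarchicalMemory(anthropic_client, collection)
memory.add("user", "We deployed v2.1 last week and saw a latency regression.")
memory.add("assistant", "Noted. I'll keep an eye on the latency dashboards.")

# Summary + retrieved context + recent turns, assembled into one string
print(memory.get_context("What happened after the last deploy?"))
```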
Entity memory: track facts about specific entities (users, projects, files) and update structured records over time.
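A minimal sketch of the idea; the class and method names are assumptions, not a specific library's API:

```python
from collections import defaultdict

class EntityMemory:
    """Structured records per entity, updated as new facts arrive."""

    def __init__(self):
        self.entities = defaultdict(dict)  # entity name -> {attribute: value}

    def update(self, entity: str, attribute: str, value: str):
        # Newer observations overwrite older values for the same attribute
        self.entities[entity][attribute] = value

    def get(self, entity: str) -> dict:
        return dict(self.entities[entity])

entities = EntityMemory()
entities.update("user", "preferred_language", "Python")
entities.update("project", "database", "PostgreSQL")
print(entities.get("user"))  # {'preferred_language': 'Python'}
```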
Reflection: the agent periodically reflects on its memories, synthesizing higher-level insights from raw experiences.
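One way to implement a reflection step, reusing this section's Anthropic client pattern (the prompt wording is illustrative):

```python
def reflect(client, raw_memories: list[str]) -> str:
    """Ask the LLM to distill higher-level insights from raw memories."""
    resp = client.messages.create(
        model="claude-haiku-4-5-20251001",
        max_tokens=300,
        messages=[{
            "role": "user",
            "content": "Recent observations:\n"
                       + "\n".join(f"- {m}" for m in raw_memories)
                       + "\n\nSynthesize 1-3 concise, higher-level insights.",
        }],
    )
    # The caller can store the result back as a new, higher-level memory
    return resp.content[0].text
```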
Forgetting: decay old or low-importance memories. Not everything is worth remembering; controlled forgetting prevents noise.
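A common approach is a recency-weighted importance score; the exponential-decay formula and threshold below are assumptions that illustrate the idea:

```python
import math
import time

def decay_score(importance: float, created_at: float,
                half_life_days: float = 7.0) -> float:
    """Recency-weighted importance that halves every `half_life_days`."""
    age_days = (time.time() - created_at) / 86_400
    return importance * math.exp(-math.log(2) * age_days / half_life_days)

def prune(memories: list[dict], threshold: float = 0.1) -> list[dict]:
    # Each memory: {"text": str, "importance": 0-1, "created_at": epoch seconds}
    return [m for m in memories
            if decay_score(m["importance"], m["created_at"]) >= threshold]
```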
Building memory systems is where theory meets production. These are the lessons that matter most:
Simple buffer memory works for short conversations. Don't over-engineer — add vector memory only when you actually need to handle long histories or cross-session recall.
Keep short-term, long-term, and working memory as distinct systems. Mixing them creates tangled code that's hard to debug and tune.
Allocate fixed token budgets for each memory component — e.g., 500 tokens for summary, 1000 for retrieved context, 2000 for recent messages.
Bad retrieval is worse than no memory — irrelevant context confuses the model. Test that your vector search actually returns useful results.
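A quick sanity check, reusing the ChromaDB collection from the vector memory example (the test cases are illustrative):

```python
# Hand-written (query, expected memory) pairs to spot-check retrieval
test_cases = [
    ("What programming language does the user like?",
     "User prefers Python over JavaScript"),
]

for query, expected in test_cases:
    hits = collection.query(query_texts=[query], n_results=3)["documents"][0]
    assert any(expected in hit for hit in hits), f"Retrieval miss for: {query!r}"
```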
1. Why can't LLMs remember previous conversations by default?
2. When would you choose summary memory over buffer memory?
3. What's the main advantage of vector memory?
Here's what you've learned:
LLMs are stateless — memory must be explicitly managed. The four core memory types are buffer (full history), window (last N turns), summary (compressed history), and vector (embedding-based retrieval). Production systems use hierarchical memory that combines buffer, summary, and vector memory. Always set token budgets for each memory component and leave at least 25% of the context window for the model's response.
Next up → Topic 18: Agent Frameworks
You'll learn about LangChain, LangGraph, CrewAI, and other frameworks that provide memory, tool use, and orchestration out of the box.