One of the most common frustrations with AI assistants is that they forget. You explain your project at the start of a conversation, and by message twenty they have lost the thread. Close the chat, and tomorrow it is as if you never spoke.
This is not inevitable. Well-designed AI agents use memory systems that persist context across sessions, retrieve relevant information on demand, and accumulate knowledge over time. Understanding how these systems work helps you build better agents and choose tools that fit your needs.
The Four Types of AI Agent Memory
AI agent memory is commonly divided into four categories, each covering a different scope of information.
- In-context memory: The content currently in the model's context window: everything said so far in the conversation, any attached documents, and the results of tool calls made so far. It is fast to access but limited in size. When the context window fills up, older content has to be truncated or summarized.
- External memory: Information stored outside the model in a database, usually a vector store. The agent retrieves relevant pieces when needed using semantic search, which is how agents access large knowledge bases without loading everything into context at once. External memory is practically unlimited in size but requires a retrieval step.
- Episodic memory: A log of past interactions, completed tasks, and observed outcomes. When an agent encounters a situation similar to one it has handled before, it can retrieve that past episode to inform its approach. This is used in more sophisticated agent systems to learn from experience.
- Procedural memory: Stored instructions, workflows, and skills that the agent knows how to execute. It is less about facts and more about processes, and is often implemented as a library of prompts or tool definitions the agent can invoke.
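Episodic memory is the least familiar of the four, so here is a minimal Python sketch of the idea: it logs each task with its actions and outcome, then recalls the most similar past episode when a new task arrives. As a simplifying assumption, similarity is scored by word overlap (Jaccard similarity) rather than by the embeddings a real system would more likely use, and the class and method names are hypothetical.

```python
from __future__ import annotations

import re
from dataclasses import dataclass, field


def tokenize(text: str) -> set[str]:
    """Lowercased word set: a crude stand-in for an embedding model."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


@dataclass
class Episode:
    task: str           # what the agent was asked to do
    actions: list[str]  # the steps it took
    outcome: str        # what happened


@dataclass
class EpisodicMemory:
    episodes: list[Episode] = field(default_factory=list)

    def record(self, task: str, actions: list[str], outcome: str) -> None:
        self.episodes.append(Episode(task, actions, outcome))

    def recall(self, new_task: str, min_overlap: float = 0.2) -> Episode | None:
        """Return the past episode whose task is most similar to new_task,
        scored by Jaccard overlap of word sets, or None if none clears min_overlap."""
        query = tokenize(new_task)
        best, best_score = None, 0.0
        for episode in self.episodes:
            words = tokenize(episode.task)
            union = query | words
            score = len(query & words) / len(union) if union else 0.0
            if score > best_score:
                best, best_score = episode, score
        return best if best_score >= min_overlap else None


memory = EpisodicMemory()
memory.record(
    "migrate the billing database to postgres",
    ["dump schema", "run migration in staging", "verify row counts"],
    "succeeded after fixing a timezone column",
)
memory.record("write onboarding email copy", ["draft", "review with team"], "approved")

past = memory.recall("migrate the user database to postgres")
print(past.outcome if past else "no similar episode")
```

The `min_overlap` threshold matters: without it, the agent would always recall *something*, even for unrelated tasks, and could be misled by irrelevant past experience.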
How Context Windows Actually Work
Every language model has a context window: the maximum amount of text it can process in a single inference call. This includes the system prompt, the conversation history, any documents or tool outputs, and the response being generated.
Modern models have much larger context windows than their predecessors. Claude 3.5 Sonnet supports 200,000 tokens, Gemini 1.5 Pro supports 1 million tokens, and GPT-4o supports 128,000 tokens. At roughly 0.75 English words per token, 100,000 tokens holds about 75,000 words, the length of a short novel.
Large context windows reduce but do not eliminate the memory problem. They are expensive at scale (you pay per token processed), they slow down inference, and models do not always reliably recall specific details buried in very long contexts. Selective memory retrieval is usually more efficient than always loading the full context.
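To make the budget concrete, the sketch below keeps a system prompt plus as many recent messages as fit, after reserving room for the response, and drops the oldest messages first. It assumes a rough heuristic of about four characters per token for English text; a real agent should count tokens with its model's own tokenizer, and might summarize dropped messages instead of discarding them. The function names are illustrative.

```python
from __future__ import annotations


def estimate_tokens(text: str) -> int:
    """Rough heuristic: about 4 characters per token for English text.
    Real systems should count with the model's own tokenizer."""
    return max(1, len(text) // 4)


def fit_to_window(system_prompt: str, history: list[str], context_window: int,
                  reserved_for_response: int) -> list[str]:
    """Keep the system prompt plus as many of the most recent messages as fit,
    leaving room for the response. Oldest messages are dropped first."""
    budget = context_window - reserved_for_response - estimate_tokens(system_prompt)
    if budget < 0:
        raise ValueError("system prompt and response reservation exceed the window")
    kept: list[str] = []
    for message in reversed(history):  # walk from newest to oldest
        cost = estimate_tokens(message)
        if cost > budget:
            break
        kept.append(message)
        budget -= cost
    return [system_prompt] + list(reversed(kept))  # restore chronological order


history = [f"message {i}: " + "x" * 400 for i in range(50)]  # ~100 tokens each
window = fit_to_window("You are a helpful agent.", history,
                       context_window=2_000, reserved_for_response=500)
print(len(window) - 1, "of", len(history), "messages kept")
```

Stopping at the first message that does not fit, rather than skipping it and continuing, keeps the retained history contiguous so the model never sees a conversation with holes in the middle.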
Vector Memory and Retrieval
Vector databases are the most common external memory implementation for AI agents. Here is how they work: an embedding model converts text into a numerical vector (a list of numbers representing its meaning), and the database stores these vectors, often millions of them. When the agent needs to retrieve relevant information, the query is converted to a vector in the same way, and the database returns the stored items whose vectors are closest to it, commonly measured by cosine similarity or dot product.
This is what makes semantic search work. Searching for "how do I cancel my subscription" retrieves documents about account management and cancellation even if they use different vocabulary. The search matches on meaning, not keywords.
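A minimal sketch of the retrieval step, using brute-force cosine similarity over a handful of vectors. The three-dimensional vectors are hand-made for illustration (a real system gets them from an embedding model, typically with hundreds or thousands of dimensions), and `TinyVectorStore` is a hypothetical class, not any particular database's API. The top result shares no keywords with the query text it stands in for.

```python
from __future__ import annotations

import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class TinyVectorStore:
    """In-memory store with brute-force nearest-neighbour search."""

    def __init__(self) -> None:
        self.items: list[tuple[list[float], str]] = []

    def add(self, vector: list[float], text: str) -> None:
        self.items.append((vector, text))

    def search(self, query_vector: list[float], k: int = 1) -> list[str]:
        """Return the k stored texts whose vectors are most similar to the query."""
        scored = [(cosine_similarity(query_vector, vec), text) for vec, text in self.items]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [text for _, text in scored[:k]]


# Hand-made 3-dimensional "embeddings" (hypothetical axes: account/billing,
# ending a service, shipping). A real embedding model would produce these.
store = TinyVectorStore()
store.add([0.9, 0.8, 0.0], "Closing your account and ending your plan")
store.add([0.9, 0.1, 0.0], "Updating your payment card")
store.add([0.0, 0.1, 0.9], "Tracking a delayed parcel")

query = [0.7, 0.9, 0.0]  # pretend embedding of "how do I cancel my subscription"
print(store.search(query, k=1))
```

Production vector databases typically replace this linear scan with an approximate nearest-neighbour index, such as HNSW, so search stays fast across millions of vectors.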
Popular vector databases include Pinecone, Weaviate, Chroma, and Qdrant. LangChain and LlamaIndex both have extensive support for connecting agents to these stores.
Persistent Memory Across Sessions
For an agent to remember information across separate conversations, that information has to be written to storage when a session ends and retrieved when a new session starts.
This requires decisions about what to save. Saving everything is expensive and creates noise. Saving nothing means every session starts from scratch. Most practical implementations use a summarization approach: at the end of a session, the agent generates a structured summary of key facts, preferences, decisions, and outcomes, and stores that summary. Future sessions load the summary as part of the system prompt.
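The summarize-and-reload loop might look like this sketch, which stores the summary as a JSON file and appends it to the system prompt of the next session. It assumes the summary has already been produced (in practice, by prompting the model at the end of the session) and organizes it into the four kinds of item named above; the file layout and function names are hypothetical.

```python
from __future__ import annotations

import json
import tempfile
from pathlib import Path

SECTIONS = ("facts", "preferences", "decisions", "outcomes")


def end_session(summary: dict[str, list[str]], path: Path) -> None:
    """Merge this session's summary into the stored one, skipping duplicates.
    In a real agent, `summary` would come from asking the model to summarize
    the conversation; here it is passed in directly."""
    stored = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    for section in SECTIONS:
        existing = stored.setdefault(section, [])
        for item in summary.get(section, []):
            if item not in existing:
                existing.append(item)
    path.write_text(json.dumps(stored, indent=2), encoding="utf-8")


def start_session(base_prompt: str, path: Path) -> str:
    """Build the system prompt for a new session, appending remembered notes."""
    if not path.exists():
        return base_prompt  # first session: nothing remembered yet
    stored = json.loads(path.read_text(encoding="utf-8"))
    lines = [base_prompt, "", "Notes from previous sessions:"]
    for section in SECTIONS:
        for item in stored.get(section, []):
            lines.append(f"- ({section}) {item}")
    return "\n".join(lines)


with tempfile.TemporaryDirectory() as tmp:
    memory_file = Path(tmp) / "user_memory.json"
    end_session({"preferences": ["Prefers answers in bullet points"],
                 "decisions": ["Project uses PostgreSQL"]}, memory_file)
    print(start_session("You are a project assistant.", memory_file))
```

Merging and de-duplicating, rather than overwriting, lets knowledge accumulate across sessions; a longer-lived agent would also need a way to prune or re-summarize the file as it grows.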
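Consumer AI tools implement versions of this. ChatGPT's memory feature stores facts about the user, either when asked to remember something or when it picks up a detail worth keeping, and makes them available in later conversations. Claude Projects takes a more manual approach: instructions and documents attached to a project are available in every conversation within it.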
Memory in Multi-Agent Systems
When multiple agents collaborate, shared memory becomes a coordination mechanism. Agents can write findings to a shared store that other agents read from, without needing to pass everything through a single context window.
This is particularly useful in long-horizon tasks where one agent completes stage one and writes its findings to memory, and a different agent picks up stage two, possibly hours later. The shared memory acts as the project's working document.
Getting shared memory right is one of the harder engineering problems in multi-agent systems. Writes need to be structured enough that agents can reliably retrieve what they need without reading everything. Conflicts between what different agents have written need to be resolved. And the memory needs to be durable enough to survive individual agent failures.
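These three requirements can be sketched with SQLite from Python's standard library: entries are keyed by task and key so agents fetch only what they need, a version number turns conflicting writes into detectable failures (optimistic concurrency), and a file-backed database survives an individual agent crashing. This is one possible design under those assumptions, not a standard multi-agent API; all names are hypothetical, and detecting a conflict still leaves the merge policy to the caller.

```python
import sqlite3


class SharedMemory:
    """Shared store that several agents read and write. Entries are keyed by
    (task, key), and each carries a version number used to detect conflicts."""

    def __init__(self, db_path: str = ":memory:") -> None:
        # Pass a file path instead of ":memory:" so entries outlive any one agent process.
        self.conn = sqlite3.connect(db_path)
        with self.conn:
            self.conn.execute(
                """CREATE TABLE IF NOT EXISTS memory (
                       task    TEXT NOT NULL,
                       key     TEXT NOT NULL,
                       value   TEXT NOT NULL,
                       author  TEXT NOT NULL,
                       version INTEGER NOT NULL,
                       PRIMARY KEY (task, key))"""
            )

    def read(self, task: str, key: str):
        """Return (value, author, version), or None if nothing is stored."""
        return self.conn.execute(
            "SELECT value, author, version FROM memory WHERE task = ? AND key = ?",
            (task, key),
        ).fetchone()

    def write(self, task: str, key: str, value: str, author: str,
              expected_version: int = 0) -> bool:
        """Succeed only if the stored version still equals expected_version
        (0 means the entry must not exist yet). Returns False on conflict so
        the caller can re-read, merge, and retry."""
        with self.conn:  # commits on success, rolls back on error
            if expected_version == 0:
                cur = self.conn.execute(
                    "INSERT OR IGNORE INTO memory VALUES (?, ?, ?, ?, 1)",
                    (task, key, value, author),
                )
            else:
                cur = self.conn.execute(
                    "UPDATE memory SET value = ?, author = ?, version = version + 1 "
                    "WHERE task = ? AND key = ? AND version = ?",
                    (value, author, task, key, expected_version),
                )
        return cur.rowcount == 1


memory = SharedMemory()
memory.write("market-report", "stage1/findings", "draft findings text", author="researcher")

# A later agent reads the entry and updates it, passing the version it saw.
value, author, version = memory.read("market-report", "stage1/findings")
ok = memory.write("market-report", "stage1/findings", value + " (reviewed)",
                  author="reviewer", expected_version=version)
# A third agent still holding the old version is rejected instead of clobbering.
stale = memory.write("market-report", "stage1/findings", "conflicting edit",
                     author="writer", expected_version=version)
print(ok, stale)
```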
What This Means Practically
If you are choosing an AI agent tool for a use case that involves ongoing work, ask specifically about memory. How does the agent remember context from previous sessions? Can it be given background about your company, team, or workflow that persists? What happens to its context over a long task?
If you are building an agent system, design the memory architecture before anything else. Retrofitting memory into an existing agent is harder than building with it in mind from the start. Decide what needs to be in context at all times, what should be retrieved on demand, and what should persist across sessions, before you write the first tool integration.
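One way to act on this advice is to write the memory plan down as data before writing any tools, then check that it fits. The sketch below assigns each kind of information to a tier and computes how much of the context window is left for conversation history in the worst case. The item names, token budgets, and 16,000-token window are hypothetical placeholders.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    ALWAYS_IN_CONTEXT = "always in context"
    RETRIEVE_ON_DEMAND = "retrieved on demand"
    PERSIST_ACROSS_SESSIONS = "persisted across sessions"


@dataclass(frozen=True)
class MemoryItem:
    name: str
    tier: Tier
    max_tokens: int  # most tokens this item may take up when loaded into context


# Hypothetical plan for a support agent; names and budgets are illustrative only.
PLAN = [
    MemoryItem("system instructions", Tier.ALWAYS_IN_CONTEXT, 1_500),
    MemoryItem("current task state", Tier.ALWAYS_IN_CONTEXT, 2_000),
    MemoryItem("help-center articles", Tier.RETRIEVE_ON_DEMAND, 4_000),
    MemoryItem("user preference summary", Tier.PERSIST_ACROSS_SESSIONS, 500),
]


def history_headroom(plan: list[MemoryItem], context_window: int,
                     reserved_for_response: int) -> int:
    """Tokens left for conversation history in the worst case, where every
    tier is loaded at once: fixed context, a full retrieval result, and the
    persisted summary injected at session start."""
    worst_case = sum(item.max_tokens for item in plan)
    headroom = context_window - reserved_for_response - worst_case
    if headroom <= 0:
        raise ValueError(f"plan exceeds the context window by {-headroom} tokens")
    return headroom


for tier in Tier:
    names = [item.name for item in PLAN if item.tier is tier]
    print(f"{tier.value}: {', '.join(names)}")
print("history headroom:", history_headroom(PLAN, 16_000, 2_000))
```

Failing loudly when the plan does not fit surfaces the problem at design time, before a long task quietly starts truncating context in production.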
