A specification for how persistent memory gets injected into LLM calls.
AMP defines the context assembly pipeline — the layer between memory retrieval and prompt construction. It standardizes how external memory enters the model's context, regardless of how that memory is stored or fetched.
LLMs have a context window — the tokens they can see in a single call. Memory systems extend this with a context space — the full persistent memory available for retrieval and injection. But there is no standard for how context moves from the space into the window.
Today, every memory system invents its own injection approach:
- Some append plain text to the last user message
- Some prepend to the system prompt
- Some use tool calls to let the model pull memory on demand
- Some use a multi-turn loop where the model iteratively queries memory
These approaches have different tradeoffs in latency, quality, and cost. AMP standardizes the interface so memory providers and LLM clients can interoperate.
┌─────────────────────────────────────────────────┐
│ LLM Provider │
│ (Anthropic, OpenAI, etc.) │
└──────────────────────┬──────────────────────────┘
│
┌────────▼────────┐
│ AMP Layer │ ← context assembly,
│ │ gating, budgeting,
│ (this spec) │ injection, provenance
└────────┬────────┘
│
┌────────▼────────┐
│ Memory Source │ ← MCP servers, REST APIs,
│ │ vector DBs, knowledge graphs,
│ (any protocol) │ local files, etc.
└─────────────────┘
AMP operates above data fetching protocols like MCP and below the LLM API call. It is transport-agnostic and storage-agnostic.
The number of tokens an LLM can process in a single call. This is a model constraint (e.g., 200K tokens for Claude, 128K for GPT-4).
The total persistent memory available for retrieval and injection. This is a system property, not a model property. A context space might contain millions of chunks spanning years of conversations, documents, and knowledge — far exceeding any model's context window.
The process of selecting, ranking, budgeting, and formatting memory from the context space for injection into the context window. This is what AMP standardizes.
How memory enters the prompt. AMP defines three standard strategies:
- Append — memory appended to the last user message as formatted text
- System — memory injected into the system prompt
- Tool — memory delivered via tool call results in a multi-turn loop
Each strategy has defined behavior, tradeoffs, and compatibility notes.
When to skip injection entirely. AMP defines a gating pipeline:
- Gate 1: Rule-based — skip for greetings, commands, trivial queries (0ms)
- Gate 2: Semantic routing — embedding similarity to determine if memory is relevant (0ms marginal)
- Gate 3: Post-retrieval — relevance threshold on retrieved results
How to fit context space content into the context window:
- Maximum injection budget (absolute tokens or percentage of window)
- Chunk prioritization (relevance score, recency, diversity)
- Truncation strategy (per-chunk limit, total limit)
How injected content is attributed:
- Source identifier (which memory system provided it)
- Retrieval metadata (score, timestamp, session origin)
- Injection metadata (strategy used, budget consumed)
The iterative approach where the model queries memory via tool calls:
- Tool schema for memory search
- Maximum round limit
- Termination conditions
- Fallback to single-shot injection
Preventing the current conversation from being injected as context:
- Session identification
- Recency filters
- Deduplication rules
AMP and MCP (Model Context Protocol) are complementary specifications that operate at different layers. They do not compete.
| MCP | AMP | |
|---|---|---|
| Purpose | How clients discover and fetch data from servers | How fetched data gets assembled and injected into LLM prompts |
| Scope | Server discovery, resources, tools, prompts, sampling | Context assembly, gating, budgeting, injection, provenance |
| Layer | Between LLM client and data/tool servers | Between memory retrieval and the LLM API call |
| Transport | JSON-RPC 2.0 over stdio/HTTP | Transport-agnostic (works with any retrieval mechanism) |
| Question answered | "How do I get memory data?" | "How does memory data enter the prompt?" |
MCP gets data to the client but does not specify what happens next. AMP picks up where MCP stops — defining how retrieved memory is selected, ranked, budgeted, formatted, and placed into the LLM request.
An AMP host can use MCP to fetch memory from servers, or it can use REST APIs, local vector databases, or any other retrieval mechanism. The two protocols compose naturally:
LLM Client → AMP Host → MCP Server → Vector DB
↓
LLM Provider
For the full technical comparison, see Appendix A of the specification.
See spec/ for the full specification.
Memoryport implements AMP across its proxy, MCP server, and hosted API.
AMP is in early development. The specification is not yet stable.
This specification is licensed under Apache-2.0.