Implementation note: the executable schema in
agent_data/sql/001_agent_data_schema.sqlfollows this design with production-required PostgreSQL adjustments such as composite primary keys on partitioned tables, asystem_promptsregistry for prompt-text recovery, helper-driven monthly history partition creation, multimodalcontent_partsretention, and lossless upstream API-call plus stream-event capture for testing.
This document specifies a complete PostgreSQL-only memory management module for agentic AI systems. The module manages three logical data planes per user_id — Sessions, Memory Store, and Conversation History — as partitioned, indexed, tiered PostgreSQL tables with strict provenance, deduplication, expiry, token-budget-aware retrieval, and deterministic context construction compatible with the OpenAI Python SDK (openai.chat.completions.create) and vLLM's OpenAI-compatible serving endpoint.
| Objective | Specification |
|---|---|
| Compatibility | Output a valid List[ChatCompletionMessageParam] directly from PostgreSQL retrieval |
| Memory Tiers | Working (L0), Session (L1), Episodic (L2), Semantic (L3), Procedural (L4) |
| Scope | Local (session-bound) and Global (user-bound, cross-session) |
| Durability | UNLOGGED for L0 (ephemeral); WAL-backed for L1–L4 |
| Write Amplification | Append-only conversation history; upsert-only memory; UNLOGGED working state |
| Read Efficiency | Hash-partitioned by user_id; composite B-tree + HNSW vector indexes |
| Retrieval | Hybrid: exact match + semantic (pgvector) + metadata filter + freshness + provenance scoring |
| Token Governance | Every stored item carries token_count; retrieval is budget-bounded via greedy knapsack |
| Provenance | Every memory item records originating session_id, turn_id, creation method, and confidence |
| Deduplication | Content-hash (SHA-256) + semantic near-duplicate detection ( |
| Compaction | Sliding-window summarization of old conversation turns; archival of expired memories |
| Extension | Purpose |
|---|---|
pgvector |
Embedding storage, HNSW/IVFFlat ANN indexes for semantic search |
pg_cron |
Scheduled compaction, expiry sweeps, promotion evaluations |
pg_stat_statements |
Query-level performance monitoring and optimization |
pg_trgm |
Trigram-based fuzzy text matching for exact/partial recall |
The context construction pipeline must output a structure directly consumable by:
openai.chat.completions.create(
model = <model_id>,
messages = List[ChatCompletionMessageParam],
tools = List[ChatCompletionToolParam], # optional
temperature = float,
max_tokens = int
)
Each message in the output list conforms to:
| Field | Type | Source Table |
|---|---|---|
role |
"system" | "user" | "assistant" | "tool" |
conversation_history.role or synthesized from memory_store |
content |
str | None |
conversation_history.content or compiled memory summary |
name |
str | None |
conversation_history.name |
tool_calls |
List[ToolCall] | None |
conversation_history.tool_calls (JSONB) |
tool_call_id |
str | None |
conversation_history.tool_call_id |
vLLM compatibility is identical (OpenAI-compatible API surface). The module is model-agnostic; context_window_size is parameterized per model and stored in the session record.
All three tables share user_id as the top-level hash-partition key, ensuring that all data for a single user is co-located within the same partition set across all three tables. This guarantees
Tracks the lifecycle, configuration, and summary state of every conversation session.
| Column | Type | Constraints | Purpose |
|---|---|---|---|
session_id |
UUID |
PK |
Globally unique session identifier |
user_id |
UUID |
NOT NULL, partition key |
Owner; hash-partition target |
created_at |
TIMESTAMPTZ |
NOT NULL DEFAULT now() |
Session creation timestamp |
updated_at |
TIMESTAMPTZ |
NOT NULL DEFAULT now() |
Last activity timestamp (trigger-maintained) |
expires_at |
TIMESTAMPTZ |
NOT NULL |
TTL expiry; after this, session is archivable |
status |
SMALLINT |
NOT NULL DEFAULT 1 |
Enum: 1=active, 2=paused, 3=closed, 4=expired |
model_id |
TEXT |
NOT NULL |
Model identifier (e.g., gpt-4o, vllm/llama-3) |
context_window_size |
INTEGER |
NOT NULL |
Max tokens for this model's context window |
system_prompt_hash |
BYTEA |
NOT NULL |
SHA-256 of the system prompt used |
system_prompt_tokens |
INTEGER |
NOT NULL |
Token count of system prompt |
temperature |
REAL |
NOT NULL DEFAULT 0.7 |
Sampling temperature |
top_p |
REAL |
NOT NULL DEFAULT 1.0 |
Nucleus sampling parameter |
max_response_tokens |
INTEGER |
NOT NULL DEFAULT 4096 |
Reserved response token budget |
turn_count |
INTEGER |
NOT NULL DEFAULT 0 |
Total turns in this session |
total_prompt_tokens |
BIGINT |
NOT NULL DEFAULT 0 |
Cumulative prompt tokens consumed |
total_completion_tokens |
BIGINT |
NOT NULL DEFAULT 0 |
Cumulative completion tokens consumed |
total_cost_microcents |
BIGINT |
NOT NULL DEFAULT 0 |
Cumulative cost in micro-cents |
summary_text |
TEXT |
Nullable | Compressed natural-language session summary |
summary_embedding |
vector(1536) |
Nullable | Embedding of summary_text for cross-session semantic search |
summary_token_count |
INTEGER |
DEFAULT 0 |
Token count of summary_text |
parent_session_id |
UUID |
Nullable, FK → sessions |
For session continuation / branching |
metadata |
JSONB |
NOT NULL DEFAULT '{}' |
Extensible: tags, source, client info, A/B variant |
Partitioning: PARTITION BY HASH (user_id) with
Design Rationale:
system_prompt_hashenables deduplication of identical system prompts across sessions without storing the full text repeatedly.summary_embeddingenables semantic search over past sessions (e.g., "find the session where I discussed X").total_cost_microcentsenables per-user cost tracking and budget enforcement at the session level.context_window_sizeandmax_response_tokensare stored per-session because different models have different limits, and a user may switch models mid-conversation history.
The canonical tiered memory table. Stores all validated memories across all tiers and scopes.
| Column | Type | Constraints | Purpose |
|---|---|---|---|
memory_id |
UUID |
PK |
Globally unique memory identifier |
user_id |
UUID |
NOT NULL, partition key |
Owner; hash-partition target |
session_id |
UUID |
Nullable, FK → sessions
|
Originating session (NULL for global/cross-session memories) |
memory_tier |
SMALLINT |
NOT NULL |
Enum: 0=working, 1=session, 2=episodic, 3=semantic, 4=procedural |
memory_scope |
SMALLINT |
NOT NULL |
Enum: 0=local (session-bound), 1=global (user-bound) |
content_text |
TEXT |
NOT NULL |
Natural-language memory content |
content_embedding |
vector(1536) |
NOT NULL |
Embedding of content_text (model-specific dimensionality) |
content_hash |
BYTEA |
NOT NULL |
SHA-256 of normalized content_text for deduplication |
provenance_type |
SMALLINT |
NOT NULL |
Enum: 0=user_stated, 1=system_inferred, 2=correction, 3=preference, 4=fact, 5=instruction, 6=tool_output |
provenance_turn_id |
UUID |
Nullable, FK → conversation_history
|
Exact turn that generated this memory |
confidence |
REAL |
NOT NULL DEFAULT 0.5 |
Confidence score |
relevance_accumulator |
REAL |
NOT NULL DEFAULT 0.0 |
Running sum of relevance scores at retrieval time |
access_count |
INTEGER |
NOT NULL DEFAULT 0 |
Number of times retrieved into context |
last_accessed_at |
TIMESTAMPTZ |
Nullable | Timestamp of most recent retrieval |
created_at |
TIMESTAMPTZ |
NOT NULL DEFAULT now() |
Creation timestamp |
updated_at |
TIMESTAMPTZ |
NOT NULL DEFAULT now() |
Last modification timestamp |
expires_at |
TIMESTAMPTZ |
Nullable | TTL; NULL = never expires |
is_active |
BOOLEAN |
NOT NULL DEFAULT true |
Soft-delete flag |
is_validated |
BOOLEAN |
NOT NULL DEFAULT false |
Whether memory has been validated (human or system) |
token_count |
INTEGER |
NOT NULL |
Token count of content_text (for budget calculations) |
superseded_by |
UUID |
Nullable, FK → memory_store
|
If this memory was corrected/updated, points to replacement |
tags |
TEXT[] |
NOT NULL DEFAULT '{}' |
Categorical labels for filtered retrieval |
metadata |
JSONB |
NOT NULL DEFAULT '{}' |
Extensible: extraction method, model version, etc. |
Partitioning: PARTITION BY HASH (user_id) with
Working Memory (L0) Variant: For tier 0 (working memory), a separate UNLOGGED table memory_store_working with identical schema is used. UNLOGGED tables bypass WAL, reducing write amplification to near zero for ephemeral state. On crash recovery, L0 data is lost — acceptable by definition since working memory is reconstructible from session context.
Design Rationale:
-
content_hashenables$O(1)$ exact deduplication before insert. -
content_embeddingwith HNSW index enables sub-20ms approximate nearest-neighbor retrieval. -
relevance_accumulatorandaccess_counttogether drive the promotion function (§5). -
superseded_bycreates a linked list of memory versions, enabling auditability without deleting history. -
provenance_turn_idprovides full traceability from memory back to the exact conversation turn. -
token_countis pre-computed at write time to avoid tokenization at read time — critical for budget calculations. -
memory_tierandmemory_scopeasSMALLINT(not TEXT enum) for storage efficiency and index performance.
Stores every message in every session in OpenAI-compatible format. This is the highest-volume table and is designed for append-only write patterns to minimize MVCC overhead.
| Column | Type | Constraints | Purpose |
|---|---|---|---|
message_id |
UUID |
PK |
Globally unique message identifier |
session_id |
UUID |
NOT NULL, FK → sessions |
Parent session |
user_id |
UUID |
NOT NULL, partition key |
Owner; hash-partition target |
turn_index |
INTEGER |
NOT NULL |
Monotonically increasing within session; ordering key |
role |
SMALLINT |
NOT NULL |
Enum: 0=system, 1=user, 2=assistant, 3=tool, 4=function |
content |
TEXT |
Nullable | Message content (NULL for tool_calls-only assistant messages) |
name |
TEXT |
Nullable | Function/tool name for role=tool/function |
tool_calls |
JSONB |
Nullable | OpenAI tool_calls structure (for assistant messages) |
tool_call_id |
TEXT |
Nullable | Tool call ID (for role=tool response messages) |
content_embedding |
vector(1536) |
Nullable | Embedding (computed async; for semantic search over history) |
content_hash |
BYTEA |
Nullable | SHA-256 of content (for dedup and integrity) |
token_count |
INTEGER |
NOT NULL |
Pre-computed token count of this message |
model_id |
TEXT |
Nullable | Model that generated this response (for assistant messages) |
finish_reason |
SMALLINT |
Nullable | Enum: 0=stop, 1=length, 2=tool_calls, 3=content_filter |
prompt_tokens |
INTEGER |
Nullable | Prompt tokens for this completion call |
completion_tokens |
INTEGER |
Nullable | Completion tokens for this response |
latency_ms |
INTEGER |
Nullable | End-to-end latency of the API call |
created_at |
TIMESTAMPTZ |
NOT NULL DEFAULT now() |
Message creation timestamp |
is_compressed |
BOOLEAN |
NOT NULL DEFAULT false |
Whether this message has been compacted into a summary |
compressed_summary_id |
UUID |
Nullable | Points to the summary message that replaced this turn |
is_pruned |
BOOLEAN |
NOT NULL DEFAULT false |
Whether this message has been pruned from active context |
metadata |
JSONB |
NOT NULL DEFAULT '{}' |
Extensible: safety flags, feedback, annotations |
Partitioning: Composite — PARTITION BY HASH (user_id) with RANGE (created_at) on monthly boundaries.
Append-Only Invariant: This table receives INSERT-only operations during normal conversation flow. The only UPDATE operations are:
- Setting
is_compressed = trueandcompressed_summary_idduring compaction (batch, async). - Setting
is_pruned = trueduring archival (batch, async). - Backfilling
content_embedding(async worker).
This invariant is critical: it reduces MVCC dead tuple generation to near zero under normal operation, directly addressing the primary write amplification concern identified in the OpenAI case study.
Design Rationale:
turn_indexis a session-local monotonic counter, not a global sequence, avoiding sequence contention.token_countis pre-computed using the appropriate tokenizer (e.g.,tiktokenfor OpenAI models, model-specific tokenizer for vLLM) at write time.content_embeddingis nullable and computed asynchronously by a background worker to avoid blocking the write path.- Separation of
prompt_tokensandcompletion_tokensenables per-message cost attribution. latency_msenables SLO monitoring and tail-latency analysis per user/session.
All three tables use user_id as the hash-partition key. This guarantees:
where
Properties:
-
Uniform distribution: UUID v4 provides near-perfect hash entropy; partition skew
$< 0.1%$ . - Co-location: All data for a single user maps to a deterministic partition index across all three tables. A single-user context construction query touches exactly one partition per table.
- Parallel vacuum: Each partition can be vacuumed independently, enabling parallel maintenance without global locks.
conversation_history is additionally sub-partitioned by created_at on monthly boundaries:
Benefits:
- Enables efficient time-range pruning:
DROP PARTITIONfor months beyond retention window (zero-cost deletion vs. row-by-rowDELETE). - Keeps active partitions small: recent months have high access frequency; old months are cold and can be moved to cheaper tablespaces.
- Compaction targets old sub-partitions exclusively.
For 800M users:
| Table | Partitions | Rows/Partition (estimate) | Access Pattern |
|---|---|---|---|
sessions |
256 | ~31K users × ~10 sessions = ~310K rows | Point lookup by (user_id, session_id) |
memory_store |
256 | ~31K users × ~50 memories = ~1.5M rows | Semantic search + filter within user |
conversation_history |
512 × 12 months = 6144 | ~15.6K users × ~500 msgs × 1 month = ~7.8M rows | Range scan by (user_id, session_id, turn_index) |
At these partition sizes, B-tree index depth is 3–4 levels, and HNSW graph traversal is
| Index Name | Type | Columns | Purpose |
|---|---|---|---|
idx_sessions_pk |
B-tree (PK) | (session_id) |
Primary key lookup |
idx_sessions_user_active |
B-tree, Partial | (user_id, updated_at DESC) WHERE status = 1 |
Fetch active sessions for a user, most recent first |
idx_sessions_user_all |
B-tree | (user_id, created_at DESC) |
List all sessions for a user |
idx_sessions_summary_vec |
HNSW | (summary_embedding) WHERE summary_embedding IS NOT NULL |
Semantic search over session summaries |
idx_sessions_expiry |
B-tree | (expires_at) WHERE status != 4 |
Expiry sweep (pg_cron) |
| Index Name | Type | Columns | Purpose |
|---|---|---|---|
idx_memory_pk |
B-tree (PK) | (memory_id) |
Primary key lookup |
idx_memory_user_tier |
B-tree | (user_id, memory_tier, is_active) WHERE is_active = true |
Tier-filtered retrieval per user |
idx_memory_user_scope |
B-tree | (user_id, memory_scope, memory_tier) WHERE is_active = true |
Scope-filtered retrieval |
idx_memory_content_hash |
B-tree | (user_id, content_hash) |
Deduplication check at write time |
idx_memory_embedding |
HNSW | (content_embedding) WHERE is_active = true |
Semantic nearest-neighbor search |
idx_memory_expiry |
B-tree | (expires_at) WHERE expires_at IS NOT NULL AND is_active = true |
Expiry sweep |
idx_memory_tags |
GIN | (tags) WHERE is_active = true |
Tag-based filtered retrieval |
idx_memory_superseded |
B-tree | (superseded_by) WHERE superseded_by IS NOT NULL |
Memory version chain traversal |
| Index Name | Type | Columns | Purpose |
|---|---|---|---|
idx_history_pk |
B-tree (PK) | (message_id) |
Primary key lookup |
idx_history_session_turn |
B-tree | (user_id, session_id, turn_index DESC) |
Critical: Fetch last N turns for context construction |
idx_history_session_active |
B-tree, Partial | (user_id, session_id, turn_index DESC) WHERE is_compressed = false AND is_pruned = false |
Active (non-compacted) messages only |
idx_history_embedding |
HNSW | (content_embedding) WHERE content_embedding IS NOT NULL |
Semantic search over conversation history |
idx_history_compaction |
B-tree | (user_id, session_id, created_at) WHERE is_compressed = false |
Compaction candidate selection |
B-tree Indexes: Each B-tree on a partition of
HNSW Indexes: For
- Build time:
$O(N \cdot \log N)$ - Query time:
$O(\log N)$ with recall$> 0.98$ at$ef_{search} = 100$ - Memory:
$\approx 1.2 \times N \times d \times 4$ bytes where$d = 1536$
Write Amplification from Indexes: Each INSERT into conversation_history updates 2–3 B-tree indexes + optionally 1 HNSW index (async). Since HNSW updates are deferred to the async embedding worker, the synchronous write path touches only B-tree indexes:
This is dramatically lower than the
| Tier | Name | Scope | TTL | Validation Required | WAL | Write Pattern |
|---|---|---|---|---|---|---|
| L0 | Working | Local | Session duration | No | UNLOGGED | High-frequency, ephemeral |
| L1 | Session | Local | Session TTL + 24h | No | Yes | Insert on extraction |
| L2 | Episodic | Global | 90 days (configurable) | Yes | Yes | Promoted from L1 |
| L3 | Semantic | Global | Never (manual expiry) | Yes | Yes | Promoted from L2 or direct user statement |
| L4 | Procedural | Global | Never (manual expiry) | Yes | Yes | Behavioral patterns, preferences |
A memory
where:
-
$\bar{r}(m) = \frac{R_{acc}(m)}{a(m)}$ is the mean relevance score across all retrievals (fromrelevance_accumulator / access_count) - $a(m) = $
access_count -
$\Delta t(m) = t_{now} - t_{created}$ in hours -
$\lambda_k$ is the tier-specific temporal discount factor -
$\theta_k$ is the promotion threshold for tier$k$
Tier-Specific Parameters:
| Transition | Minimum Access Count | ||
|---|---|---|---|
| L1 → L2 | 0.01 | 0.3 | 3 |
| L2 → L3 | 0.001 | 0.5 | 10 |
| L2 → L4 | 0.001 | 0.5 | 10 (+ behavioral pattern classification) |
A memory
where is_active = false) and archived to cold storage via pg_cron sweep.
At write time, before inserting a new memory candidate
-
Exact duplicate check: Query
idx_memory_content_hashfor(user_id, SHA256(normalize(content_text))). If match found → reject insert, incrementaccess_countof existing. -
Near-duplicate check: If no exact match, query HNSW index for top-1 nearest neighbor
$m_{nn}$ within the same user:
where superseded_by if replacing.
- No duplicate: Insert with
is_validated = false,memory_tier = 1(session),confidence = 0.5.
When two memories
Resolution policy (ordered by priority):
- Recency: Prefer the more recently created memory.
- Provenance authority:
user_stated > correction > system_inferred. - Confidence: Prefer higher
confidencescore. - Explicit supersession: Set
superseded_byon the deprecated memory.
The prefill compiler assembles a token-budget-bounded messages list from the three tables. This is the critical read path — it must execute within a strict latency budget of
Given model context window
This available budget
The allocation ratio is determined by task type:
| Task Type | Rationale | ||
|---|---|---|---|
| Continuation (default) | 0.15 | 0.85 | Prioritize recent conversation context |
| Knowledge-intensive | 0.40 | 0.60 | Prioritize retrieved memories |
| New session with history | 0.50 | 0.50 | Balanced: user context + fresh start |
| Tool-heavy | 0.10 | 0.70 | Reserve 20% for tool results in working memory |
For a user query
where:
-
$\Delta t_i = t_{now} - m_i.\text{last_accessed_at}$ (orcreated_atif never accessed) -
$a_i = m_i.\text{access_count}$ ,$a_{max} = \max_j a_j$ across user's memories $c_i = m_i.\text{confidence}$ -
$\pi(p_i)$ = provenance authority weight:
| Provenance Type | |
|---|---|
user_stated |
1.0 |
correction |
0.95 |
instruction |
0.90 |
preference |
0.85 |
fact |
0.80 |
system_inferred |
0.60 |
tool_output |
0.70 |
Default weights:
Given
Practical solution: Since
- Compute efficiency
$\eta_i = s_i / w_i$ for each candidate. - Sort candidates by
$\eta_i$ descending. - Greedily select candidates until budget is exhausted.
- Among conversation history items, enforce temporal ordering constraint: selected history messages must form a contiguous suffix of the conversation (most recent turns first).
The temporal ordering constraint transforms the history portion into a simpler problem: find the largest
where messages are indexed turn_index ascending. This is solvable in
PROCEDURE CompileContext(user_id, session_id, query_text, model_config):
// Phase 1: Budget Calculation
C ← model_config.context_window_size
S ← session.system_prompt_tokens
R ← session.max_response_tokens
T ← EstimateToolTokens(active_tools)
B ← C - S - R - T
task_type ← ClassifyTaskType(query_text, session.metadata)
(B_mem, B_hist) ← AllocateBudget(B, task_type)
// Phase 2: Conversation History Retrieval
// Single indexed query on (user_id, session_id, turn_index DESC)
// WHERE is_compressed = false AND is_pruned = false
history_messages ← FetchActiveHistory(user_id, session_id)
// Backward scan: select contiguous suffix within B_hist
selected_history ← []
tokens_used ← 0
FOR msg IN history_messages ORDER BY turn_index DESC:
IF tokens_used + msg.token_count > B_hist:
BREAK
selected_history.PREPEND(msg)
tokens_used ← tokens_used + msg.token_count
B_remaining ← B - tokens_used
// Phase 3: Memory Retrieval (Hybrid)
query_embedding ← ComputeEmbedding(query_text)
// Sub-query A: Semantic search (HNSW)
semantic_candidates ← VectorSearch(
user_id, query_embedding,
top_k=50,
filter=(is_active=true AND memory_tier >= 1)
)
// Sub-query B: Exact/keyword match (pg_trgm)
keyword_candidates ← KeywordSearch(
user_id, query_text, limit=20
)
// Sub-query C: High-access global memories (B-tree scan)
frequent_candidates ← FetchTopMemories(
user_id,
ORDER BY access_count DESC,
filter=(memory_scope=global AND is_active=true),
limit=20
)
// Merge, deduplicate, score
all_candidates ← UNION(semantic_candidates, keyword_candidates, frequent_candidates)
all_candidates ← DeduplicateByMemoryId(all_candidates)
FOR EACH candidate IN all_candidates:
candidate.score ← ComputeRelevanceScore(query_embedding, candidate)
// Greedy knapsack on memory budget
all_candidates.SORT_BY(score / token_count, DESC)
selected_memories ← []
mem_tokens ← 0
FOR candidate IN all_candidates:
IF mem_tokens + candidate.token_count > min(B_mem, B_remaining):
BREAK
selected_memories.APPEND(candidate)
mem_tokens ← mem_tokens + candidate.token_count
// Phase 4: Assemble Messages List
messages ← []
// 4a: System prompt (from session or default)
messages.APPEND({role: "system", content: LoadSystemPrompt(session)})
// 4b: Memory injection (as system context block)
IF selected_memories IS NOT EMPTY:
memory_block ← FormatMemoryBlock(selected_memories)
messages.APPEND({role: "system", content: memory_block})
// 4c: Session summary (if history was truncated)
IF selected_history[0].turn_index > 1:
summary ← session.summary_text
IF summary IS NOT NULL:
messages.APPEND({role: "system", content: "[Session Summary]: " + summary})
// 4d: Conversation history
FOR msg IN selected_history:
messages.APPEND(FormatAsOpenAIMessage(msg))
// Phase 5: Side Effects (async)
ASYNC UpdateMemoryAccessStats(selected_memories)
ASYNC UpdateSessionTokenUsage(session_id, tokens_used + mem_tokens)
RETURN messages
The returned messages list is directly passable to:
openai.chat.completions.create(model=model_id, messages=messages)
or equivalently to a vLLM server endpoint. No transformation layer is required.
PROCEDURE PersistConversationTurn(user_id, session_id, user_message, assistant_response, api_usage):
// Step 1: Insert user message (append-only)
next_turn ← session.turn_count + 1
INSERT INTO conversation_history (
message_id=UUID(), session_id, user_id,
turn_index=next_turn, role=USER,
content=user_message.content,
token_count=CountTokens(user_message.content, session.model_id),
created_at=now()
)
// Step 2: Insert assistant response (append-only)
INSERT INTO conversation_history (
message_id=UUID(), session_id, user_id,
turn_index=next_turn+1, role=ASSISTANT,
content=assistant_response.content,
tool_calls=assistant_response.tool_calls,
token_count=CountTokens(assistant_response.content, session.model_id),
model_id=session.model_id,
finish_reason=assistant_response.finish_reason,
prompt_tokens=api_usage.prompt_tokens,
completion_tokens=api_usage.completion_tokens,
latency_ms=api_usage.latency_ms,
created_at=now()
)
// Step 3: Update session metadata (single-row UPDATE)
UPDATE sessions SET
turn_count = turn_count + 2,
total_prompt_tokens = total_prompt_tokens + api_usage.prompt_tokens,
total_completion_tokens = total_completion_tokens + api_usage.completion_tokens,
total_cost_microcents = total_cost_microcents + ComputeCost(api_usage),
updated_at = now()
WHERE session_id = session_id AND user_id = user_id
// Step 4: Async memory extraction
ENQUEUE MemoryExtractionJob(user_id, session_id, user_message, assistant_response)
// Step 5: Async embedding computation
ENQUEUE EmbeddingComputeJob(user_id, [user_msg_id, assistant_msg_id])
PROCEDURE ExtractAndIngestMemories(user_id, session_id, user_msg, assistant_msg):
// Step 1: Extract memory candidates using LLM
candidates ← LLMExtract(
prompt="Extract factual statements, user preferences, corrections, and instructions from this exchange.",
messages=[user_msg, assistant_msg]
)
FOR EACH candidate IN candidates:
// Step 2: Normalize and hash
normalized ← NormalizeText(candidate.content)
hash ← SHA256(normalized)
embedding ← ComputeEmbedding(normalized)
tokens ← CountTokens(normalized, DEFAULT_MODEL)
// Step 3: Deduplication check
exact_match ← QUERY memory_store WHERE user_id=user_id AND content_hash=hash AND is_active=true
IF exact_match EXISTS:
UPDATE memory_store SET
access_count = access_count + 1,
updated_at = now()
WHERE memory_id = exact_match.memory_id
CONTINUE
near_match ← QUERY memory_store
WHERE user_id=user_id AND is_active=true
ORDER BY content_embedding <=> embedding
LIMIT 1
IF near_match EXISTS AND cosine_similarity(embedding, near_match.embedding) > 0.92:
// Near-duplicate: merge
IF candidate IS MORE SPECIFIC THAN near_match:
UPDATE memory_store SET
superseded_by = NULL, // this becomes the canonical version
content_text = normalized,
content_embedding = embedding,
content_hash = hash,
token_count = tokens,
access_count = access_count + 1,
updated_at = now()
WHERE memory_id = near_match.memory_id
ELSE:
UPDATE memory_store SET access_count = access_count + 1
WHERE memory_id = near_match.memory_id
CONTINUE
// Step 4: Insert new memory
INSERT INTO memory_store (
memory_id=UUID(), user_id, session_id,
memory_tier=SESSION, // L1
memory_scope=LOCAL,
content_text=normalized,
content_embedding=embedding,
content_hash=hash,
provenance_type=candidate.provenance_type,
provenance_turn_id=user_msg.message_id,
confidence=0.5,
token_count=tokens,
is_validated=false,
is_active=true,
expires_at=now() + INTERVAL '24 hours', // L1 TTL
created_at=now()
)
PROCEDURE CompactSessionHistory(user_id, session_id, target_token_budget):
// Step 1: Fetch all active (uncompressed) messages
active_messages ← QUERY conversation_history
WHERE user_id=user_id AND session_id=session_id
AND is_compressed=false AND is_pruned=false
ORDER BY turn_index ASC
total_tokens ← SUM(active_messages.token_count)
IF total_tokens <= target_token_budget:
RETURN // No compaction needed
// Step 2: Determine compaction window
// Keep the most recent K turns intact; compact everything before
tokens_to_remove ← total_tokens - target_token_budget
compaction_window ← []
removed_tokens ← 0
FOR msg IN active_messages ORDER BY turn_index ASC:
IF removed_tokens >= tokens_to_remove:
BREAK
compaction_window.APPEND(msg)
removed_tokens ← removed_tokens + msg.token_count
// Step 3: Summarize compaction window
summary_text ← LLMSummarize(
prompt="Summarize this conversation segment preserving key facts, decisions, and context.",
messages=compaction_window
)
summary_tokens ← CountTokens(summary_text, session.model_id)
summary_embedding ← ComputeEmbedding(summary_text)
// Step 4: Persist summary as a synthetic message
summary_msg_id ← UUID()
INSERT INTO conversation_history (
message_id=summary_msg_id, session_id, user_id,
turn_index=-1, // Sentinel: summary always precedes active history
role=SYSTEM,
content="[Conversation Summary]: " + summary_text,
token_count=summary_tokens,
content_embedding=summary_embedding,
is_compressed=false,
created_at=now()
)
// Step 5: Mark compacted messages
UPDATE conversation_history SET
is_compressed = true,
compressed_summary_id = summary_msg_id
WHERE message_id IN (compaction_window.message_ids)
// Step 6: Update session summary
UPDATE sessions SET
summary_text = summary_text,
summary_embedding = summary_embedding,
summary_token_count = summary_tokens,
updated_at = now()
WHERE session_id = session_id AND user_id = user_id
// Compression ratio achieved
ρ ← removed_tokens / summary_tokens
LOG("Compaction ratio: " + ρ) // Target ρ ≥ 5
Case Study (General MVCC):
For an UPDATE on a row of size
Effective amplification factor:
Our Scheme:
| Operation | Amplification Factor | Justification |
|---|---|---|
| History INSERT (append-only) |
|
|
| Memory INSERT (new) | 4 B-trees + deferred HNSW | |
| Memory UPDATE (access_count) | MVCC tuple copy + index updates | |
| Working memory (UNLOGGED) | No WAL overhead | |
| Session UPDATE | Single-row MVCC; infrequent |
Weighted average (assuming 70% history inserts, 15% memory inserts, 10% memory updates, 5% session updates):
Improvement:
For a single-user context construction query:
| Component | Latency | Method |
|---|---|---|
| Hash mod (computed in query planner) | ||
| B-tree index scan, last |
||
| HNSW ANN search, top-50 | ||
| pg_trgm trigram search | ||
| In-memory scoring of |
||
| Message list construction |
The three retrieval sub-queries execute in parallel (using async connection pooling or dblink/parallel query):
p99 target:
Comparison with case study: The case study reports "low double-digit millisecond p99 client-side latency" for general queries. Our context construction achieves comparable latency while performing semantic search, scoring, and budget-bounded assembly — a strictly more complex operation.
Per-user storage estimate (amortized over steady-state usage):
| Table | Rows/User | Avg Row Size | Storage/User |
|---|---|---|---|
sessions |
50 | 2 KB (including summary embedding) | 100 KB |
memory_store |
200 | 7 KB (including 1536-dim embedding) | 1.4 MB |
conversation_history |
5,000 | 3 KB (content + optional embedding) | 15 MB |
Total per user:
For 800M users:
Embedding storage dominance: Embeddings at
Define utilization efficiency as:
The first term measures quality of selection; the second measures budget fill rate.
With greedy knapsack on efficiency-sorted candidates, empirically:
This means
WAL volume per second:
where
Our scheme's WAL volume reduction vs. case study baseline:
| Source | Reduction | Mechanism |
|---|---|---|
| Working memory | UNLOGGED table | |
| History (append-only vs. update) | No dead tuples in WAL | |
| Lazy memory writes | Batched, debounced updates | |
| Net WAL volume reduction |
Replication lag:
Reducing
PROCEDURE PromotionSweep():
FOR EACH tier_transition IN [(L1→L2, θ=0.3, min_access=3),
(L2→L3, θ=0.5, min_access=10)]:
candidates ← QUERY memory_store
WHERE memory_tier = source_tier
AND is_active = true
AND access_count >= min_access
AND UtilityScore(relevance_accumulator, access_count, created_at) > θ
FOR EACH candidate IN candidates:
UPDATE memory_store SET
memory_tier = target_tier,
memory_scope = GLOBAL, // promoted memories become global
is_validated = true,
expires_at = ComputeNewExpiry(target_tier),
updated_at = now()
WHERE memory_id = candidate.memory_id
PROCEDURE ExpirySweep():
// Soft-delete expired memories
UPDATE memory_store SET is_active = false, updated_at = now()
WHERE expires_at < now() AND is_active = true
// Expire inactive sessions
UPDATE sessions SET status = EXPIRED, updated_at = now()
WHERE expires_at < now() AND status != EXPIRED
// Archive old compressed history (older than retention window)
// Move to cold tablespace or partition-drop
FOR EACH partition IN conversation_history_partitions:
IF partition.max_created_at < now() - RETENTION_WINDOW:
IF AllMessagesCompressedOrPruned(partition):
DetachAndArchivePartition(partition)
PROCEDURE CompactionSweep():
active_sessions ← QUERY sessions WHERE status = ACTIVE
FOR EACH session IN active_sessions:
active_token_count ← QUERY SUM(token_count) FROM conversation_history
WHERE session_id = session.session_id
AND is_compressed = false AND is_pruned = false
compaction_threshold ← session.context_window_size * 2 // Compact when 2x context window
IF active_token_count > compaction_threshold:
CompactSessionHistory(
session.user_id,
session.session_id,
target_token_budget = session.context_window_size
)
| Metric | Case Study (Reported) | Our Scheme (Projected) | Improvement Factor |
|---|---|---|---|
| p99 Read Latency | "Low double-digit ms" (~15ms) |
|
Equivalent capability at higher complexity |
| Write Amplification |
|
|
|
| WAL Volume | Baseline | Enables |
|
| Replication Lag | "Near zero" | Lower (proportional to WAL reduction) | Directly improved |
|
Context Quality ( |
N/A (no memory system) | New capability | |
| Dead Tuple Rate | High (update-heavy MVCC) | Near zero (append-only dominant) |
|
| Vacuum Pressure | High (requires aggressive tuning) | Minimal (append-only + UNLOGGED) |
|
| Connection Efficiency | PgBouncer (50ms → 5ms) | PgBouncer + parallel sub-queries | Equivalent + parallel retrieval |
| Single-User Query | Generic OLTP | Partition-localized, 3-table join-free |
|
| Memory Deduplication | N/A | SHA-256 exact + |
New capability |
Read scalability (horizontal, via replicas):
With our WAL reduction, the primary can sustain
For the case study's ~50 replicas, this enables
Write scalability (vertical, single primary):
Since our write amplification is
This directly addresses the case study's identified weakness: "PostgreSQL's MVCC implementation makes it less efficient for write-heavy workloads."
| Failure Mode | Case Study Mitigation | Our Additional Mitigation |
|---|---|---|
| Cache miss storm | Cache locking | Per-user partition isolation limits blast radius; memory tier L0 (UNLOGGED) absorbs working state without WAL |
| Write spike (new feature) | Migrate to CosmosDB | Append-only history + UNLOGGED L0 reduces write cost; lazy memory writes debounce spikes |
| Expensive query | Query optimization, ORM review | No multi-table joins in critical path; all queries are single-table, partition-localized |
| Connection exhaustion | PgBouncer | PgBouncer + per-tier connection pools (history reads, memory reads, writes isolated) |
| Replica lag spike | HA standby, cascading replication | 35–45% WAL reduction + partition-level vacuum parallelism |
| Primary failure | HA failover | Same + working memory on UNLOGGED (reconstructible; no WAL dependency) |
[Application Layer]
│
├── Write Pool (PgBouncer, transaction mode)
│ └── Primary Instance
│
├── History Read Pool (PgBouncer, statement mode)
│ └── Regional Read Replicas (history-optimized)
│
├── Memory Read Pool (PgBouncer, statement mode)
│ └── Regional Read Replicas (pgvector HNSW-optimized)
│
└── Background Pool (PgBouncer, transaction mode)
└── Dedicated Replica (compaction, promotion, embedding)
Workload isolation: History reads, memory reads (semantic search), and writes are routed to separate connection pools backed by separate replica groups. This prevents semantic search latency variance from affecting history retrieval, and vice versa — directly implementing the "noisy neighbor" mitigation from the case study at a finer granularity.
| Parameter | Value | Rationale |
|---|---|---|
shared_buffers |
25% of RAM | Standard; hot partition pages cached |
effective_cache_size |
75% of RAM | Planner hint for index-vs-seqscan decisions |
work_mem |
64 MB | HNSW search working memory |
maintenance_work_mem |
2 GB | HNSW index build, vacuum |
max_wal_senders |
75 | Support 50+ replicas + cascading |
wal_level |
replica |
Required for streaming replication |
hot_standby_feedback |
on |
Prevent vacuum from removing tuples needed by replicas |
idle_in_transaction_session_timeout |
30s |
Prevent long-running idle transactions from blocking vacuum |
statement_timeout |
5s |
Fail fast on runaway queries |
hnsw.ef_search |
100 | Recall > 0.98 at acceptable latency |
autovacuum_naptime |
30s |
Aggressive for memory_store (has updates) |
autovacuum_vacuum_cost_delay |
2ms |
Aggressive vacuum pacing |
| Signal | Source | Purpose |
|---|---|---|
| Query latency histogram | pg_stat_statements |
Detect query regressions per digest |
| Partition-level I/O |
pg_stat_user_tables per partition |
Detect hot partitions |
| Replication lag | pg_stat_replication |
Alert if lag > 100ms |
| Dead tuple ratio | pg_stat_user_tables.n_dead_tup / n_live_tup |
Trigger manual vacuum if > 5% |
| Connection pool saturation | PgBouncer metrics | Alert if pool utilization > 80% |
| HNSW recall | Application-level sampling | Verify retrieval quality |
|
Context utilization ( |
Application telemetry | Measure context packing quality |
| Memory promotion rate | Custom metric from promotion sweep | Monitor memory tier health |
|
Compaction ratio ( |
Compaction sweep logs | Ensure |
| Token cost per conversation turn | Application + sessions.total_cost_microcents
|
Cost optimization tracking |
┌────────────────────────┐
│ sessions │
│ (PK: session_id) │
│ (Partition: user_id) │
│ │
│ ● Lifecycle state │
│ ● Model configuration │
│ ● Token accounting │
│ ● Session summary + │
│ embedding │
└──────────┬─────────────┘
│ 1:N
▼
┌────────────────────────┐ ┌────────────────────────┐
│ conversation_history │ │ memory_store │
│ (PK: message_id) │ │ (PK: memory_id) │
│ (Partition: user_id) │ │ (Partition: user_id) │
│ (Sub-part: created_at) │ │ │
│ │◄────────│ ● provenance_turn_id │
│ ● OpenAI-compatible │ │ ● Tiered (L0–L4) │
│ message format │ │ ● Scoped (local/global)│
│ ● Append-only │ │ ● Embedding + hash │
│ ● Token counts │ │ ● Provenance + conf. │
│ ● Compression state │ │ ● Promotion/expiry │
│ ● Async embeddings │ │ ● Deduplication chain │
└────────────────────────┘ └────────────────────────┘
Relationship constraints:
sessions.user_id→ partition key (no FK enforcement across partitions; application-enforced)conversation_history.session_id→sessions.session_id(logical FK; enforced by application for partition-crossing safety)memory_store.session_id→sessions.session_id(nullable; NULL for global memories)memory_store.provenance_turn_id→conversation_history.message_id(nullable; traceability link)memory_store.superseded_by→memory_store.memory_id(self-referential; version chain)
Foreign key constraints are not enforced at the database level across hash-partitioned tables (PostgreSQL limitation). Referential integrity is maintained via application-level validation and the deduplication protocol.
| Invariant | Enforcement Mechanism |
|---|---|
| Append-only history | Application-layer write guard; no UPDATE/DELETE in normal path; only compaction sweep mutates is_compressed
|
| Token count accuracy | Computed at write time using model-specific tokenizer; stored, never recomputed at read time |
| Deduplication before insert | Write path always checks content_hash index + HNSW top-1 before INSERT |
| Budget-bounded retrieval | Prefill compiler enforces |
| Tier promotion requires validation | L2+ memories require is_validated = true; promotion sweep enforces this |
| Working memory is ephemeral | L0 stored in UNLOGGED table; crash-safe recovery does not depend on L0 |
| No multi-table joins in critical path | All retrieval queries are single-table, partition-localized; assembly in application layer |
| Embedding computation is async | Never blocks the synchronous write path; background worker backfills via queue |
| Session cost accounting is monotonic |
total_*_tokens and total_cost_microcents only increment; never reset |
This module achieves superior performance over the case study baseline through five structural advantages:
-
Write amplification reduction (
$3.1\times$ ): Append-only conversation history, UNLOGGED working memory, and deferred embedding computation eliminate the dominant MVCC overhead that the case study identifies as PostgreSQL's primary weakness. -
Zero multi-table joins: The case study identifies "an extremely costly query that joined 12 tables" as a recurring SEV cause. Our schema requires zero joins in the critical path — all queries are single-table, partition-localized, index-assisted.
-
WAL volume reduction (
$35\text{–}45%$ ): Directly enables more read replicas before primary network saturation, extending the scalability runway the case study is actively pursuing with cascading replication. -
Semantic retrieval within PostgreSQL: pgvector HNSW indexes provide sub-20ms semantic search without external vector store dependency, eliminating a network hop and an additional failure domain.
-
Token-budget-aware retrieval: The greedy knapsack packing algorithm achieves
$\eta_{ctx} \geq 0.92$ context window utilization, a capability that does not exist in the case study's architecture and directly improves downstream model response quality per dollar spent.
The scheme is generalizable across any OpenAI-compatible endpoint (OpenAI API, vLLM, TGI, Ollama) because the output contract is a standard List[ChatCompletionMessageParam] — the module is endpoint-agnostic by construction.