The plug-and-play memory layer for smart, contextual agents
Memlayer adds persistent, intelligent memory to any LLM, enabling agents that recall context across conversations, extract structured knowledge, and surface relevant information when it matters.
<100ms Fast Search • Noise-Aware Memory Gate • Multi-Tier Retrieval Modes • 100% Local • Zero Config
- Features
- Quick Start
- Key Concepts
- Memory Modes
- Search Tiers
- Providers
- Advanced Features
- Examples
- Performance
- Documentation
- Contributing
- Universal LLM Support: Works with OpenAI, Claude, Gemini, and Ollama models
- Plug-and-play: Install with `pip install memlayer` and get started in minutes with minimal setup
- Intelligent Memory Filtering: Three operation modes (LOCAL/ONLINE/LIGHTWEIGHT) automatically filter conversations for important information
- Hybrid Search: Combines vector similarity + knowledge graph traversal for accurate retrieval
- Three Search Tiers: Fast (<100ms), Balanced (<500ms), Deep (<2s) optimized for different use cases
- Knowledge Graph: Automatically extracts entities, relationships, and facts from conversations
- Proactive Reminders: Schedule tasks and get automatic reminders when they're due
- Built-in Observability: Trace every search operation with detailed performance metrics
- Flexible Storage: ChromaDB (vector) + NetworkX (graph) or graph-only mode
- Production Ready: Serverless-friendly with fast cold starts using online mode
pip install memlayer
from memlayer.wrappers.openai import OpenAI
# Initialize with memory capabilities
client = OpenAI(
model="gpt-4.1-mini",
storage_path="./memories",
user_id="user_123"
)
# Store information automatically
client.chat([
{"role": "user", "content": "My name is Alice and I work at TechCorp"}
])
# Retrieve information automatically (no manual prompting needed!)
response = client.chat([
{"role": "user", "content": "Where do I work?"}
])
# Response: "You work at TechCorp."That's it! Memlayer automatically:
- ✅ Filters salient information using ML-based classification
- ✅ Extracts structured facts, entities, and relationships
- ✅ Stores memories in hybrid vector + graph storage
- ✅ Retrieves relevant context for each query
- ✅ Injects memories seamlessly into LLM context
Not all conversation content is worth storing. Memlayer uses salience gates to intelligently filter:
- ✅ Save: Facts, preferences, user info, decisions, relationships
- ❌ Skip: Greetings, acknowledgments, filler words, meta-conversation
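To make the save/skip decision concrete, here is a minimal keyword-style gate in the spirit of the LIGHTWEIGHT mode. The function name and keyword lists are illustrative only and are not Memlayer's internal API; the LOCAL and ONLINE modes replace this kind of heuristic with embedding-based classification.
def looks_salient(message: str) -> bool:
    """Illustrative keyword gate: keep likely facts and decisions, drop chit-chat."""
    skip_markers = ("hi", "hello", "thanks", "ok", "got it")
    keep_markers = ("my name is", "i work at", "i prefer", "remind me", "we decided")
    text = message.lower().strip()
    if any(text.startswith(marker) for marker in skip_markers):
        return False
    return any(marker in text for marker in keep_markers)

print(looks_salient("Thanks, got it!"))                          # False -> skipped
print(looks_salient("My name is Alice and I work at TechCorp"))  # True -> saved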
Memories are stored in two complementary systems:
- Vector Store (ChromaDB): Semantic similarity search for facts
- Knowledge Graph (NetworkX): Entity relationships and structured knowledge
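A rough sketch of how the two layers complement each other, using ChromaDB and NetworkX directly. The collection name, paths, and sample data are made up for illustration; Memlayer manages both stores for you.
import chromadb
import networkx as nx

# Vector store: semantic similarity search over free-text facts
chroma = chromadb.PersistentClient(path="./memories/vectors")
facts = chroma.get_or_create_collection("facts")
facts.add(ids=["f1"], documents=["Alice works at TechCorp"])
hits = facts.query(query_texts=["Where does Alice work?"], n_results=1)
print(hits["documents"])  # [['Alice works at TechCorp']]

# Knowledge graph: explicit entities and relationships
graph = nx.DiGraph()
graph.add_edge("Alice", "TechCorp", predicate="works_at")
graph.add_edge("Alice", "Project Phoenix", predicate="leads")
print(list(graph.neighbors("Alice")))  # ['TechCorp', 'Project Phoenix']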
After each conversation, background threads:
- Extract facts, entities, and relationships using LLM
- Store facts in vector database with embeddings
- Build knowledge graph with entities and relationships
- Index everything for fast retrieval
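A simplified sketch of that background pattern, with a stubbed-out extraction step standing in for the real LLM call; the helper below is a placeholder, not Memlayer's internals.
import threading

def extract_knowledge(text: str) -> dict:
    # Placeholder for the LLM extraction call
    return {"facts": [text], "entities": ["Alice", "TechCorp"], "relationships": []}

def consolidate(conversation: str) -> None:
    knowledge = extract_knowledge(conversation)
    # ...embed and store facts, add entities/relationships to the graph, refresh indexes...
    print(f"stored {len(knowledge['facts'])} fact(s) in the background")

# Runs off the request path, so the user never waits for consolidation
worker = threading.Thread(target=consolidate, args=("Alice works at TechCorp",), daemon=True)
worker.start()
worker.join()  # only needed in this demo so the print appears before exit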
Memlayer offers three modes that control both memory filtering (salience) and storage:
client = OpenAI(salience_mode="local")- Filtering: Sentence-transformers ML model (high accuracy)
- Storage: ChromaDB (vector) + NetworkX (graph)
- Startup: ~10s (model loading)
- Best for: High-volume production, offline apps
- Cost: Free (no API calls)
client = OpenAI(salience_mode="online")- Filtering: OpenAI embeddings API (high accuracy)
- Storage: ChromaDB (vector) + NetworkX (graph)
- Startup: ~2s (no model loading!)
- Best for: Serverless, cloud functions, fast cold starts
- Cost: ~$0.0001 per operation
client = OpenAI(salience_mode="lightweight")- Filtering: Keyword-based (medium accuracy)
- Storage: NetworkX only (no vector storage!)
- Startup: <1s (instant)
- Best for: Prototyping, testing, low-resource environments
- Cost: Free (no embeddings at all)
Performance Comparison:
| Mode | Startup Time | Accuracy | API Cost | Storage |
|------|--------------|----------|----------|---------|
| LOCAL | ~10s | High | Free | Vector+Graph |
| ONLINE | ~2s | High | $0.0001/op | Vector+Graph |
| LIGHTWEIGHT | <1s | Medium | Free | Graph-only |
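One way to put this comparison into practice is to pick the mode from your deployment environment. This is a sketch that reuses the constructor arguments shown above; the MEMLAYER_MODE environment variable is your own configuration knob, not something Memlayer reads itself.
import os
from memlayer.wrappers.openai import OpenAI

# LOCAL suits long-running servers, ONLINE suits serverless cold starts,
# LIGHTWEIGHT suits quick prototypes (MEMLAYER_MODE is a hypothetical env var)
mode = os.getenv("MEMLAYER_MODE", "local")

client = OpenAI(
    model="gpt-4.1-mini",
    storage_path="./memories",
    user_id="user_123",
    salience_mode=mode,
)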
Memlayer provides three search tiers optimized for different latency requirements:
# Automatic - LLM chooses based on query complexity
response = client.chat([{"role": "user", "content": "What's my name?"}])- 2 vector search results
- No graph traversal
- Perfect for: Real-time chat, simple factual recall
# Automatic - handles most queries well
response = client.chat([{"role": "user", "content": "Tell me about my projects"}])- 5 vector search results
- No graph traversal
- Perfect for: General conversation, most use cases
# Explicit request or auto-detected for complex queries
response = client.chat([{
"role": "user",
"content": "Use deep search: Tell me everything about Alice and her relationships"
}])
- 10 vector search results
- Graph traversal enabled (entity extraction + 1-hop relationships)
- Perfect for: Research, "tell me everything", multi-hop reasoning
Memlayer works with all major LLM providers:
from memlayer.wrappers.openai import OpenAI
client = OpenAI(
model="gpt-4.1-mini", # or gpt-4.1, gpt-5, etc.
storage_path="./memories",
user_id="user_123"
)
from memlayer.wrappers.claude import Claude
client = Claude(
model="claude-4-sonnet",
storage_path="./memories",
user_id="user_123"
)
from memlayer.wrappers.gemini import Gemini
client = Gemini(
model="gemini-2.5-flash",
storage_path="./memories",
user_id="user_123"
)
from memlayer.wrappers.ollama import Ollama
client = Ollama(
host="http://localhost:11434",
model="qwen3:1.7b", # or llama3.2, mistral, etc.
storage_path="./memories",
user_id="user_123",
salience_mode="local" # Run 100% offline!
)
All providers share the same API - switch between them seamlessly!
# User schedules a task
client.chat([{
"role": "user",
"content": "Remind me to submit the report next Friday at 9am"
}])
# Later, when the task is due, Memlayer automatically injects it
response = client.chat([{"role": "user", "content": "What should I do today?"}])
# Response includes: "Don't forget to submit the report - it's due today at 9am!"
response = client.chat(messages)
# Inspect search performance
if client.last_trace:
print(f"Search tier: {client.last_trace.events[0].metadata.get('tier')}")
print(f"Total time: {client.last_trace.total_duration_ms}ms")
for event in client.last_trace.events:
print(f" {event.event_type}: {event.duration_ms}ms")# Control memory filtering strictness
client = OpenAI(
salience_threshold=-0.1 # Permissive (saves more)
# salience_threshold=0.0 # Balanced (default)
# salience_threshold=0.1 # Strict (saves less)
)
# Manually extract structured knowledge
kg = client.analyze_and_extract_knowledge(
"Alice leads Project Phoenix in the London office. The project uses Python and React."
)
print(kg["facts"]) # ["Alice leads Project Phoenix", ...]
print(kg["entities"]) # [{"name": "Alice", "type": "Person"}, ...]
print(kg["relationships"]) # [{"subject": "Alice", "predicate": "leads", "object": "Project Phoenix"}]Explore the examples/ directory for comprehensive examples:
# Getting started
python examples/01_basics/getting_started.py
# Try all three search tiers
python examples/02_search_tiers/fast_tier_example.py
python examples/02_search_tiers/balanced_tier_example.py
python examples/02_search_tiers/deep_tier_example.py
# Compare them side-by-side
python examples/02_search_tiers/tier_comparison.py
# Proactive task reminders
python examples/03_features/task_reminders.py
# Knowledge graph visualization
python examples/03_features/test_knowledge_graph.py
# Compare salience modes
python examples/04_benchmarks/compare_salience_modes.py
# Try different LLM providers
python examples/05_providers/openai_example.py
python examples/05_providers/claude_example.py
python examples/05_providers/gemini_example.py
python examples/05_providers/ollama_example.py
See examples/README.md for full documentation.
Real-world startup times from benchmarks:
| Mode | First Use | Savings | Trade-off |
|------|-----------|---------|-----------|
| LIGHTWEIGHT | ~5s | No embeddings | No semantic search |
| ONLINE | ~5s | 5s faster | Small API cost |
| LOCAL | ~10s | No API cost | 11s model loading |
Typical query latencies:
| Tier | Latency | Vector Results | Graph | Use Case |
|------|---------|----------------|-------|----------|
| Fast | 50-150ms | 2 | No | Real-time chat |
| Balanced | 200-600ms | 5 | No | General use |
| Deep | 800-2500ms | 10 | Yes | Research queries |
Background processing (non-blocking):
| Step | Time | Async |
|------|------|-------|
| Salience filtering | ~10ms | Yes |
| Knowledge extraction | ~1-2s | Yes (background thread) |
| Vector storage | ~50ms | Yes |
| Graph storage | ~20ms | Yes |
| Total (non-blocking) | ~0ms | User doesn't wait! |
- Basics Overview - Architecture, components, and how Memlayer works
- Quickstart Guide - Get up and running in 5 minutes
- Streaming Mode - Complete guide to streaming responses
- API Reference - Complete API documentation with all methods and parameters
- Providers Overview - Compare all providers, choose the right one
- Ollama Setup - Run completely offline with local models
- OpenAI - OpenAI configuration
- Claude - Anthropic Claude setup
- Gemini - Google Gemini configuration
- Examples Index - Comprehensive examples by category
- Provider Examples - Provider comparison and usage
The project exposes several runtime/configuration knobs you can tune to match latency, cost, and accuracy trade-offs. Detailed docs for each area live in the docs/ folder:
- docs/tuning/operation_mode.md — Architecture deep dive: how to choose between `online`, `local`, and `lightweight` modes, performance implications, storage composition, and deployment strategies.
- docs/tuning/intervals.md — Scheduler and curation interval configuration (`scheduler_interval_seconds`, `curation_interval_seconds`) and practical guidance.
- docs/tuning/salience_threshold.md — How to adjust `salience_threshold` and the expected behavior.
- docs/services/consolidation.md — Consolidation pipeline internals and how to call it programmatically (including `update_from_text`).
- docs/services/curation.md — How memory curation works, archiving rules, and how to run/stop the curation service.
- docs/storage/chroma.md — ChromaDB notes: metadata types, connection handling, and Windows file-lock guidance.
- docs/storage/networkx.md — Knowledge graph persistence, expected node schemas, and backup/restore tips.
Use these docs when tuning for production; each provides detailed, practical guidance.
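As a starting point, the main knobs can be passed together at construction time. This is a sketch based on the parameter names listed above; the example values and the exact combined constructor signature are assumptions, so verify them against the linked docs.
from memlayer.wrappers.openai import OpenAI

# Example values only; see docs/tuning/*.md for recommended settings
client = OpenAI(
    model="gpt-4.1-mini",
    storage_path="./memories",
    user_id="user_123",
    salience_mode="online",             # "local", "online", or "lightweight"
    salience_threshold=0.0,             # higher = stricter memory filtering
    scheduler_interval_seconds=60,      # assumed: how often due reminders are checked
    curation_interval_seconds=3600,     # assumed: how often curation/archiving runs
)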
# Clone repository
git clone https://github.com/divagr18/memlayer.git
cd memlayer
# Install dependencies
pip install -e .
# Run tests
python -m pytest tests/
# Run examples
python examples/01_basics/getting_started.py
memlayer/
├── memlayer/ # Core library
│ ├── wrappers/ # LLM provider wrappers
│ ├── storage/ # Storage backends (ChromaDB, NetworkX)
│ ├── services.py # Search & consolidation services
│ ├── ml_gate.py # Salience filtering
│ └── embedding_models.py # Embedding model implementations
├── examples/ # Organized examples by category
│ ├── 01_basics/
│ ├── 02_search_tiers/
│ ├── 03_features/
│ ├── 04_benchmarks/
│ └── 05_providers/
├── tests/ # Tests and benchmarks
├── docs/ # Documentation
└── README.md # This file
Contributions are welcome! Here's how you can help:
- Report bugs - Open an issue with reproduction steps
- Suggest features - Share your use case and requirements
- Submit PRs - Fix bugs, add features, improve docs
- Share examples - Show us what you've built!
Please keep PRs focused and include tests for new features.
- Author/Maintainer: Divyansh Agrawal
- Email: [email protected]
- GitHub: divagr18
- Issues: Report bugs or request features via GitHub Issues
For security vulnerabilities, please email directly with SECURITY in the subject line instead of opening a public issue.
MIT License - see LICENSE for details.
- Built with ChromaDB for vector storage
- Uses NetworkX for knowledge graph operations
- Powered by sentence-transformers for local embeddings
- Supports OpenAI, Anthropic, Google Gemini, and Ollama
Made with ❤️ for the AI community
Give your LLMs memory. Try Memlayer today!