A 35% relative improvement over the average of leading memory systems on long-term conversational understanding.
Categories 1-4 Accuracy
Benchmark: LoCoMo · 10 conversations · ~26K tokens each · LLM-as-Judge evaluation
Source: Baseline results from arXiv:2504.19413. Some vendors dispute these figures; see the paper for methodology.
| Category | Memvid | Mem0 | Mem0ᵍ | Zep | OpenAI | Δ vs baseline avg |
|---|---|---|---|---|---|---|
| Single-hop | 80.1% | 67.1% | 65.7% | 61.7% | 63.8% | +24% |
| Multi-hop | 80.4% | 51.1% | 47.2% | 41.4% | 42.9% | +76% |
| Temporal | 71.9% | 55.5% | 58.1% | 49.3% | 21.7% | +56% |
| World-knowledge | 91.1% | 72.9% | 75.7% | 76.6% | 62.3% | +27% |
| Adversarial | 77.8% | — | — | — | — | — |
| Overall (Cat. 1-4) | 85.65% | 66.88% | 68.44% | 65.99% | 52.90% | +35% |
Following standard methodology, the adversarial category is excluded from the primary metric. Δ is Memvid's relative improvement over the average score of the four baseline systems.
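As a quick sanity check, here is that Δ arithmetic as a runnable sketch (the function name and rounding are illustrative, not taken from the benchmark code):

```ts
// Δ column: relative improvement of Memvid's score over the mean of the
// four baseline scores, rounded to the nearest percent.
function deltaVsBaselineAvg(memvid: number, baselines: number[]): number {
  const avg = baselines.reduce((sum, s) => sum + s, 0) / baselines.length;
  return Math.round((memvid / avg - 1) * 100);
}

// Overall (Cat. 1-4): mean of 66.88, 68.44, 65.99, 52.90 is ~63.55,
// and 85.65 / 63.55 - 1 ≈ 0.35, i.e. the +35% headline figure.
console.log(deltaVsBaselineAvg(85.65, [66.88, 68.44, 65.99, 52.90])); // 35
```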
Open-source benchmark suite
Our benchmark implementation is fully open source. Run the complete evaluation suite yourself and verify the results.
```bash
bun run bench:full
```
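Scoring uses LLM-as-judge, as noted above: a judge model compares each system's answer to the gold answer, and a category's accuracy is the fraction of questions judged correct. A minimal sketch of such a grader, assuming an OpenAI judge model; the prompt wording, model choice, and `judgeAnswer` helper are illustrative assumptions, not the suite's actual code:

```ts
// Hypothetical LLM-as-judge grader: given a question, the gold answer, and a
// system's answer, ask a judge model for a binary CORRECT/INCORRECT verdict.
import OpenAI from "openai";

const client = new OpenAI(); // reads OPENAI_API_KEY from the environment

async function judgeAnswer(
  question: string,
  goldAnswer: string,
  candidate: string,
): Promise<boolean> {
  const res = await client.chat.completions.create({
    model: "gpt-4o-mini", // assumed judge model, not specified by the suite
    temperature: 0,
    messages: [
      {
        role: "system",
        content:
          "You are grading answers to questions about a long conversation. " +
          "Reply with exactly CORRECT or INCORRECT.",
      },
      {
        role: "user",
        content: `Question: ${question}\nGold answer: ${goldAnswer}\nCandidate answer: ${candidate}`,
      },
    ],
  });
  return res.choices[0].message.content?.trim().toUpperCase() === "CORRECT";
}
```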