A 35% relative improvement over the average of leading memory systems on long-term conversational understanding.
Categories 1-4 Accuracy
Benchmark: LoCoMo · 10 conversations · ~26K tokens each · LLM-as-Judge evaluation
Source: Baseline results from arXiv:2504.19413. Some vendors dispute these figures; see the paper for methodology.
| Category | Memvid | Mem0 | Mem0ᵍ | Zep | OpenAI | Δ vs baseline avg |
|---|---|---|---|---|---|---|
| Single-hop | 80.1% | 67.1% | 65.7% | 61.7% | 63.8% | +24% |
| Multi-hop | 80.4% | 51.1% | 47.2% | 41.4% | 42.9% | +76% |
| Temporal | 71.9% | 55.5% | 58.1% | 49.3% | 21.7% | +56% |
| World-knowledge | 91.1% | 72.9% | 75.7% | 76.6% | 62.3% | +27% |
| Adversarial | 77.8% | — | — | — | — | — |
| Overall (Cat. 1-4) | 85.65% | 66.88% | 68.44% | 65.99% | 52.90% | +35% |
Following standard methodology, the adversarial category is excluded from the primary metric. Δ is Memvid's relative improvement over the average score of the four baseline systems.
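As a quick sanity check, here is that Δ arithmetic as a runnable sketch (the function name and rounding are illustrative, not taken from the benchmark code):

```ts
// Δ column: relative improvement of Memvid's score over the mean of the
// four baseline scores, rounded to the nearest percent.
function deltaVsBaselineAvg(memvid: number, baselines: number[]): number {
  const avg = baselines.reduce((sum, s) => sum + s, 0) / baselines.length;
  return Math.round((memvid / avg - 1) * 100);
}

// Overall (Cat. 1-4): mean of 66.88, 68.44, 65.99, 52.90 is ~63.55,
// and 85.65 / 63.55 - 1 ≈ 0.35, i.e. the +35% headline figure.
console.log(deltaVsBaselineAvg(85.65, [66.88, 68.44, 65.99, 52.90])); // 35
```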
Open-source benchmark suite
Our benchmark implementation is fully open source. Run the complete evaluation suite yourself and verify the results.
```bash
bun run bench:full
```
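Scoring uses LLM-as-judge, as noted above: a judge model compares each system's answer to the gold answer, and a category's accuracy is the fraction of questions judged correct. A minimal sketch of such a grader, assuming an OpenAI judge model; the prompt wording, model choice, and `judgeAnswer` helper are illustrative assumptions, not the suite's actual code:

```ts
// Hypothetical LLM-as-judge grader: given a question, the gold answer, and a
// system's answer, ask a judge model for a binary CORRECT/INCORRECT verdict.
import OpenAI from "openai";

const client = new OpenAI(); // reads OPENAI_API_KEY from the environment

async function judgeAnswer(
  question: string,
  goldAnswer: string,
  candidate: string,
): Promise<boolean> {
  const res = await client.chat.completions.create({
    model: "gpt-4o-mini", // assumed judge model, not specified by the suite
    temperature: 0,
    messages: [
      {
        role: "system",
        content:
          "You are grading answers to questions about a long conversation. " +
          "Reply with exactly CORRECT or INCORRECT.",
      },
      {
        role: "user",
        content: `Question: ${question}\nGold answer: ${goldAnswer}\nCandidate answer: ${candidate}`,
      },
    ],
  });
  return res.choices[0].message.content?.trim().toUpperCase() === "CORRECT";
}
```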