This folder contains the evaluation suites used by EverOS to measure memory quality and agent self-evolution. Use these benchmarks to reproduce reported results, compare memory systems, or evaluate new agent learning methods.
| Benchmark | What it measures | Start here |
|---|---|---|
| EverMemBench | Long-term memory quality in multi-person group conversations, including factual recall, applied reasoning, and personalized generalization. | EverMemBench/ |
| EvoAgentBench | Agent self-evolution across information retrieval, reasoning, software engineering, code implementation, and knowledge-work tasks. | EvoAgentBench/ |
- Start with EverMemBench/ to evaluate memory retrieval and answer quality.
- Start with EvoAgentBench/ to evaluate whether agents improve from past experience.
- For the project-level benchmark overview, see the top-level Benchmarks and Evaluation sections.