Claiv Memory Benchmark Results
Claiv achieves 75.0% overall accuracy on the LoCoMo benchmark, outperforming other AI memory systems in long-context recall and reasoning.
Measured across 1,540 evaluation questions covering single-hop recall, multi-hop reasoning, temporal reasoning, and open-ended generation.
About the Benchmark
What is the LoCoMo Benchmark?
LoCoMo (Long Context Memory) evaluates how well AI systems can remember, retrieve, and reason over long conversations. The benchmark simulates real multi-session interactions and tests whether systems can answer questions using information mentioned hundreds or thousands of tokens earlier.
It evaluates four categories of memory capability: single-hop retrieval, multi-hop reasoning, temporal reasoning, and open-domain generation.
Single-Hop Retrieval
Direct recall of previously mentioned facts.
Multi-Hop Reasoning
Connecting multiple memory nodes to answer a question.
Temporal Reasoning
Understanding how facts change over time.
Open-Domain Generation
Coherent answers grounded in remembered information.
Performance
Claiv Benchmark Results
Evaluated across 10 complete LoCoMo dialogue datasets — 1,540 benchmark questions.
Open-Domain Generation
79.7% (841 questions)
Generating coherent responses grounded in stored memory across long conversations.
Temporal Reasoning
74.2% (321 questions)
Tracking how facts change over time and answering time-dependent questions accurately.
Single-Hop Recall
68.8% (282 questions)
Direct retrieval of previously mentioned facts from structured memory nodes.
Multi-Hop Reasoning
55.2% (96 questions)
Connecting multiple pieces of stored memory to answer relational questions.
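The headline 75.0% figure is consistent with a question-weighted average of the four category scores above. A quick check (category keys are just labels for this calculation):

```python
# Per-category accuracy and question counts from the results above.
categories = {
    "open_domain_generation": (0.797, 841),
    "temporal_reasoning":     (0.742, 321),
    "single_hop_recall":      (0.688, 282),
    "multi_hop_reasoning":    (0.552, 96),
}

# Weight each category's accuracy by its question count.
total_questions = sum(n for _, n in categories.values())
overall = sum(acc * n for acc, n in categories.values()) / total_questions

print(total_questions)    # 1540
print(round(overall, 3))  # 0.75
```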
Competitive Comparison
Comparison With Other Memory Systems
Independent LoCoMo benchmark results reported for other systems.
| System | LoCoMo Score |
|---|---|
| Claiv Memory | 75% |
| Mem0 | ~68% |
| OpenAI Memory | ~53% |
Claiv outperforms other memory approaches due to its structured memory architecture and deterministic recall model.
Architecture
Why Claiv Performs Well
Most AI memory systems rely on conversation replay, vector similarity, or unstructured message storage. These approaches struggle with long-context recall. Claiv uses a fundamentally different architecture.
Structured Fact Extraction
Each conversation is parsed into structured memory entries with explicit fact extraction — not unstructured message storage.
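Claiv's extraction pipeline isn't shown here; as an illustration of the output shape, here is a toy pattern-based extractor that turns a message into typed fact records (a production system would use an LLM or parser rather than a regex):

```python
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class Fact:
    subject: str
    predicate: str
    obj: str

# Toy extractor: the regex and predicates are illustrative only.
PATTERN = re.compile(r"(\w+) (lives in|works at|likes) (\w+)")

def extract_facts(message: str) -> list[Fact]:
    """Parse a message into structured Fact entries instead of
    storing the raw message text."""
    return [Fact(*m.groups()) for m in PATTERN.finditer(message)]

facts = extract_facts("Alice lives in Berlin. Alice works at Acme.")
# [Fact('Alice', 'lives in', 'Berlin'), Fact('Alice', 'works at', 'Acme')]
```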
Relational Memory Graphs
Facts are stored with typed relationships, enabling multi-hop reasoning without scanning entire conversation histories.
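A minimal sketch of what multi-hop traversal over typed relationships looks like (the graph API and relation names are assumptions, not Claiv's actual interface):

```python
from collections import defaultdict

class MemoryGraph:
    def __init__(self):
        # node -> list of (relation_type, target_node) edges
        self.edges = defaultdict(list)

    def add(self, subject: str, relation: str, obj: str) -> None:
        self.edges[subject].append((relation, obj))

    def hop(self, start: str, *relations: str) -> set[str]:
        """Follow a chain of typed relations from a start node,
        without scanning any conversation history."""
        frontier = {start}
        for rel in relations:
            frontier = {o for n in frontier
                        for (r, o) in self.edges[n] if r == rel}
        return frontier

g = MemoryGraph()
g.add("Alice", "works_at", "Acme")
g.add("Acme", "located_in", "Berlin")

# Two-hop question: "Where is Alice's employer located?"
g.hop("Alice", "works_at", "located_in")  # {'Berlin'}
```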
Evidence-Backed Storage
Every memory entry includes a source reference from the conversation, ensuring traceability and preventing hallucinated recall.
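As a sketch of the idea (field names are illustrative, not Claiv's schema), each stored fact carries a pointer back to the exact turn that supports it, so recall can be verified against the transcript:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MemoryEntry:
    fact: str
    session_id: str   # which conversation session produced the fact
    turn_index: int   # which turn stated it
    quote: str        # verbatim supporting span from that turn

def is_traceable(entry: MemoryEntry, transcript: list[str]) -> bool:
    """A fact counts as evidence-backed only if its quoted span
    actually appears at the referenced turn."""
    return (0 <= entry.turn_index < len(transcript)
            and entry.quote in transcript[entry.turn_index])

transcript = ["Hi!", "I just moved to Berlin last week."]
entry = MemoryEntry("Alice moved to Berlin", "session-3", 1,
                    "moved to Berlin")
is_traceable(entry, transcript)  # True
```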
Deterministic Recall
Facts are retrieved using structured relationships rather than embedding similarity, producing stable and repeatable recall results.
Test Setup
Methodology
Claiv was evaluated on the full LoCoMo benchmark: 10 complete dialogue datasets comprising 1,540 questions. Each question was answered using Claiv-retrieved memory, and answers were scored by an LLM-based judge.
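The evaluation loop has a simple shape. In this sketch, `call_judge_llm` is a hypothetical stand-in for whatever judge model the harness uses, implemented here as a trivial substring check:

```python
def call_judge_llm(question: str, gold: str, answer: str) -> int:
    # Placeholder judge: real harnesses prompt an LLM to grade the
    # answer against the gold reference; here we substring-match.
    return 1 if gold.lower() in answer.lower() else 0

def score(benchmark: list[tuple[str, str]], answer_fn) -> float:
    """Answer every (question, gold) pair and return accuracy."""
    correct = sum(call_judge_llm(q, gold, answer_fn(q))
                  for q, gold in benchmark)
    return correct / len(benchmark)

bench = [("Where does Alice live?", "Berlin")]
score(bench, lambda q: "Alice lives in Berlin")  # 1.0
```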
Build AI systems that remember what matters.
Claiv provides production-ready memory infrastructure validated by benchmark performance.