Official Benchmark Results

Claiv Memory Benchmark Results

Claiv achieves 75.0% overall accuracy on the LoCoMo benchmark, outperforming other AI memory systems in long-context recall and reasoning.

75% Overall LoCoMo Score

Measured across 1,540 evaluation questions covering single-hop recall, multi-hop reasoning, temporal reasoning, and open-ended generation.

About the Benchmark

What is the LoCoMo Benchmark?

LoCoMo (Long Context Memory) evaluates how well AI systems can remember, retrieve, and reason over long conversations. The benchmark simulates real multi-session interactions and tests whether systems can answer questions using information mentioned hundreds or thousands of tokens earlier.

It evaluates four categories of memory capability: single-hop retrieval, multi-hop reasoning, temporal reasoning, and open-domain generation.

Single-Hop Retrieval

Direct recall of previously mentioned facts.

Multi-Hop Reasoning

Connecting multiple memory nodes to answer a question.

Temporal Reasoning

Understanding how facts change over time.

Open-Domain Generation

Coherent answers grounded in remembered information.

Performance

Claiv Benchmark Results

Evaluated across 10 complete LoCoMo dialogue datasets — 1,540 benchmark questions.

Open-Domain Generation: 79.7%
Temporal Reasoning: 74.2%
Single-Hop Recall: 68.8%
Multi-Hop Reasoning: 55.2%

Open-Domain Generation

79.7%

841 questions

Generating coherent responses grounded in stored memory across long conversations.

Temporal Reasoning

74.2%

321 questions

Tracking how facts change over time and answering time-dependent questions accurately.

Single-Hop Recall

68.8%

282 questions

Direct retrieval of previously mentioned facts from structured memory nodes.

Multi-Hop Reasoning

55.2%

96 questions

Connecting multiple pieces of stored memory to answer relational questions.

Overall LoCoMo Score: 75%
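The overall score is the question-weighted average of the four category scores. A quick sketch of the arithmetic:

```python
# Per-category (score %, question count) pairs from the results above.
categories = {
    "open_domain_generation": (79.7, 841),
    "temporal_reasoning":     (74.2, 321),
    "single_hop_recall":      (68.8, 282),
    "multi_hop_reasoning":    (55.2, 96),
}

total_questions = sum(n for _, n in categories.values())
overall = sum(score * n for score, n in categories.values()) / total_questions

print(f"{total_questions} questions, overall {overall:.1f}%")
# 1540 questions, overall 75.0%
```

Weighting by question count is why the overall score sits close to the open-domain figure: that category contributes more than half of the 1,540 questions.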

Competitive Comparison

Comparison With Other Memory Systems

Independent LoCoMo benchmark results reported for other systems.

System          LoCoMo Score
Claiv Memory    75% (best)
Mem0            ~68%
OpenAI Memory   ~53%

Claiv outperforms other memory approaches due to its structured memory architecture and deterministic recall model.

Architecture

Why Claiv Performs Well

Most AI memory systems rely on conversation replay, vector similarity, or unstructured message storage. These approaches struggle with long-context recall. Claiv uses a fundamentally different architecture.

Structured Fact Extraction

Each conversation is parsed into structured memory entries with explicit fact extraction — not unstructured message storage.
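A minimal sketch of what a structured memory entry could look like. The field names here are illustrative assumptions, not Claiv's actual schema:

```python
from dataclasses import dataclass

@dataclass
class MemoryEntry:
    """One extracted fact, stored as structured data rather than a raw message."""
    subject: str        # entity the fact is about
    predicate: str      # typed relation, e.g. "works_at"
    obj: str            # value of the fact
    session_id: str     # conversation session the fact came from
    evidence: str       # verbatim span from the source message

# Example: "I just started at Acme last month" becomes a structured fact.
entry = MemoryEntry(
    subject="user",
    predicate="works_at",
    obj="Acme",
    session_id="session-3",
    evidence="I just started at Acme last month",
)
```

Storing subject/predicate/object triples instead of raw transcripts is what makes the relational and temporal queries described below cheap to answer.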

Relational Memory Graphs

Facts are stored with typed relationships, enabling multi-hop reasoning without scanning entire conversation histories.
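Multi-hop reasoning over such a graph amounts to a traversal across typed edges. A toy illustration (not Claiv's implementation):

```python
from collections import defaultdict

# graph[subject] -> list of (predicate, object) edges
graph = defaultdict(list)

def add_fact(subject, predicate, obj):
    graph[subject].append((predicate, obj))

add_fact("user", "works_at", "Acme")
add_fact("Acme", "located_in", "Berlin")

def hop(subject, *predicates):
    """Follow a chain of typed relations; return None if any hop is missing."""
    node = subject
    for pred in predicates:
        matches = [o for p, o in graph[node] if p == pred]
        if not matches:
            return None
        node = matches[0]
    return node

# "Which city does the user work in?" requires two hops.
print(hop("user", "works_at", "located_in"))  # Berlin
```

Because each hop is a lookup on a typed edge rather than a similarity search, the answer does not depend on scanning or re-embedding the full conversation history.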

Evidence-Backed Storage

Every memory entry includes a source reference from the conversation, ensuring traceability and preventing hallucinated recall.

Deterministic Recall

Facts are retrieved using structured relationships rather than embedding similarity, producing stable and repeatable recall results.

Structured fact extraction replaces unstructured message storage
Evidence spans provide verifiable provenance for every recalled fact
Temporal edges track how facts evolve across conversation turns
Deterministic recall produces stable, repeatable results
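The temporal-edge idea can be sketched as keeping timestamped versions of a fact and answering a time-dependent question from the version valid at the asked point. This is an illustrative toy, assuming turn indices as timestamps:

```python
import bisect

# Parallel lists kept sorted by turn index; later turns supersede earlier ones.
turns, values = [], []

def record(turn, value):
    """Store a new version of the fact observed at the given turn."""
    i = bisect.bisect_left(turns, turn)
    turns.insert(i, turn)
    values.insert(i, value)

def fact_at(turn):
    """Return the fact's value as of the given turn, or None if not yet stated."""
    i = bisect.bisect_right(turns, turn)
    return values[i - 1] if i else None

record(3, "Paris")    # user says they live in Paris at turn 3
record(42, "Berlin")  # user moves; fact updated at turn 42

print(fact_at(10))   # Paris
print(fact_at(100))  # Berlin
```

Keeping superseded versions instead of overwriting them is what lets a system answer both "where does the user live?" and "where did the user live before?" correctly.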

Test Setup

Methodology

Claiv was evaluated using the full LoCoMo benchmark dataset:

10
Dialogue sets
1,540
Evaluation questions
4
Memory categories

Each question was answered using Claiv-retrieved memory and scored by an LLM-based judge.

Build AI systems that remember what matters.

Claiv provides production-ready memory infrastructure validated by benchmark performance.