Official Benchmark Results

Claiv Memory Benchmark Results

Claiv achieves 75.0% overall accuracy on the LoCoMo benchmark, outperforming other AI memory systems in long-context recall and reasoning.

75% Overall LoCoMo Score

Measured across 1,540 evaluation questions covering single-hop recall, multi-hop reasoning, temporal reasoning, and open-ended generation.

About the Benchmark

What is the LoCoMo Benchmark?

LoCoMo (Long Context Memory) evaluates how well AI systems can remember, retrieve, and reason over long conversations. The benchmark simulates real multi-session interactions and tests whether systems can answer questions using information mentioned hundreds or thousands of tokens earlier.

It evaluates four categories of memory capability: single-hop retrieval, multi-hop reasoning, temporal reasoning, and open-domain generation.

Single-Hop Retrieval

Direct recall of previously mentioned facts.

Multi-Hop Reasoning

Connecting multiple memory nodes to answer a question.

Temporal Reasoning

Understanding how facts change over time.

Open-Domain Generation

Coherent answers grounded in remembered information.

Performance

Claiv Benchmark Results

Evaluated across 10 complete LoCoMo dialogue datasets — 1,540 benchmark questions.

Open-Domain Generation: 79.7%
Temporal Reasoning: 74.2%
Single-Hop Recall: 68.8%
Multi-Hop Reasoning: 55.2%

Open-Domain Generation

79.7%

841 questions

Generating coherent responses grounded in stored memory across long conversations.

Temporal Reasoning

74.2%

321 questions

Tracking how facts change over time and answering time-dependent questions accurately.

Single-Hop Recall

68.8%

282 questions

Direct retrieval of previously mentioned facts from structured memory nodes.

Multi-Hop Reasoning

55.2%

96 questions

Connecting multiple pieces of stored memory to answer relational questions.

Overall LoCoMo Score: 75%
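The overall score is the question-weighted average of the four category scores. A quick sketch of the arithmetic:

```python
# Per-category (score %, question count) pairs from the results above.
categories = {
    "open_domain_generation": (79.7, 841),
    "temporal_reasoning":     (74.2, 321),
    "single_hop_recall":      (68.8, 282),
    "multi_hop_reasoning":    (55.2, 96),
}

total_questions = sum(n for _, n in categories.values())
overall = sum(score * n for score, n in categories.values()) / total_questions

print(f"{total_questions} questions, overall {overall:.1f}%")
# 1540 questions, overall 75.0%
```

Weighting by question count is why the overall score sits close to the open-domain figure: that category contributes more than half of the 1,540 questions.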

Competitive Comparison

Comparison With Other Memory Systems

Independent LoCoMo benchmark results reported for other systems.

System          LoCoMo Score
Claiv Memory    75% (best)
Mem0            ~68%
OpenAI Memory   ~53%

Claiv outperforms other memory approaches due to its structured memory architecture and deterministic recall model.

Architecture

Why Claiv Performs Well

Most AI memory systems rely on conversation replay, vector similarity, or unstructured message storage. These approaches struggle with long-context recall. Claiv uses a fundamentally different architecture.

Structured Fact Extraction

Each conversation is parsed into structured memory entries with explicit fact extraction — not unstructured message storage.
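A minimal sketch of what a structured memory entry could look like. The field names here are illustrative assumptions, not Claiv's actual schema:

```python
from dataclasses import dataclass

@dataclass
class MemoryEntry:
    """One extracted fact, stored as structured data rather than a raw message."""
    subject: str        # entity the fact is about
    predicate: str      # typed relation, e.g. "works_at"
    obj: str            # value of the fact
    session_id: str     # conversation session the fact came from
    evidence: str       # verbatim span from the source message

# Example: "I just started at Acme last month" becomes a structured fact.
entry = MemoryEntry(
    subject="user",
    predicate="works_at",
    obj="Acme",
    session_id="session-3",
    evidence="I just started at Acme last month",
)
```

Storing subject/predicate/object triples instead of raw transcripts is what makes the relational and temporal queries described below cheap to answer.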

Relational Memory Graphs

Facts are stored with typed relationships, enabling multi-hop reasoning without scanning entire conversation histories.
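Multi-hop reasoning over such a graph amounts to a traversal across typed edges. A toy illustration (not Claiv's implementation):

```python
from collections import defaultdict

# graph[subject] -> list of (predicate, object) edges
graph = defaultdict(list)

def add_fact(subject, predicate, obj):
    graph[subject].append((predicate, obj))

add_fact("user", "works_at", "Acme")
add_fact("Acme", "located_in", "Berlin")

def hop(subject, *predicates):
    """Follow a chain of typed relations; return None if any hop is missing."""
    node = subject
    for pred in predicates:
        matches = [o for p, o in graph[node] if p == pred]
        if not matches:
            return None
        node = matches[0]
    return node

# "Which city does the user work in?" requires two hops.
print(hop("user", "works_at", "located_in"))  # Berlin
```

Because each hop is a lookup on a typed edge rather than a similarity search, the answer does not depend on scanning or re-embedding the full conversation history.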

Evidence-Backed Storage

Every memory entry includes a source reference from the conversation, ensuring traceability and preventing hallucinated recall.

Deterministic Recall

Facts are retrieved using structured relationships rather than embedding similarity, producing stable and repeatable recall results.

Structured fact extraction replaces unstructured message storage
Evidence spans provide verifiable provenance for every recalled fact
Temporal edges track how facts evolve across conversation turns
Deterministic recall produces stable, repeatable results
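The temporal-edge idea can be sketched as keeping timestamped versions of a fact and answering a time-dependent question from the version valid at the asked point. This is an illustrative toy, assuming turn indices as timestamps:

```python
import bisect

# Parallel lists kept sorted by turn index; later turns supersede earlier ones.
turns, values = [], []

def record(turn, value):
    """Store a new version of the fact observed at the given turn."""
    i = bisect.bisect_left(turns, turn)
    turns.insert(i, turn)
    values.insert(i, value)

def fact_at(turn):
    """Return the fact's value as of the given turn, or None if not yet stated."""
    i = bisect.bisect_right(turns, turn)
    return values[i - 1] if i else None

record(3, "Paris")    # user says they live in Paris at turn 3
record(42, "Berlin")  # user moves; fact updated at turn 42

print(fact_at(10))   # Paris
print(fact_at(100))  # Berlin
```

Keeping superseded versions instead of overwriting them is what lets a system answer both "where does the user live?" and "where did the user live before?" correctly.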

Test Setup

Methodology

Claiv was evaluated using the full LoCoMo benchmark dataset:

10
Dialogue sets
1,540
Evaluation questions
4
Memory categories

Each question was answered using Claiv-retrieved memory and scored by an LLM-based judge.

Build AI systems that remember what matters.

Claiv provides production-ready memory infrastructure validated by benchmark performance.