Achieving 75.0% on LoCoMo
How CLAIV Memory leads the long-context memory benchmark.
LoCoMo (Long-Context Memory) is the most rigorous public benchmark for evaluating conversational memory systems. It tests whether a system can extract, store, and retrieve facts from extended multi-turn conversations—the exact problem CLAIV Memory is built to solve. Our V6 architecture scores 75.0% overall, outperforming Mem0 (~68%) and OpenAI Memory (~53%), with particularly strong results in open-domain generation and temporal reasoning. Here is how we got there and what the numbers mean.
What LoCoMo Measures
The benchmark comprises 1,540 evaluation questions drawn from 10 long-form dialogues, grouped into four task categories, each testing a different aspect of memory quality:
- Open-domain generation (79.7%, 841 questions) — Can the system generate coherent, contextually accurate responses using recalled memory? This is the largest category and tests synthesis quality across the full breadth of the conversation.
- Temporal reasoning (74.2%, 321 questions) — Can the system handle facts that change over time? Example: “Where did Alex live before London?” This directly evaluates temporal tracking—the ability to maintain historical states alongside current ones.
- Single-hop recall (68.8%, 282 questions) — Can the system answer a question that requires retrieving a single fact from a past conversation? Example: “What is Alex’s favorite programming language?” This tests extraction accuracy and retrieval precision.
- Multi-hop reasoning (55.2%, 96 questions) — Can the system answer questions that require combining multiple facts? Example: “Which of Alex’s hobbies started after they moved to Berlin?” This is the hardest category for any memory system.
How CLAIV’s Architecture Maps to Each Category
CLAIV’s strongest category is open-domain generation (79.7%), a direct result of structured memory feeding well-formed context into the generation step. Rather than injecting raw conversation history, CLAIV supplies the model with structured, evidence-linked facts — eliminating noise and enabling the model to produce grounded, coherent answers consistently.
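CLAIV's internals are not public, so the following is only a minimal sketch of the idea described above: rendering structured, evidence-linked facts into prompt context instead of injecting raw conversation history. The `Fact` shape and `build_context` function are assumptions for illustration, not CLAIV's actual API.

```python
from dataclasses import dataclass

@dataclass
class Fact:
    """Hypothetical evidence-linked fact (not CLAIV's real schema)."""
    subject: str
    relation: str
    obj: str
    evidence: str  # verbatim span from the source conversation

def build_context(facts: list[Fact]) -> str:
    """Render facts as compact, evidence-linked lines for the generation step."""
    return "\n".join(
        f'- {f.subject} {f.relation} {f.obj} (source: "{f.evidence}")'
        for f in facts
    )

facts = [
    Fact("Alex", "lives_in", "London", "I just moved to London"),
    Fact("Alex", "works_as", "engineer", "my engineering job keeps me busy"),
]
print(build_context(facts))
```

The point of the evidence field is that every line handed to the model can be traced back to something the user actually said, which is what keeps generation grounded.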
Temporal reasoning (74.2%) is where CLAIV’s supersession model shows its clearest advantage. Most memory systems treat updates as overwrites—the old fact is deleted and the new one takes its place. CLAIV keeps both facts, linked by a temporal edge. When LoCoMo asks “Where did Alex live before London?”, the system follows the supersession chain from the current “lives in London” fact back to “lived in Berlin”. This explicit temporal graph is what makes the score competitive.
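The supersession chain described above can be sketched in a few lines. This is a simplified model under assumed names (the `superseded_by` field and `previous_value` helper are illustrative, not CLAIV's actual data model): each fact optionally points at the fact that replaced it, so "before" questions become a one-edge walk backwards.

```python
# Hypothetical store: each fact may record which fact superseded it.
facts = {
    "f1": {"value": "lived in Berlin", "superseded_by": "f2"},
    "f2": {"value": "lives in London", "superseded_by": None},
}

def previous_value(facts: dict, current_id: str):
    """Walk one supersession edge back: find the fact replaced by current_id."""
    for fact in facts.values():
        if fact["superseded_by"] == current_id:
            return fact["value"]
    return None  # no earlier state recorded

print(previous_value(facts, "f2"))  # -> "lived in Berlin"
```

Because the old fact is kept rather than overwritten, "Where did Alex live before London?" is answerable; a system that deletes on update has nothing to walk back to.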
Single-hop recall (68.8%) reflects solid performance in direct factual retrieval. Every conversation turn is decomposed into subject-relation-object triples with evidence spans. When LoCoMo asks “What is Alex’s favorite language?”, the system queries a structured graph for triples matching subject:Alex, relation:favorite_language. This is more precise than vector similarity search, though extraction quality bounds the accuracy.
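A triple query of the kind described above is straightforward to sketch. The list-of-tuples store and `query` function here are illustrative assumptions, not CLAIV's real query interface, but they show why matching on `subject` and `relation` is exact where cosine similarity is approximate.

```python
# Hypothetical triple store: (subject, relation, object) rows.
triples = [
    ("Alex", "favorite_language", "Rust"),
    ("Alex", "lives_in", "London"),
    ("Sam", "favorite_language", "Go"),
]

def query(triples, subject=None, relation=None):
    """Return objects of triples matching the given subject and/or relation."""
    return [
        o for s, r, o in triples
        if (subject is None or s == subject)
        and (relation is None or r == relation)
    ]

print(query(triples, subject="Alex", relation="favorite_language"))  # ['Rust']
```

The accuracy ceiling sits in extraction, as noted above: a query this precise still fails if the triple was never extracted from the conversation in the first place.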
Multi-hop reasoning (55.2%) is the most challenging category for any memory system — including CLAIV. Multi-hop questions require combining facts from separate memory nodes to produce a single answer. Even with a structured graph, the difficulty lies in correctly traversing the right combination of edges when the question is ambiguous. This remains an active area of improvement.
Why Structured Extraction Beats Vector-Only
The conventional approach to conversational memory is to embed conversation chunks into a vector store and retrieve by cosine similarity. This works reasonably well for single-hop questions when the answer is explicitly stated in one chunk. It breaks down in three scenarios:
- Implicit facts. A user says “I’m heading to the office, it’s a 10-minute walk from my flat in Shoreditch.” The vector store embeds this entire sentence. A later query “Where does the user live?” may or may not retrieve it depending on embedding similarity. CLAIV extracts user → lives_in → Shoreditch explicitly.
- Contradictions. If a user later says “I moved to Hackney last month”, a vector store now has two chunks that both match “Where does the user live?”. Which one ranks higher depends on embedding geometry, not temporal order. CLAIV’s curation phase resolves this: the Shoreditch fact is superseded by the Hackney fact with a timestamp.
- Multi-hop reasoning. Vector retrieval returns the N most similar chunks. There is no mechanism to combine facts from different chunks. CLAIV’s graph structure enables traversal across related facts, even if this remains the hardest category to score well on.
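The multi-hop case in the list above can be made concrete with a small sketch. Using the example question from earlier ("Which of Alex's hobbies started after they moved to Berlin?"), the answer requires joining two kinds of facts on a timestamp. The fact layout and `hobbies_after_move` helper are assumptions for illustration, not CLAIV's traversal engine.

```python
from datetime import date

# Hypothetical timestamped facts from two different memory nodes.
facts = [
    {"subject": "Alex", "relation": "moved_to", "obj": "Berlin", "when": date(2022, 3, 1)},
    {"subject": "Alex", "relation": "started_hobby", "obj": "bouldering", "when": date(2022, 9, 10)},
    {"subject": "Alex", "relation": "started_hobby", "obj": "chess", "when": date(2021, 1, 5)},
]

def hobbies_after_move(facts, person, city):
    """Two-hop query: hop 1 finds the move date, hop 2 filters hobbies by it."""
    move_date = next(
        f["when"] for f in facts
        if f["subject"] == person and f["relation"] == "moved_to" and f["obj"] == city
    )
    return [
        f["obj"] for f in facts
        if f["subject"] == person and f["relation"] == "started_hobby"
        and f["when"] > move_date
    ]

print(hobbies_after_move(facts, "Alex", "Berlin"))  # ['bouldering']
```

No single retrieved chunk contains this answer; it only falls out of combining the move fact with each hobby fact, which is exactly the join that pure vector retrieval has no mechanism to perform.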
What the Scores Mean for Production
A 75.0% overall score means CLAIV correctly handles three out of four memory-dependent queries across all difficulty levels—outperforming Mem0 (~68%) and OpenAI Memory (~53%). In production, the effective accuracy skews higher because the most common real-world queries involve open-domain generation (79.7%) and direct fact retrieval (68.8% single-hop), while pure multi-hop relational questions are rarer.
More importantly, when CLAIV does retrieve a fact, it comes with evidence. The application can verify the source, display it to the user, or use it for compliance auditing. A vector store returns a text chunk with a similarity score—there is no way to verify that the retrieved information is actually what the user said.
The full LoCoMo evaluation methodology and our detailed results are available on the benchmarks page. We run the evaluation continuously against every release to track regression.