Is Mem0's 94.4% LongMemEval score real?

It is Mem0's own vendor-reported number from its April 2026 report, not a peer-reviewed or independently reproduced result. On Maximem's open same-harness audit, Mem0 scored 57.5% on LongMemEval, a 36.9-point gap. Treat the headline as a vendor claim and weight the independent reproductions more heavily.

Which agent memory product should I use in 2026?

If you run on Cloudflare Workers, use Cloudflare Agent Memory (free in private beta). If you want portability, use Mem0's Apache 2.0 open-source core. If you already run Weaviate, Engram is the lowest-friction add. Microsoft Foundry fits Azure-native enterprises needing procedural memory and identity integration.

Why do agent memory benchmark scores vary so much between sources?

Vendors run LongMemEval and LoCoMo with different answer models, retrieval settings, and LLM judges, so absolute scores are not comparable across reports. Per the AgentOS audit, roughly 6.4% of LoCoMo's answer key is also wrong, and its LLM judge accepts about 63% of wrong answers. Only same-harness comparisons with public logs are trustworthy.

AI Agent Memory Got Crowded. Here's What Shipped

Q: What is an AI agent memory layer?

It is a managed service that turns an agent's raw conversation and tool-call history into structured, durable, retrievable facts scoped to a user, session, or tenant. Instead of stuffing full history back into the context window, the agent writes memories on one path and recalls a small relevant set on another. This cuts token cost and latency while keeping recall across sessions.

Between April 13 and June 15, 2026, four separate companies shipped a managed memory layer for AI agents. Cloudflare, Weaviate, Microsoft, and Mem0 all decided, within about two months of each other, that AI agent memory was a product category worth owning. That is not a coincidence. It is a land grab.

The pitch is the same everywhere: your agent forgets between sessions, stuffing full history into the context window is expensive and degrades recall, so let us hold the memory for you. The differences are in what actually shipped, what it costs, and whether the benchmark numbers on the landing page survive contact with an independent harness.

Mostly they don't.

This piece sorts the wave: what's GA versus beta, what locks you in, and which numbers to trust.

TL;DR

Four managed agent-memory layers launched in seven weeks of mid-2026: Weaviate Engram (GA), Cloudflare Agent Memory (private beta), Microsoft Foundry Memory (preview), and Mem0 (GA, with a contested benchmark report). The architecture has converged on async write plus retrieval-first recall.

The benchmarks have not converged at all. Independent reproductions land 25 to 37 points below vendor headlines, so pick on runtime fit and lock-in, not leaderboard rank.

Key takeaways

Convergence on architecture: every launch uses an asynchronous extract-and-store write path plus a retrieval-first recall path, deliberately avoiding context stuffing.
Divergence on maturity: Engram is GA with a free tier; Cloudflare is free private beta; Foundry is preview; Mem0 is GA.
The benchmark gap is the story. Mem0's self-reported 94.4% on LongMemEval reproduces at 57.5% on Maximem's open harness, a 36.9-point gap.
Pick by runtime and portability, not by leaderboard. Lock-in profiles differ sharply.
No canonical benchmark exists. Four instruments compete (LongMemEval, LoCoMo, BEAM, plus the DMR benchmark Zep cites), with no agreement on which is authoritative.

What is an AI agent memory layer?

An agent memory layer is a managed service that converts an agent's raw, noisy event stream into structured, durable, scoped facts you can retrieve later. Writes happen on one path (extract, deduplicate, persist); recall happens on another (retrieve a small, relevant set on demand).

The point is to keep cross-session recall without paying to replay the whole transcript every turn.
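To make the two paths concrete, here is a minimal in-memory sketch. It is not any vendor's implementation: `ToyMemoryLayer`, its method names, the small-talk filter, and the token-overlap ranking are all assumptions standing in for real extraction and retrieval.

```python
from collections import defaultdict
from dataclasses import dataclass, field


def _tokens(text: str) -> set[str]:
    """Lowercased word set; a stand-in for a real embedding or BM25 index."""
    words = (w.strip(".,!?").lower() for w in text.split())
    return {w for w in words if w}


@dataclass
class ToyMemoryLayer:
    # scope (for example a user id) -> list of stored fact strings
    facts: dict[str, list[str]] = field(default_factory=lambda: defaultdict(list))

    def write(self, scope: str, events: list[str]) -> int:
        """Write path: extract candidate facts, deduplicate, persist."""
        stored = 0
        for event in events:
            fact = event.strip()
            # "Extraction" here only drops empty lines and small talk.
            if not fact or fact.lower() in {"ok", "thanks", "hi"}:
                continue
            # Deduplicate on normalized token sets within the same scope.
            if any(_tokens(fact) == _tokens(f) for f in self.facts[scope]):
                continue
            self.facts[scope].append(fact)
            stored += 1
        return stored

    def recall(self, scope: str, query: str, k: int = 2) -> list[str]:
        """Recall path: return at most k facts ranked by token overlap."""
        q = _tokens(query)
        scored = [(len(q & _tokens(f)), f) for f in self.facts.get(scope, [])]
        ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: -s[0])
        return [f for _, f in ranked[:k]]


memory = ToyMemoryLayer()
memory.write("user-1", ["hi", "User prefers aisle seats",
                        "user prefers aisle seats.", "User is vegetarian"])
top = memory.recall("user-1", "Which seats does the user prefer?", k=1)
```

The agent prompt then carries only `top`, not the four raw events, which is the whole cost argument in miniature.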

That design exists because long context alone fails at memory. The LongMemEval paper (Wu et al., ICLR 2025) found commercial chat assistants show a roughly 30% accuracy drop when they have to remember information across sustained interactions, and long-context LLMs drop 30 to 60% on the LongMemEval_S set.

State-of-the-art systems like GPT-4o land at only 30 to 70% accuracy in a setting simpler than the full benchmark. A bigger context window does not fix forgetting.

What actually shipped in mid-2026

Here is the verified state as of June 18, 2026. Stages and dates are from first-party sources where available.

Product	Stage	Launch / GA	Pricing	Runtime / lock-in
Weaviate Engram	GA	June 3, 2026	Free Forever tier; paid from $45/mo (parent cloud)	Weaviate Cloud; Python SDK + REST
Cloudflare Agent Memory	Private beta	April 13, 2026	Not billed during beta	Cloudflare Workers binding
Microsoft Foundry Memory	Public preview	April 29 / June 3, 2026	No published per-op rate	Azure Foundry Agent Service
Mem0	GA	Report April 1, 2026; SDK v2.0.7	Free / $19 / $79 / $249 mo	Apache 2.0 OSS core + hosted

Weaviate Engram is the furthest along of the four. It went GA on June 3 with a Free Forever tier and a weaviate-engram Python SDK. Writes return a run_id immediately, then an async extract-transform-commit pipeline deduplicates and reconciles facts in the background. Recall runs on Weaviate's existing hybrid vector plus BM25 retrieval. There is no published LongMemEval or LoCoMo score.
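The "return a run_id now, commit later" pattern is easy to picture with a queue and a background worker. The sketch below is not the weaviate-engram SDK; `AsyncWriter`, its methods, and the whitespace-normalizing "extraction" are hypothetical, chosen only to show the handoff.

```python
import queue
import threading
import uuid


class AsyncWriter:
    """Hypothetical write front end: enqueue now, process in the background."""

    def __init__(self) -> None:
        self._jobs: queue.Queue = queue.Queue()
        self.status: dict[str, str] = {}
        self.committed: list[str] = []
        self._lock = threading.Lock()
        threading.Thread(target=self._worker, daemon=True).start()

    def write(self, text: str) -> str:
        """Return a run id immediately; the caller never waits on extraction."""
        run_id = uuid.uuid4().hex
        with self._lock:
            self.status[run_id] = "pending"
        self._jobs.put((run_id, text))
        return run_id

    def _worker(self) -> None:
        while True:
            run_id, text = self._jobs.get()
            fact = " ".join(text.split())  # extract/transform stand-in
            with self._lock:
                if fact and fact not in self.committed:  # deduplicate
                    self.committed.append(fact)          # commit
                self.status[run_id] = "done"
            self._jobs.task_done()

    def wait(self) -> None:
        """Block until the background pipeline has drained (useful in tests)."""
        self._jobs.join()


writer = AsyncWriter()
first = writer.write("likes  tea")
second = writer.write("likes tea")
writer.wait()
```

The design trade-off is visible even here: the caller gets low write latency, but a recall issued before the worker finishes will not see the new fact.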

Cloudflare Agent Memory landed in private beta during Agents Week. It is profile-scoped and namespace-separated, exposed as a Workers binding with ingest, remember, recall, list, and forget operations. Under the hood it uses Llama 4 Scout 17B for extraction and Nemotron 3 120B for synthesis, with five-channel retrieval fused by Reciprocal Rank Fusion. Cloudflare is explicit that it is not billing during beta and will give 30 days' notice before charging.
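Reciprocal Rank Fusion itself is a few lines. This sketch shows the standard formula, where each document scores the sum of 1 / (k + rank) across the ranked lists it appears in. The constant k = 60 is the commonly used default, not a value Cloudflare has published, and the channel contents are made up for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists: score(d) = sum over lists of 1 / (k + rank of d)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Higher fused score first; ties broken by id for deterministic output.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Three hypothetical retrieval channels, each returning memory ids best-first.
channels = [["m1", "m2", "m3"], ["m2", "m1"], ["m2", "m4"]]
fused = reciprocal_rank_fusion(channels)
# fused == ["m2", "m1", "m4", "m3"]
```

RRF only needs ranks, not comparable scores, which is why it suits fusing channels as different as vector similarity and keyword match.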

Microsoft Foundry is the enterprise play. Agent Framework 1.0 hit GA on April 3, unifying Semantic Kernel and AutoGen. Memory in Foundry Agent Service entered public preview April 29, and at Build 2026 (May 19 to 22) Microsoft detailed Procedural Memory: it ingests and audits successful agent trajectories, then reuses those patterns to skip steps that previously failed. Hosted Agents GA is targeted for early July 2026.

Mem0 is the incumbent the others are circling. Its open-source core is Apache 2.0, the hosted platform runs from a free Hobby tier up to $249/mo Pro, and the Python SDK reached v2.0.7 on June 17. Mem0 raised roughly $24M and claims to be the exclusive memory provider for AWS's Agent SDK. It also published the most-cited and most-contested document in the space.

Why don't the benchmark numbers add up?

Because almost none of the headline scores survive an independent same-harness run. This is the single most important thing a practitioner needs to know before believing any agent-memory landing page.

Mem0's State of AI Agent Memory 2026 report claims 92.5 on LoCoMo and 94.4 on LongMemEval at about 6,900 tokens per query, with +29.6 points on temporal reasoning and +23.1 on multi-hop. That report is authored by "Engineering Team," is not peer-reviewed, and shipped without a public harness or test logs.

When independent evaluators ran the same systems, the numbers collapsed. Maximem's open harness put Mem0 at 57.5% on LongMemEval. The independent memnode.dev reproduction put Mem0 at 66.9% on LoCoMo. Both gaps are enormous.
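For anyone checking the arithmetic, the gaps fall straight out of the four numbers quoted above. Nothing here is new data; the dictionary names are just labels.

```python
claims = {"LongMemEval": 94.4, "LoCoMo": 92.5}      # Mem0's 2026 report
reproduced = {"LongMemEval": 57.5, "LoCoMo": 66.9}  # Maximem, memnode.dev

# Gap in percentage points between the vendor claim and the reproduction.
gaps = {name: round(claims[name] - reproduced[name], 1) for name in claims}
print(gaps)  # {'LongMemEval': 36.9, 'LoCoMo': 25.6}
```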

Chart: Mem0, vendor-claimed vs independently reproduced (LongMemEval 94.4% vs 57.5%; LoCoMo 92.5% vs 66.9%)

Tellingly, the reproductions line up with Mem0's own 2025 paper. The original Mem0 paper reports a J-score of 67.13 on LoCoMo single-hop, far closer to 66.9% than to 92.5%. The most likely explanation is that the 2026 headline uses a different harness slice or judge than the paper did.

This isn't a Mem0-only problem. The AgentOS transparency audit found that Mem0 publishes Zep's LoCoMo score as 65.99% while Zep publishes its own as 75.14%. The same audit reran Zep's self-reported 71.2% on LongMemEval and got 63.8%.

The audit also flagged that roughly 6.4% of LoCoMo's answer key is wrong, and that the LLM judge accepts about 63% of wrong answers. The measurement instrument itself is leaky.
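A back-of-the-envelope model shows how much a lenient judge can inflate a score. This is a toy, and it rests on loud assumptions: the judge never rejects a correct answer, the 63% acceptance rate applies uniformly to every wrong answer (the audit's figure may describe a narrower test), and answer-key errors are ignored.

```python
def observed_score(true_accuracy: float, false_accept_rate: float) -> float:
    """Score a lenient judge would report under the toy model above."""
    if not (0.0 <= true_accuracy <= 1.0 and 0.0 <= false_accept_rate <= 1.0):
        raise ValueError("both inputs must be fractions between 0 and 1")
    # Correct answers all pass; a share of the wrong ones slips through too.
    return true_accuracy + (1.0 - true_accuracy) * false_accept_rate


for true_acc in (0.50, 0.60, 0.70):
    judged = observed_score(true_acc, 0.63)
    print(f"true {true_acc:.0%} -> judged {judged:.1%}")
```

Under those assumptions a system that is right 60% of the time would be scored at about 85%, which is why the judge matters as much as the memory system being judged.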

For the record, Zep's own paper reports 94.8% versus MemGPT's 93.4% on DMR. The MemGPT team introduced that benchmark, but Zep ran and reported the comparison itself, so treat it as a vendor-side number too.

What this means for you

Stop ranking vendors by their published LongMemEval score. The absolute number tells you almost nothing without the harness, the answer model, and the judge. The only same-harness multi-vendor comparison with public logs right now is Maximem's, and on it the ordering inverts the marketing.
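"Same harness" has a simple shape in code: every backend sees identical cases, one answer model, and one judge, and every decision is logged. The sketch below is not Maximem's harness; `MemoryBackend`, `Case`, and `run_harness` are hypothetical names, and a real harness would also reset backend state between cases.

```python
from dataclasses import dataclass
from typing import Callable, Protocol


class MemoryBackend(Protocol):
    name: str

    def ingest(self, session: list[str]) -> None: ...
    def recall(self, question: str) -> list[str]: ...


@dataclass
class Case:
    session: list[str]
    question: str
    expected: str


def run_harness(
    backends: list[MemoryBackend],
    cases: list[Case],
    answer: Callable[[str, list[str]], str],  # same answer model for all
    judge: Callable[[str, str], bool],        # same judge for all
) -> tuple[dict[str, float], list[dict]]:
    """Score every backend on identical cases and keep a per-case log."""
    if not cases:
        raise ValueError("need at least one case")
    scores: dict[str, float] = {}
    log: list[dict] = []
    for backend in backends:
        correct = 0
        for case in cases:
            backend.ingest(case.session)
            context = backend.recall(case.question)
            predicted = answer(case.question, context)
            ok = bool(judge(predicted, case.expected))
            correct += ok
            log.append({"backend": backend.name, "question": case.question,
                        "context": context, "predicted": predicted, "ok": ok})
        scores[backend.name] = correct / len(cases)
    return scores, log
```

Publishing `log` alongside `scores` is the part vendors tend to skip, and it is what lets someone else check the judge's calls.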

Choose on three durable axes instead: where your agent runs, how portable your memories are, and what latency your UX tolerates. The latency budget framing is useful here: under 300ms feels responsive, over 3s feels slow.

A memory system is several different operations (write, extraction, recall), each with its own latency profile, so test recall latency on your own data.
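Measuring that on your own workload takes little code. This sketch times any recall callable and buckets the result against the 300 ms and 3 s thresholds above. The function names and the middle "tolerable" label are assumptions; the thresholds are the only part taken from the text.

```python
import statistics
import time
from typing import Callable


def recall_latency_ms(recall: Callable[[str], object],
                      queries: list[str]) -> dict[str, float]:
    """Time each recall call and report p50 and p95 in milliseconds."""
    if len(queries) < 2:
        raise ValueError("need at least two queries to estimate percentiles")
    samples = []
    for query in queries:
        start = time.perf_counter()
        recall(query)
        samples.append((time.perf_counter() - start) * 1000.0)
    # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
    cuts = statistics.quantiles(samples, n=100, method="inclusive")
    return {"p50": statistics.median(samples), "p95": cuts[94]}


def classify(latency_ms: float) -> str:
    """Bucket a latency using the under-300 ms and over-3 s thresholds."""
    if latency_ms < 300:
        return "responsive"
    if latency_ms > 3000:
        return "slow"
    return "tolerable"
```

Pass your real client's recall call as `recall` and real user questions as `queries`, then classify the p95 rather than the median, since tail latency is what users notice.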

Pick this if

You deploy on Cloudflare Workers: use Cloudflare Agent Memory. It is free during beta and native to the runtime. Accept that namespaces and bindings are Cloudflare-specific with no documented export.
You want portability and no lock-in: use Mem0's Apache 2.0 core. Self-host it, export your memories, ignore the headline benchmark and test on your workload.
You already run Weaviate: add Engram. Lowest friction, GA today, hybrid retrieval you already understand. Python-only SDK for now.
You are Azure-native enterprise: use Microsoft Foundry Memory. Procedural memory and Entra identity integration matter more here than a leaderboard rank, and storage stays in your Azure tenant.

What would change my mind

A published, independently reproduced same-harness benchmark with open logs showing any vendor clearing 85% on LongMemEval would shift this analysis. So would a credible standard replacing LoCoMo, whose answer-key and judge flaws are now documented. Hindsight argues that LoCoMo and LongMemEval are already obsolete in the 1M-token era and is pushing successors like BEAM, which tests up to 10M tokens.

Watch whether the field adopts one. Until it does, trust your own harness over anyone's report.
