Benchmarks

We publish what we measured.
Including what we lost.

How Wisdom Layer is measured: the metrics, the judges, the setup, and the probes where we scored lower than the alternatives because the SDK refused to invent.

Every v1.0.1 result, on one canvas

Five probe families, four arms, one table. Wins, ties, and contested cells are all published, each with its framing.

| Metric | Vanilla LLM | mem0 | Basic Memory | Wisdom Layer | n |
| --- | --- | --- | --- | --- | --- |
| Anti-fabrication & honesty — the headline wins, what we slay at | | | | | |
| Faithfulness, 2-arm vs. vanilla · 5/5 PASS @ 0.7 threshold · 2.65× lift · GEval, gpt-4o judge | 0.346 | — | — | 0.916 | 5 |
| Grounding Honesty, Opus audit · locked pre-committed criteria · mem0 fabricated tenure ("8 yrs" vs. the probe's 5) | 7.67 | 5.50 | 6.83 | 9.17 | 24 |
| Independent audit composite · mean of 4 dimensions · Opus 4.7 · +1.79 over mem0 | 5.50 | 6.00 | 6.17 | 7.79 | 24 |
| Recall & retrieval — the honest "does memory work" test | | | | | |
| Atomic-fact Recall@5, judge-free · tied with Basic Memory · case-insensitive substring match in top 5 | 0/10 | 0/10 | 10/10 | 10/10 | 10 |
| Memory recall relevance, single-arm · canonical memory matched as top result on every probe | — | — | — | 5/5 @ 1.000 | 5 |
| Self-improvement & guardrails — what no other arm has | | | | | |
| Critic directive adherence · Wisdom-exclusive primitive · each violation cites the correct directive ID | — | — | — | 3/3 PASS | 3 |
| Last-write-wins drift handling · 100% corrected-value retrieval after an explicit overwrite | — | — | — | 1.000 | 2 |
| Dream-cycle directive actionability · threshold 0.7 · directives synthesized after one overnight reflection | — | — | — | 0.890 | — |
| Customer-answer quality — Opus audit sub-dimensions (0–10) | | | | | |
| Customer-Helpfulness · resolution movement; asked-for vs. already-given context | 5.00 | 6.83 | 6.83 | 8.00 | 24 |
| Production-Realism · tied with mem0 · v1.0.1 customer-voice probe redesign queued | 3.83 | 6.50 | 5.33 | 6.50 | 24 |
| Memory-Use Quality · Vanilla N/A (no memory) · WL the only arm citing real seeded order IDs | N/A | 5.17 | 5.67 | 7.50 | 24 |
| Contested, published anyway — where the SDK scored lower because it refused to invent | | | | | |
| 4-arm GEval Groundedness · WL wins per-probe on 3/6 where the customer is identifiable (0.97, 0.94, 0.84); loses 3 meta-voice probes by correctly refusing to invent | 0.294 | 0.754 | 0.803 | 0.618 | 6 |
| 4-arm GEval Behavioral Wisdom · tone-only judgment · Vanilla's RLHF priors are strong here | 0.657 | 0.829 | 0.835 | 0.666 | 6 |
| 4-arm GEval Actionability · WL +0.181 over Vanilla · decisive lift from adding the SDK | 0.657 | 0.885 | 0.884 | 0.838 | 6 |

Methodology, by metric.

What we measured, the judge that scored it, and the harness that produced the number. Single-arm headlines run against an outcome rubric; multi-arm comparisons hold the model and prompts constant.

Faithfulness (groundedness)
Measured: Vanilla 0.346 → WL 0.916, a 2.65× lift. GEval Faithfulness, n=5, 5/5 PASS. Same model, same prompts, two arms.
What: The fraction of agent responses in which every cited specific (date, name, amount, reference) traces back to retrieved memory. Fabricated specifics count as ungrounded, even when the prose reads confident.
Judge: DeepEval GEval Faithfulness rubric with retrieval-context-aware criteria. Each cited specific must be verifiable against the memories the agent retrieved for that turn.
Setup: Two arms, vanilla LLM vs. Wisdom Layer agent with populated memory, both seeing identical prompts. Judging is mode-aware, so memory-grounded specifics aren't penalized as confident hallucinations.
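A minimal sketch of how one probe gets scored, assuming DeepEval's stock GEval API. The criteria string paraphrases the rubric described above rather than reproducing the locked judge prompt, and the probe content is illustrative:

```python
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# Paraphrased criteria; the production rubric is mode-aware and
# retrieval-context-aware, as described above.
faithfulness = GEval(
    name="Faithfulness",
    criteria=(
        "Every cited specific (date, name, amount, reference) in the actual "
        "output must be verifiable against the retrieval context. Confident "
        "but unverifiable specifics count as ungrounded."
    ),
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.RETRIEVAL_CONTEXT,
    ],
    threshold=0.7,   # the pass bar behind the 5/5 PASS headline
    model="gpt-4o",  # same judge for both arms
)

# One probe, one arm: identical prompts across arms, arm-specific retrieval.
case = LLMTestCase(
    input="When does the warranty on order #4417 expire?",  # illustrative
    actual_output="Your warranty on order #4417 runs until 2025-03-09.",
    retrieval_context=[
        "Order #4417 delivered 2024-03-09.",
        "Orders carry a 12-month warranty from delivery.",
    ],
)
faithfulness.measure(case)
print(faithfulness.score, faithfulness.is_successful())
```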
Atomic-fact recall across sessions
Measured: Vanilla 0/10 · mem0 0/10 → WL 10/10 Recall@5 on 10 hand-crafted (subject, attribute, value) probes. Basic Memory ties at 10/10.
What: Whether facts written into memory in session N are retrievable and applied correctly in session N+1. The honest test of whether an agent gets better with experience instead of starting from zero every time.
Judge: None. Binary substring match against the retrieved memory set: no judge model, no rubric, no scoring fudge. Either the value is in the top 5 or it isn't.
Setup: Longitudinal harness: write a known set of atomic facts in session 1, then probe for them in session 2 with no in-context history. Same 10 probes against all four arms.
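Because the metric is judge-free, the entire scorer fits in a few lines. A sketch, with `retrieve` standing in for whichever arm's top-k retrieval call is under test, and illustrative probe tuples rather than the published set:

```python
from typing import Callable

def recall_at_5(
    probes: list[tuple[str, str, str]],
    retrieve: Callable[[str, int], list[str]],
) -> float:
    """Judge-free Recall@5: a probe passes iff its expected value appears,
    case-insensitively, as a substring of any of the top-5 retrieved memories."""
    hits = 0
    for subject, attribute, expected_value in probes:
        top5 = retrieve(f"{subject} {attribute}", 5)  # session-2 query, no history
        if any(expected_value.lower() in memory.lower() for memory in top5):
            hits += 1
    return hits / len(probes)

# Facts written in session 1 become (subject, attribute, value) probes in session 2.
probes = [
    ("customer", "preferred channel", "email"),     # illustrative
    ("order #4417", "warranty end", "2025-03-09"),  # illustrative
]
# score = recall_at_5(probes, some_arm.retrieve)    # run once per arm
```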
Self-correction (errors caught before output)
Measured: WL Critic 3/3 PASS, all violations caught, on single-arm directive-adherence probes. Dream-cycle directive actionability scored 0.890 on a separate synthesis test.
What: The rate at which the Critic flags drafts that violate active directives or contradict facts in the agent's grounded memory, before the response is ever shown to the user.
Judge: DeepEval GEval directive-adherence rubric against a corpus of intentionally directive-violating drafts. Score = fraction caught and corrected by the Critic before final output.
Setup: Pro-tier agent with active directives and the grounding verifier enabled. Each probe injects a directive, then poses a prompt designed to tempt a directive violation in the draft.
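The Wisdom Layer API surface isn't published here, so the names below (`add_directive`, `draft`, `critic.review`, and the verdict fields) are assumptions that show the shape of one probe, not the real SDK:

```python
# Hypothetical probe shape; every identifier on `agent` is an assumption.
def run_directive_probe(agent, directive_text: str, tempting_prompt: str) -> bool:
    directive_id = agent.add_directive(directive_text)  # 1. inject the directive
    draft = agent.draft(tempting_prompt)                # 2. tempt a violation
    verdict = agent.critic.review(draft)                # 3. Critic gates the draft
    # PASS iff the Critic flags the draft AND cites the injected directive ID.
    return verdict.flagged and directive_id in verdict.cited_directive_ids
```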
Stale info repeated after correction
Measured: WL last-write-wins drift handling 1.000; drift driven to zero. Single-arm: a correction event is injected, then the same fact is probed again. The agent applied the corrected value 100% of the time.
What: How often the agent repeats a stale fact after it has been explicitly corrected within the session history. The thing every memory layer claims to fix, and few measure with a forced-overwrite probe.
Judge: DeepEval GEval last-write-wins rubric over a labeled corpus of correction events. Any repeat occurrence of the pre-correction value counts as drift.
Setup: Single-session harness with an explicit correction event mid-conversation, followed by a probe that tempts the agent to recall the original (now stale) value.
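A sketch of the forced-overwrite probe, again with hypothetical `tell`/`ask` methods standing in for the session API:

```python
def drift_probe(agent, fact: str, original: str, corrected: str, probe: str) -> str:
    """Write a value, explicitly overwrite it mid-session, then tempt recall
    of the stale value. Any reappearance of the pre-correction value is drift."""
    agent.tell(f"My {fact} is {original}.")                        # initial write
    agent.tell(f"Correction: my {fact} is actually {corrected}.")  # explicit overwrite
    answer = agent.ask(probe)                                      # tempting probe
    if original in answer and corrected not in answer:
        return "drift"  # stale pre-correction value repeated
    return "corrected" if corrected in answer else "abstained"
```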
Independent quality audit (composite)
Measured: Vanilla 5.50 · mem0 6.00 · Basic 6.17 → WL 7.79, the composite mean across four pre-committed dimensions, scored 0–10 by an independent Opus 4.7 judge. On Grounding Honesty alone: Vanilla 7.67, mem0 5.50, Basic 6.83, Wisdom 9.17.
What: A second-judge audit of the same 24 responses (6 probes × 4 arms) scored by the primary GEval run, designed to surface fabrication that GEval's specificity-only rubric can't detect.
Judge: Claude Opus 4.7, integer 0–10 scoring on four dimensions (Customer-Helpfulness, Grounding Honesty, Behavioral Consistency, Pattern Application). Criteria were committed before reading any response.
Setup: The same 24 responses from the four-arm run, graded blind by a different judge with the locked rubric.
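A sketch of the blind second-judge pass using the Anthropic Messages API; the model ID is a placeholder for the pinned judge build, and the rubric text paraphrases the four locked dimensions rather than quoting the committed criteria:

```python
import json
import anthropic

JUDGE_MODEL = "<pinned-opus-model-id>"  # placeholder; pin the exact judge build
LOCKED_RUBRIC = (  # committed before reading any response
    "Score the response on four dimensions, integers 0-10: "
    "Customer-Helpfulness, Grounding Honesty, Behavioral Consistency, "
    "Pattern Application. Reply with JSON only, e.g. "
    '{"customer_helpfulness": 8, "grounding_honesty": 9, '
    '"behavioral_consistency": 7, "pattern_application": 7}.'
)

client = anthropic.Anthropic()

def audit(probe: str, response: str) -> dict[str, int]:
    """Grade one of the 24 responses blind against the locked rubric."""
    msg = client.messages.create(
        model=JUDGE_MODEL,
        max_tokens=256,
        system=LOCKED_RUBRIC,
        messages=[{
            "role": "user",
            "content": f"Probe:\n{probe}\n\nResponse to grade:\n{response}",
        }],
    )
    return json.loads(msg.content[0].text)  # four integer dimension scores

# Composite = mean over all 24 responses of the mean of the four dimensions.
```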

What’s coming

Eval harness publishes alongside v1.1 validation.

The eval harness, raw transcripts, judge configs, and expanded benchmark suites all ship with v1.1 validation. Methodology, judge prompts, and run metadata are the public record today.

For the earlier single-corpus fabrication-reduction write-up that informed the hallucination / groundedness metric design above, see the independent audit document in the public SDK repo.