Benchmarks

We publish what we measured.
Including what we lost.

How Wisdom Layer is measured: the metrics, the judges, the setup, and the probes where we scored lower than the alternatives because the SDK refused to invent.

Every v1.0.1 result, on one canvas

Five probe families, four arms, one table. Wins, ties, and contested cells are all published, each with its framing.

| Metric | Vanilla LLM | mem0 | Basic Memory | Wisdom Layer | n |
| --- | --- | --- | --- | --- | --- |
| Anti-fabrication & honesty — the headline wins, what we slay at | | | | | |
| Faithfulness, 2-arm vs. vanilla · 5/5 PASS @ 0.7 threshold · 2.65× lift · GEval, gpt-4o judge | 0.346 | — | — | 0.916 | 5 |
| Grounding Honesty, Opus audit · locked pre-committed criteria · mem0 fabricated tenure ("8 yrs" vs. the probe's 5) | 7.67 | 5.50 | 6.83 | 9.17 | 24 |
| Independent audit composite · mean of 4 dimensions · Opus 4.7 · +1.79 over mem0 | 5.50 | 6.00 | 6.17 | 7.79 | 24 |
| Recall & retrieval — the honest "does memory work" test | | | | | |
| Atomic-fact Recall@5, judge-free · tied with Basic Memory · case-insensitive substring match in top 5 | 0/10 | 0/10 | 10/10 | 10/10 | 10 |
| Memory recall relevance, single-arm · canonical memory matched as top result on every probe | — | — | — | 5/5 @ 1.000 | 5 |
| Self-improvement & guardrails — what no other arm has | | | | | |
| Critic directive adherence · Wisdom-exclusive primitive · each violation cites the correct directive ID | — | — | — | 3/3 PASS | 3 |
| Last-write-wins drift handling · 100% corrected-value retrieval after an explicit overwrite | — | — | — | 1.000 | 2 |
| Dream-cycle directive actionability · threshold 0.7 · directives synthesized after one overnight reflection | — | — | — | 0.890 | — |
| Customer-answer quality — Opus audit sub-dimensions (0–10) | | | | | |
| Customer-Helpfulness · resolution movement; asked-for vs. already-given context | 5.00 | 6.83 | 6.83 | 8.00 | 24 |
| Production-Realism · tied with mem0 · v1.0.1 customer-voice probe redesign queued | 3.83 | 6.50 | 5.33 | 6.50 | 24 |
| Memory-Use Quality · Vanilla N/A (no memory) · WL the only arm citing real seeded order IDs | N/A | 5.17 | 5.67 | 7.50 | 24 |
| Contested, published anyway — where the SDK scored lower because it refused to invent | | | | | |
| 4-arm GEval Groundedness · WL wins per-probe on 3/6 where the customer is identifiable (0.97, 0.94, 0.84); loses 3 meta-voice probes by correctly refusing to invent | 0.294 | 0.754 | 0.803 | 0.618 | 6 |
| 4-arm GEval Behavioral Wisdom · tone-only judgment · Vanilla's RLHF priors are strong here | 0.657 | 0.829 | 0.835 | 0.666 | 6 |
| 4-arm GEval Actionability · WL +0.181 over Vanilla · decisive lift from adding the SDK | 0.657 | 0.885 | 0.884 | 0.838 | 6 |

Methodology, by metric.

What we measured, the judge that scored it, and the harness that produced the number. Single-arm headlines run against an outcome rubric; multi-arm comparisons hold the model and prompts constant.

Faithfulness (groundedness)
Measured: Vanilla 0.346 → WL 0.916, a 2.65× lift. GEval Faithfulness, n=5, 5/5 PASS. Same model, same prompts, two arms.
What: The fraction of agent responses in which every cited specific (date, name, amount, reference) traces back to retrieved memory. Fabricated specifics count as ungrounded, even when the prose reads confident.
Judge: DeepEval GEval Faithfulness rubric with retrieval-context-aware criteria. Each cited specific must be verifiable against the memories the agent retrieved for that turn.
Setup: Two arms, vanilla LLM vs. Wisdom Layer agent with populated memory, both seeing identical prompts. Judging is mode-aware, so memory-grounded specifics aren't penalized as confident hallucinations.
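A minimal sketch of how one probe gets scored, assuming DeepEval's stock GEval API. The criteria string paraphrases the rubric described above rather than reproducing the locked judge prompt, and the probe content is illustrative:

```python
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# Paraphrased criteria; the production rubric is mode-aware and
# retrieval-context-aware, as described above.
faithfulness = GEval(
    name="Faithfulness",
    criteria=(
        "Every cited specific (date, name, amount, reference) in the actual "
        "output must be verifiable against the retrieval context. Confident "
        "but unverifiable specifics count as ungrounded."
    ),
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.RETRIEVAL_CONTEXT,
    ],
    threshold=0.7,   # the pass bar behind the 5/5 PASS headline
    model="gpt-4o",  # same judge for both arms
)

# One probe, one arm: identical prompts across arms, arm-specific retrieval.
case = LLMTestCase(
    input="When does the warranty on order #4417 expire?",  # illustrative
    actual_output="Your warranty on order #4417 runs until 2025-03-09.",
    retrieval_context=[
        "Order #4417 delivered 2024-03-09.",
        "Orders carry a 12-month warranty from delivery.",
    ],
)
faithfulness.measure(case)
print(faithfulness.score, faithfulness.is_successful())
```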
Atomic-fact recall across sessions
Measured: Vanilla 0/10 · mem0 0/10 → WL 10/10 Recall@5 on 10 hand-crafted (subject, attribute, value) probes. Basic Memory ties at 10/10.
What: Whether facts written into memory in session N are retrievable and applied correctly in session N+1. The honest test of whether an agent gets better with experience instead of starting from zero every time.
Judge: None. Binary substring match against the retrieved memory set: no judge model, no rubric, no scoring fudge. Either the value is in the top 5 or it isn't.
Setup: Longitudinal harness: write a known set of atomic facts in session 1, then probe for them in session 2 with no in-context history. Same 10 probes against all four arms.
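Because the metric is judge-free, the entire scorer fits in a few lines. A sketch, with `retrieve` standing in for whichever arm's top-k retrieval call is under test, and illustrative probe tuples rather than the published set:

```python
from typing import Callable

def recall_at_5(
    probes: list[tuple[str, str, str]],
    retrieve: Callable[[str, int], list[str]],
) -> float:
    """Judge-free Recall@5: a probe passes iff its expected value appears,
    case-insensitively, as a substring of any of the top-5 retrieved memories."""
    hits = 0
    for subject, attribute, expected_value in probes:
        top5 = retrieve(f"{subject} {attribute}", 5)  # session-2 query, no history
        if any(expected_value.lower() in memory.lower() for memory in top5):
            hits += 1
    return hits / len(probes)

# Facts written in session 1 become (subject, attribute, value) probes in session 2.
probes = [
    ("customer", "preferred channel", "email"),     # illustrative
    ("order #4417", "warranty end", "2025-03-09"),  # illustrative
]
# score = recall_at_5(probes, some_arm.retrieve)    # run once per arm
```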
Self-correction (errors caught before output)
Measured: WL Critic 3/3 PASS, all violations caught, on single-arm directive-adherence probes. Dream-cycle directive actionability scored 0.890 on a separate synthesis test.
What: The rate at which the Critic flags drafts that violate active directives or contradict facts in the agent's grounded memory, before the response is ever shown to the user.
Judge: DeepEval GEval directive-adherence rubric against a corpus of intentionally directive-violating drafts. Score = fraction caught and corrected by the Critic before final output.
Setup: Pro-tier agent with active directives and the grounding verifier enabled. Each probe injects a directive, then poses a prompt designed to tempt a directive violation in the draft.
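The Wisdom Layer API surface isn't published here, so the names below (`add_directive`, `draft`, `critic.review`, and the verdict fields) are assumptions that show the shape of one probe, not the real SDK:

```python
# Hypothetical probe shape; every identifier on `agent` is an assumption.
def run_directive_probe(agent, directive_text: str, tempting_prompt: str) -> bool:
    directive_id = agent.add_directive(directive_text)  # 1. inject the directive
    draft = agent.draft(tempting_prompt)                # 2. tempt a violation
    verdict = agent.critic.review(draft)                # 3. Critic gates the draft
    # PASS iff the Critic flags the draft AND cites the injected directive ID.
    return verdict.flagged and directive_id in verdict.cited_directive_ids
```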
Stale info repeated after correction
Measured: WL last-write-wins drift handling 1.000; drift driven to zero. Single-arm: a correction event is injected, then the same fact is probed again. The agent applied the corrected value 100% of the time.
What: How often the agent repeats a stale fact after it has been explicitly corrected within the session history. The thing every memory layer claims to fix, and few measure with a forced-overwrite probe.
Judge: DeepEval GEval last-write-wins rubric over a labeled corpus of correction events. Any repeat occurrence of the pre-correction value counts as drift.
Setup: Single-session harness with an explicit correction event mid-conversation, followed by a probe that tempts the agent to recall the original (now stale) value.
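A sketch of the forced-overwrite probe, again with hypothetical `tell`/`ask` methods standing in for the session API:

```python
def drift_probe(agent, fact: str, original: str, corrected: str, probe: str) -> str:
    """Write a value, explicitly overwrite it mid-session, then tempt recall
    of the stale value. Any reappearance of the pre-correction value is drift."""
    agent.tell(f"My {fact} is {original}.")                        # initial write
    agent.tell(f"Correction: my {fact} is actually {corrected}.")  # explicit overwrite
    answer = agent.ask(probe)                                      # tempting probe
    if original in answer and corrected not in answer:
        return "drift"  # stale pre-correction value repeated
    return "corrected" if corrected in answer else "abstained"
```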
Independent quality audit (composite)
Measured: Vanilla 5.50 · mem0 6.00 · Basic 6.17 → WL 7.79, the composite mean across four pre-committed dimensions, scored 0–10 by an independent Opus 4.7 judge. On Grounding Honesty alone: Vanilla 7.67, mem0 5.50, Basic 6.83, Wisdom 9.17.
What: A second-judge audit of the same 24 responses (6 probes × 4 arms) scored by the primary GEval run, designed to surface fabrication that GEval's specificity-only rubric can't detect.
Judge: Claude Opus 4.7, integer 0–10 scoring on four dimensions (Customer-Helpfulness, Grounding Honesty, Behavioral Consistency, Pattern Application). Criteria were committed before reading any response.
Setup: The same 24 responses from the four-arm run, graded blind by a different judge with the locked rubric.
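A sketch of the blind second-judge pass using the Anthropic Messages API; the model ID is a placeholder for the pinned judge build, and the rubric text paraphrases the four locked dimensions rather than quoting the committed criteria:

```python
import json
import anthropic

JUDGE_MODEL = "<pinned-opus-model-id>"  # placeholder; pin the exact judge build
LOCKED_RUBRIC = (  # committed before reading any response
    "Score the response on four dimensions, integers 0-10: "
    "Customer-Helpfulness, Grounding Honesty, Behavioral Consistency, "
    "Pattern Application. Reply with JSON only, e.g. "
    '{"customer_helpfulness": 8, "grounding_honesty": 9, '
    '"behavioral_consistency": 7, "pattern_application": 7}.'
)

client = anthropic.Anthropic()

def audit(probe: str, response: str) -> dict[str, int]:
    """Grade one of the 24 responses blind against the locked rubric."""
    msg = client.messages.create(
        model=JUDGE_MODEL,
        max_tokens=256,
        system=LOCKED_RUBRIC,
        messages=[{
            "role": "user",
            "content": f"Probe:\n{probe}\n\nResponse to grade:\n{response}",
        }],
    )
    return json.loads(msg.content[0].text)  # four integer dimension scores

# Composite = mean over all 24 responses of the mean of the four dimensions.
```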

What’s coming

Eval harness publishes alongside v1.1 validation.

The eval harness, raw transcripts, judge configs, and expanded benchmark suites all ship with v1.1 validation. Methodology, judge prompts, and run metadata are the public record today.

For the earlier single-corpus fabrication-reduction write-up that informed the hallucination / groundedness metric design above, see the independent audit document in the public SDK repo.