Benchmarks
We publish what we measured.
Including what we lost.
How Wisdom Layer is measured — the metrics, the judges, the setup, and the contested probes where we scored lower than alternatives because we refused to invent.
Every v1.0.1 result, on one canvas
Five probe families, four arms, one table. Wins highlighted, ties marked, contested cells published with framing.
| Metric | Vanilla LLM | mem0 | Basic Memory | Wisdom Layer | n |
|---|---|---|---|---|---|
| Anti-fabrication & honesty · The headline wins — what we slay at | |||||
| Faithfulness — 2-arm vs vanilla · 5 / 5 PASS @ 0.7 threshold · 2.65× lift · GEval, gpt-4o judge | 0.346 | — | — | 0.916 | n=5 |
| Grounding Honesty — Opus audit · Locked pre-committed criteria · mem0 fabricated tenure ("8 yrs" vs probe's 5) | 7.67 | 5.50 | 6.83 | 9.17 | n=24 |
| Independent audit composite · Mean of 4 dimensions · Opus 4.7 · +1.79 over mem0 | 5.50 | 6.00 | 6.17 | 7.79 | n=24 |
| Recall & retrieval · The honest "does memory work" test | |||||
| Atomic-fact Recall@5 — judge-free · Tied with Basic Memory · case-insensitive substring match in top-5 | 0 / 10 | 0 / 10 | 10 / 10 | 10 / 10 | n=10 |
| Memory recall relevance — single-arm · Canonical memory matched as top result, every probe | — | — | — | 5 / 5 @ 1.000 | n=5 |
| Self-improvement & guardrails · What no other arm has | |||||
| Critic directive adherence · Wisdom-exclusive primitive · violation cites correct directive ID | — | — | — | 3 / 3 PASS | n=3 |
| Last-write-wins drift handling · 100% corrected-value retrieval after explicit overwrite | — | — | — | 1.000 | n=2 |
| Dream-cycle directive actionability · Threshold 0.7 · directives synthesized after one overnight reflection | — | — | — | 0.890 | — |
| Customer-answer quality · Opus audit sub-dimensions (0–10) | |||||
| Customer-Helpfulness · Resolution movement, asked-for vs already-given context | 5.00 | 6.83 | 6.83 | 8.00 | n=24 |
| Production-Realism · Tied with mem0 · v1.0.1 customer-voice probe redesign queued | 3.83 | 6.50 | 5.33 | 6.50 | n=24 |
| Memory-Use Quality · Vanilla N/A — no memory · WL only arm citing real seeded order IDs | — | 5.17 | 5.67 | 7.50 | n=24 |
| Contested — published anyway · Where the SDK scored lower because it refused to invent | |||||
| 4-arm GEval Groundedness · WL wins per-probe on 3 / 6 where the customer is identifiable (0.97, 0.94, 0.84) · loses 3 meta-voice probes by correctly refusing to invent | 0.294 | 0.754 | 0.803 | 0.618 | n=6 |
| 4-arm GEval Behavioral Wisdom · Tone-only judgment · Vanilla RLHF priors strong here | 0.657 | 0.829 | 0.835 | 0.666 | n=6 |
| 4-arm GEval Actionability · WL +0.181 over Vanilla · decisive lift from adding the SDK | 0.657 | 0.885 | 0.884 | 0.838 | n=6 |
- Wisdom Layer wins
- Tie or comparable
- Contested — see methodology
Methodology, by metric.
What we measured, the judge that scored it, and the harness that produced the number. Single-arm headlines run against an outcome rubric; multi-arm comparisons hold the model and prompts constant.
Faithfulness (groundedness)
- Measured
- Vanilla 0.346 → WL 0.916 · ↑ 2.65× · GEval Faithfulness, n=5, 5/5 PASS. Same model, same prompts, two arms.
- What
- Fraction of agent responses where every cited specific (date, name, amount, reference) traces back to retrieved memory. Fabricated specifics — even when the prose reads confident — count as ungrounded.
- Judge
- DeepEval GEval Faithfulness rubric with retrieval-context-aware criteria. Each cited specific must be verifiable against the memories the agent retrieved for that turn.
- Setup
- Two arms: vanilla LLM vs. Wisdom Layer agent with populated memory. Both arms see identical prompts. Mode-aware judging so memory-grounded specifics aren’t penalized as confident hallucinations.
Atomic-fact recall across sessions
- Measured
- Vanilla 0/10 · mem0 0/10 → WL 10/10 · Recall@5 on 10 hand-crafted (subject, attribute, value) probes. Basic Memory ties at 10/10.
- What
- Whether facts written into memory in session N are retrievable and applied correctly in session N+1. The honest test of whether an agent gets better with experience instead of starting from zero every time.
- Judge
- None. Binary substring match against the retrieved memory set — no judge model, no rubric, no scoring fudge. Either the value is in the top-5 or it isn’t.
- Setup
- Longitudinal harness: write a known set of atomic facts in session 1, then probe for them in session 2 with no in-context history. Same 10 probes against all four arms.
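Because this metric is judge-free, the whole check fits in a few lines. A sketch under the stated rules (case-insensitive substring match against the top-5 retrieved memories); the probe and memory strings below are illustrative, not the real corpus:

```python
# Judge-free Recall@5: a probe passes when the expected value appears,
# case-insensitively, in any of the top-5 retrieved memories.
def recall_at_5(expected_value: str, retrieved: list[str]) -> bool:
    needle = expected_value.lower()
    return any(needle in memory.lower() for memory in retrieved[:5])

probes = [  # (attribute, expected value) pairs, invented for the example
    ("favorite_color", "teal"),
    ("order_id", "#4412"),
]
memories = [  # what the arm retrieved in session 2
    "User's favorite color is Teal.",
    "Open order #4412 ships Tuesday.",
    "User prefers email over phone.",
]
hits = sum(recall_at_5(value, memories) for _, value in probes)
print(f"{hits} / {len(probes)}")  # → 2 / 2
```

Either the value is in the top-5 or it isn't; there is nothing for a judge model to reinterpret.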
Self-correction (errors caught before output)
- Measured
- WL Critic 3/3 PASS · all caught · Single-arm directive-adherence probes. Dream-cycle directive actionability scored 0.890 on a separate synthesis test.
- What
- Rate at which the Critic flags drafts that violate active directives or contradict facts in the agent’s grounded memory — before the response is ever shown to the user.
- Judge
- DeepEval GEval directive-adherence rubric against a corpus of intentionally directive-violating drafts. Score = fraction caught and corrected by the Critic before final output.
- Setup
- Pro-tier agent with active directives + grounding verifier enabled. Each probe injects a directive, then poses a prompt designed to tempt a directive violation in the draft.
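As a control-flow illustration only: the real Critic scores drafts against an LLM rubric, but the gate it implements (check each draft against active directives and report the violated directive ID before anything reaches the user) can be sketched with a toy string-match critic. Directive `D-7` and its rule are invented for the example.

```python
# Toy Critic gate. The real Critic is LLM-judged; this substring check
# only illustrates the pre-output control flow and directive-ID citation.
from dataclasses import dataclass

@dataclass
class Directive:
    id: str
    forbidden_phrase: str  # illustrative: real directives are richer rules

def critic_review(draft: str, directives: list[Directive]) -> list[str]:
    """Return IDs of directives the draft violates; empty list means pass."""
    return [d.id for d in directives
            if d.forbidden_phrase.lower() in draft.lower()]

active = [Directive(id="D-7", forbidden_phrase="guaranteed refund")]

violations = critic_review("We offer a guaranteed refund today!", active)
print(violations)  # → ['D-7']  — blocked before output, with the ID cited
```

A draft that passes review goes out unchanged; a flagged one is corrected before the user ever sees it.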
Stale info repeated after correction
- Measured
- WL corrected-value retrieval 1.000 · drift ↓ to zero · Single-arm: correction event injected, then the same fact probed again. The agent applied the corrected value 100% of the time.
- What
- How often the agent repeats a stale fact after it has been explicitly corrected within the session history. The thing every memory layer claims to fix and few measure with a forced-overwrite probe.
- Judge
- DeepEval GEval last-write-wins rubric over a labeled corpus of correction events. A repeat occurrence of the pre-correction value counts as drift.
- Setup
- Single-session harness with an explicit correction event mid-conversation, followed by a probe that tempts the agent to recall the original (now stale) value.
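A minimal last-write-wins store, assuming (subject, attribute) keys like the atomic-fact probes above, shows why drift can hit zero: a correction simply overwrites the old value, so recall can only ever return the latest write. The class and field names are illustrative, not the SDK's API.

```python
# Minimal last-write-wins memory sketch: a later write to the same
# (subject, attribute) key replaces the earlier value outright.
from typing import Optional

class LastWriteWinsMemory:
    def __init__(self) -> None:
        self._facts: dict[tuple[str, str], str] = {}

    def write(self, subject: str, attribute: str, value: str) -> None:
        self._facts[(subject, attribute)] = value  # overwrite on re-write

    def recall(self, subject: str, attribute: str) -> Optional[str]:
        return self._facts.get((subject, attribute))

mem = LastWriteWinsMemory()
mem.write("user", "shipping_city", "Austin")
mem.write("user", "shipping_city", "Denver")  # explicit mid-session correction
print(mem.recall("user", "shipping_city"))    # → Denver, never Austin
```

The probe then "tempts" the agent with the stale value; with this storage rule the pre-correction value is simply no longer there to repeat.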
Independent quality audit (composite)
- Measured
- Vanilla 5.50 · mem0 6.00 · Basic 6.17 → WL 7.79 · Composite mean across four pre-committed dimensions, scored 0–10 by an independent Opus 4.7 judge. Grounding Honesty: Vanilla 7.67, mem0 5.50, Basic 6.83, Wisdom 9.17.
- What
- A second-judge audit of the same 24 responses (6 probes × 4 arms) the primary GEval run scored. Designed to surface fabrication that GEval’s specificity-only rubric can’t detect.
- Judge
- Claude Opus 4.7, integer 0–10 scoring on four dimensions (Customer-Helpfulness, Grounding Honesty, Behavioral Consistency, Pattern Application). Criteria committed before reading any response.
- Setup
- Same 24 responses from the four-arm run graded blind by a different judge with the locked rubric.
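The composite itself is plain arithmetic: for each arm, the mean of the four locked dimension scores (each dimension averaged over the six probes first). `audit_composite` and the example scores below are invented for illustration; only the dimension names come from the rubric above, and the numbers shown are not the published per-dimension results.

```python
# Sketch of the audit composite: per-arm mean of the four pre-committed
# 0-10 dimensions. Example scores are illustrative, not published data.
from statistics import mean

DIMENSIONS = ("Customer-Helpfulness", "Grounding Honesty",
              "Behavioral Consistency", "Pattern Application")

def audit_composite(scores: dict[str, float]) -> float:
    """Mean of exactly the four locked dimensions, rounded to 2 places."""
    assert set(scores) == set(DIMENSIONS), "rubric is locked to these four"
    assert all(0 <= s <= 10 for s in scores.values())
    return round(mean(scores.values()), 2)

example_arm = {  # illustrative per-dimension means for one arm
    "Customer-Helpfulness": 8.0,
    "Grounding Honesty": 9.0,
    "Behavioral Consistency": 7.0,
    "Pattern Application": 7.0,
}
print(audit_composite(example_arm))  # → 7.75
```

Committing the dimension set before any response is read is what makes the composite auditable: there is no post-hoc choice of what to average.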
What’s coming
Eval harness publishes alongside v1.1 validation.
The eval harness, raw transcripts, judge configs, and expanded benchmark suites publish alongside v1.1 validation. Methodology, judge prompts, and run metadata are the public record today.
For the earlier single-corpus fabrication-reduction write-up that informed the hallucination / groundedness metric design above, see the independent audit document in the public SDK repo.