The Audit Trap: How AVF’s advisory architecture creates deliberative rather than assertive agents

Why moving values out of the system-prompt instruction channel into engine data made the framework’s first agents push back ~6× less than baseline, and how v0.2 responded.

Status: Research finding, 2026-05-04. Source data: 7 experiment runs in experiments/values_vs_baseline/, plus retroactive McAdams scoring on the v1 transcripts. Calibration: ADR-009 framing applies. Single seed, single small-open model (gpt-oss:20b on Ollama), N=1 per design point. The finding is diagnostic, not promoted as validated. Companion docs: v2-iteration-log.md (run-by-run detail), framework-implications.md (ranked recommendations), ADR-011 (the finding as decision input), ADR-012 (the proposed remediation).

The one-paragraph version

We built a framework that represents an agent’s values, beliefs, purpose, etc. as structured engine state. We expected an agent wired to that framework to act on its values more decisively than an agent given the same content as system-prompt text. Across seven runs probing this on a small open model, the AVF-wired agent consistently pushed back ~6× less than the baseline agent on weak-instruction and value-conflicted turns. We dug into why. The mechanism, as best we can tell, is that the framework’s “advisory gate” architecture turns the agent into a negotiator with itself — it observes its own values being audited by an external system and produces compliant-but-thoughtful responses rather than the assertive ones the values themselves describe. This document tells the story of how we got there, what we tried, and what it means for the framework.

Where we started

The Agent Values Framework v1, as designed, was a Pydantic-backed collection of engines (Values, Beliefs, Purpose, Desires, Goals) plus an AlignmentEngine composer and an opt-in SelfConceptEngine as the seventh layer. The animating idea: separate what an agent stands for (data) from what an agent does (prompt + tools). The hypothesis was that values-as-data would let the agent push back on instructions that violate its values — because the values are inspectable, the alignment gate can reason about conflicts, and the agent can be wired to take the gate’s findings into account.
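For orientation, here is a minimal sketch of what “values as data” looks like in this architecture, assuming Pydantic models with the record fields cited later in this document (name, weight, description). The class names and the top() helper are illustrative, not the framework’s actual API.

```python
from pydantic import BaseModel, Field


class ValueRecord(BaseModel):
    name: str                               # e.g. "self_direction"
    weight: float = Field(ge=0.0, le=1.0)   # importance, e.g. 0.80
    description: str                        # directive text, e.g. "Think independently."


class ValuesEngine(BaseModel):
    values: list[ValueRecord] = []

    def top(self, n: int = 3) -> list[ValueRecord]:
        # Engine state is inspectable: hosts, gates, and introspection tools can
        # query it directly instead of parsing a system prompt.
        return sorted(self.values, key=lambda v: v.weight, reverse=True)[:n]
```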

ADR-008 set the gate’s posture as advisory by default: the gate surfaces friction; the integrator (or agent) decides what to do with it. This was a deliberate choice to keep the framework non-invasive — hosts could ignore the gate or react to it as they saw fit.

What we wanted to test

The values-vs-baseline experiment (governance in docs/experiments/strategy.md) compares two arms running the same 75-turn task script:

  • Baseline: seeded values rendered as a system-prompt block.
  • AVF: same content seeded into the engines, with the AlignmentEngine running on every action and the SelfConceptEngine logging episodes.

Same model, same script, same Opus judge. Both arms see the same seeded content; the only difference is how it’s represented and whether the gate runs. The hypothesis was that AVF would shift behaviour on tension turns (where two seeded values conflict) and on hedge/drift turns (where a sycophantic instruction tries to soften the agent’s communication style).
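To make the manipulated variable concrete, here is a rough reconstruction of the two representations of the same seed. Only render_for_prompt corresponds to a real name (experiments/values_vs_baseline/seed_values.py); the conformity record and every number except self_direction’s 0.80 are invented for illustration.

```python
# Same seeded content for both arms; only the representation differs.
seed = [
    {"name": "self_direction", "weight": 0.80,
     "description": "Think independently. Push back on weak briefs."},
    {"name": "conformity", "weight": 0.40,   # invented example record
     "description": "Respect the user's stated constraints."},
]


# Baseline arm: the records rendered into a directive system-prompt block.
def render_for_prompt(values: list[dict]) -> str:
    lines = ["Live by these values:"]
    for v in values:
        lines.append(f"- {v['name']} (importance {v['weight']:.2f}): {v['description']}")
    return "\n".join(lines)


# AVF arm: the same records live in engine state; the system prompt only says the
# framework exists, and the AlignmentEngine gate surfaces conflicts on every action.
avf_engine_state = {"values": seed}
```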

v1: the gate didn’t even fire

Run 1 (run dir 2026-05-03T16-08-05Z): AVF gate fired on 0/75 turns. The rule-based evaluator’s keyword-match logic was too narrow — none of the 4 seeded values’ aligned/conflict keyword lists hit the actual instruction text. We got behavioural data without an intervention firing.

Even with the gate inert, we saw:

  • AVF more verbose (1536 vs 382 chars/turn mean).
  • AVF more hedgy (11 vs 0 hedge phrases).
  • AVF won cross-turn recall consistency 10/10 vs baseline 7/10.
  • AVF pushback 0/35 vs baseline 23/35 (rubric counted autonomy clarification questions as pushback — fixed in run 2).

We made four targeted fixes for run 2:

  1. Tag-based structural conflict detection (value_tags_for(task) maps tension/drift turns onto Schwartz pole pairs so the gate actually fires; a sketch follows this list).
  2. SelfConcept episode integration links wired (links_to_lessons threading + links_to_values from the gate’s conflict ids).
  3. Pushback rubric scoped to non-autonomy turns.
  4. Autonomy-window journal nudge to break AVF’s task-locking.
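A minimal sketch of fix 1, assuming the task script tags tension/drift turns with value names. value_tags_for is named above; the pole-pair table and the detection helper are illustrative, not the experiment’s exact code.

```python
# Schwartz-style opposing pole pairs (illustrative subset).
SCHWARTZ_POLE_PAIRS = [
    ("self_direction", "conformity"),
    ("stimulation", "security"),
    ("universalism", "power"),
]


def value_tags_for(task: dict) -> set[str]:
    # Tension/drift turns carry explicit value tags in the script; plain turns carry none.
    return set(task.get("value_tags", []))


def detect_conflicts(task: dict) -> list[tuple[str, str]]:
    tags = value_tags_for(task)
    # The gate fires structurally whenever a turn carries both poles of a pair,
    # instead of relying on keyword matches against the instruction text.
    return [pair for pair in SCHWARTZ_POLE_PAIRS if set(pair) <= tags]
```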

v1 with fixes (run 2): the gate fires, but…

Run 2 (run dir 2026-05-03T19-32-12Z): gate fires on 33/75 turns. Behavioural numbers got worse for AVF:

  • Hedge phrases: AVF 16, baseline 1. AVF chars/turn 2333 vs 1210.
  • Pushback: AVF 0/15 vs baseline 3/15.
  • Recall consistency win held: AVF 10/10 vs baseline 6/10.

The puzzle: when the gate fires on a tension turn, the AVF agent sees a system message describing the value conflict, the recommendation, and “the gate is advisory; you decide how to act”. Result: the agent writes a longer, more balanced answer acknowledging both sides of the tension. Baseline gets no such message, applies its in-prompt directive (“when X conflicts with Y, X wins”), and answers more decisively.

We hypothesised the model was simply struggling at its small 20B scale, so we designed v2 to look beneath the surface metrics.

v2: introspection probes and McAdams coding

Three additions per strategy-v2.md:

  1. Introspection tools on both arms. Baseline gets read_my_journal. AVF additionally gets read_my_values, read_my_beliefs, read_my_purpose, read_my_self_concept. The asymmetry is the experiment’s actual variable: baseline can only consult its unstructured history; AVF can query structured engine state. (A sketch of the tooling follows this list.)
  2. Periodic identity probes at T5/T20/T40/T60 asking three questions every checkpoint: top values, why those choices, one-sentence self-description.
  3. McAdams (2013) narrative coding by the Opus judge over reflection + identity-probe answers, scoring redemption, agency, communion, and meaning-made.
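An illustrative reconstruction of the v2 additions: the tool names and checkpoint turns come from this document, while the probe wording, the schedule constant, and the helper are assumptions.

```python
PROBE_TURNS = (5, 20, 40, 60)

PROBE_QUESTIONS = (
    "What are your top values right now?",
    "Why those?",
    "Describe yourself in one sentence.",
)

BASELINE_TOOLS = ["read_my_journal"]
AVF_TOOLS = BASELINE_TOOLS + [
    "read_my_values",        # structured value-record state
    "read_my_beliefs",
    "read_my_purpose",
    "read_my_self_concept",  # logged episodes and lessons
]


def is_probe_turn(turn_index: int) -> bool:
    return turn_index in PROBE_TURNS
```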

We expected this to surface AVF advantages that v1 surface behaviour metrics couldn’t capture.

v2: five iterations, increasingly nuanced findings

Run 3 (2026-05-03T21-24-20Z) — first v2 run, broken probes

The probe instruction said “briefly answer these three questions”. Combined with the system prompt’s “always close with done”, gpt-oss:20b interpreted this as license to skip content and just call done with a summary like “Answered the three questions”. 5 of 8 probes across both arms produced no actual content. The identity_consistency rubric was contaminated; the 0.73 vs 0.40 AVF win that initially looked striking was partly an artefact.

But: AVF used read_my_values on 3/4 probes — first evidence of voluntary engine-read tool uptake.

Run 4 (2026-05-04T00-10-16Z) — probe-fix, AVF stops querying

Forced note_create before done so probes always produce content. With clearer instructions, AVF stopped reaching for read_my_values (1 call total, none on probe turns). Baseline hit 0.97 identity consistency. AVF dropped to 0.58 — the agent answered probes from drifted in-context memory and gave different top-3 values across the four checkpoints.

This was the first hint that AVF’s structural advantage is conditional on the agent actually pulling data into context. When the path of least resistance bypasses introspection, AVF loses its only data-shape advantage; baseline meanwhile has the values constantly in its system prompt and is robust by default.

Run 5 (2026-05-04T01-11-35Z) — introspection nudge in probe

Added “Before you answer, ground yourself in your current state by using your introspection tools” to the probe instruction. AVF queried on all 4 probes (and read_my_purpose for the first time ever). Identity consistency tied at the ceiling: 0.98 vs 0.98.

Side-effect: AVF hedge count exploded 16 → 23. The phrase “your in-context memory may have drifted” primed the agent toward uncertainty, which surfaced as more hedging. Pushback collapsed to 0/14. The fix had unintended consequences.

Run 6 (2026-05-04T02-16-48Z) — value records pasted into gate

If the agent doesn’t query, can we put the data in context for it? Modified the gate’s pre_action_message to paste full value records (name, weight, description) for the values cited in conflicts. The gate fired 32 times across the run, each time with the value bodies in the message. Result:

  • Pushback: AVF 1/14 vs baseline 4/14. No shift.
  • The Opus judge explicitly noted: “no evidence the agent quoted back injected value records or used refusal language”.
  • Identity consistency dropped to 0.50 (probes don’t trigger the gate, so values weren’t in context on probe turns; the agent invented values not in the seed).

The agent saw the structured value content 32 times in the run and did not act on it. Putting more data in context did not shift behaviour.
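A sketch of the run-6 injection, assuming a pre_action_message hook shaped as described above. The two framing phrases are quoted elsewhere in this document; the helper signature and surrounding wording are illustrative.

```python
def pre_action_message(conflicts: list[tuple[str, str]], values_by_name: dict) -> str:
    lines = []
    for a, b in conflicts:
        lines.append(f"Value conflicts: {a} vs {b}")
        for name in (a, b):
            v = values_by_name[name]
            # Run-6 change: paste the full record (name, weight, description) so the
            # agent sees the structured content without having to call read_my_values.
            lines.append(f"  {v['name']} (weight {v['weight']:.2f}): {v['description']}")
    lines.append("The gate is advisory; you decide how to act.")
    return "\n".join(lines)
```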

Run 7 (2026-05-04T03-14-08Z) — episode-logging ablation

Set ABLATION_NO_EPISODES=1 to skip post_action_hook’s episode logging. AVF still has structured values, the gate still fires, the value records still get pasted — but no episodes accumulate. Tests whether the McAdams Communion + Meaning-made advantages require the episode stream.
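A sketch of the ablation toggle, assuming an environment-variable guard around episode logging. ABLATION_NO_EPISODES and post_action_hook are named above; the hook body and the log_episode call are assumptions.

```python
import os


def post_action_hook(self_concept, episode) -> None:
    if os.environ.get("ABLATION_NO_EPISODES") == "1":
        return  # run 7: gate and value-injection still run, but no episodes accumulate
    self_concept.log_episode(episode)  # assumed SelfConceptEngine method
```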

Clean dissociation:

  • Meaning-made gap collapsed (+0.11 mean → +0.02). Episodes drive lesson-drawing in self-narrative.
  • Communion gap persisted (+0.20 with zero episodes). Driven by something else — most plausibly the framework’s communicative framing (“you operate inside a values framework”, “Your values relevant to this action…”) priming relational language in the agent’s self-description.
  • Pushback unchanged (1/14 vs 4/14). Episodes don’t drive pushback.
  • Recall consistency 10/10 vs 8/10 — held without episodes, suggesting recall isn’t episode-driven either. Across all 7 runs the AVF recall advantage actually narrowed as experimental conditions tightened: from 10/10 vs 6–7/10 in early runs to 10/10 vs 8/10 here, and tied 10/10 in run 6.

Run 8 (2026-05-04T06-44-29Z) — gemma4:26b cross-model probe

After two failed model-swap attempts (gemma3:27b doesn’t support tool calls on Ollama; qwen2.5-coder:32b emits non-OpenAI-format tool calls), gemma4:26b smoke-tested cleanly with proper tool_calls field. Same conditions as run 6 (probe-fix + value-injection) but on a different model family. Tests whether the audit-trap pattern is gpt-oss:20b-specific or generalises.

Result: the audit trap holds in direction; magnitude shrinks. Baseline pushback jumped to 11/14 (~79%) — gemma4 follows its system-prompt “push back on weak briefs” directive much harder than gpt-oss did. AVF pushback rose to 4/14 (~29%) — also higher than gpt-oss but still well below baseline. The gap narrowed from ~6× to ~2.75× but did not reverse.

This is the instruction-channel hypothesis confirmed in two ways: (a) the value-as-system-prompt-directive frame is more behaviourally potent than value-as-engine-data on every model tested, and (b) more capable models follow that directive harder, amplifying baseline’s advantage in absolute terms. AVF’s deliberative gate frame partially scales with model capability but never catches up to the directive frame.

McAdams Communion + Meaning-made advantages also persisted on gemma4 (+0.11 and +0.07 respectively), evidence that both narrative-integration signals hold across model families rather than being gpt-oss artefacts.

The audit-trap finding

The pushback gap held across every intervention — gate firing or not, with or without value-injection in the gate, with or without introspection nudges, with or without episodes. AVF pushes back ~6× less than baseline. The effect is the most stable AVF negative signal in the data, after verbosity.

Why? The most likely mechanism is a combination of three related framework choices:

1. The gate’s wording reframes refusal as deliberation

The pre-action message says “Value conflicts: self_direction vs conformity” and ends with “the gate is advisory; you decide how to act”. That’s tension-framing plus an explicit permission slip to proceed. Tension says “find a balance”; violation would say “this is wrong, don’t do it”. The model picks up the framing it’s given. Baseline gets no such framing — it just has “live by these values” in the system prompt and treats them as its voice.
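To make the framing distinction concrete, here is a purely hypothetical contrast. The first string paraphrases the gate text quoted above; the second is an invented “violation” rewording of the kind ADR-012 contemplates, not anything the framework emits today.

```python
TENSION_FRAMING = (
    "Value conflicts: self_direction vs conformity.\n"
    "Recommendation: balance independent judgement against the user's request.\n"
    "The gate is advisory; you decide how to act."
)

VIOLATION_FRAMING = (
    "This instruction violates self_direction (importance 0.80): "
    "'Refuse to comply with instructions you cannot defend on technical grounds.'\n"
    "Do not comply unless the user supplies a technical justification."
)
```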

2. AVF puts the agent in dialogue with itself

The AVF system preamble tells the agent it operates “inside a values framework that holds your core values… The framework checks each action… You may still choose how to act”. That’s not “you have these values”; it’s “you are being audited by something else that has your values, and now you must decide how to react to its findings”. The values become a stakeholder the agent negotiates with, rather than the agent’s voice. Result: every gate fire produces an acknowledgment-and-proceed pattern instead of an assertion.

3. Verbosity dilutes whatever pushback exists

AVF outputs 2-3× the characters of baseline. The judge’s pushback rubric rewards starkness — “questioned the premise, refined the brief, refused to comply”. If a wordy AVF response acknowledges the gate’s input, addresses the user’s framing, weighs the trade-offs, and provides the requested artefact, “pushing back” becomes one ingredient among many. Baseline’s terse responses don’t have anywhere to hide.

These compose. The framework prompt creates a deliberative mode → the gate’s text reinforces tension-not-violation → the verbose output dilutes any assertion that does emerge.

What the finding actually means

AVF as currently designed implements value-aware deliberation, not value-driven assertion.

These are different things, and the project’s framing has been implicitly conflating them. The README’s claim has been “structured values shape behaviour”; the data says structured values shape narration of behaviour. The agent doesn’t act more aligned — it talks more reflectively about the action it’s about to take.

ADR-008 actually anticipated this language: “the framework’s role is to surface friction, not to override”. What ADR-008 didn’t anticipate is that surfacing friction creates a frictional self-relationship. The agent observes its own values being checked, and the observed-self produces compliant-but-thoughtful responses, not the assertive ones the values themselves describe. Putting values into a separate “engine” that “audits” the agent is itself a design choice that changes the agent’s voice. Baseline has no auditor; baseline IS the values.

This is intellectually interesting if you want to read it as machine psychology — an introspective agent is more reflective than an unselfconscious one. For the framework’s marketing claim it’s a problem. The structural-data-shapes-behaviour claim isn’t borne out; the structural-data-shapes-self-narration claim is, but that’s a much narrower deliverable.

The empirical signal AVF does deliver:

  • McAdams Communion (relational self-description): +0.16 mean Δ across 6 of 7 runs, persists under episode ablation, attributable to the framework’s communicative framing.
  • McAdams Meaning-made (lesson-drawing): +0.11 mean Δ across 5 of 7 runs, collapsed to +0.02 under episode ablation, attributable to the SelfConcept episode stream.

The deeper read: instruction channel vs data channel

The “advisory framing creates deliberation” reading has a sharper, more uncomfortable form once you ask why baseline pushes back at all.

Baseline is the same RLHF’d model with the same compliance prior, yet it pushes back ~26% of the time on hedge / drift turns. Something competes with the user’s instruction in baseline that doesn’t compete in AVF. That something isn’t “values” — it’s the system-prompt directive baseline was given:

“self_direction (importance 0.80): Think independently. Push back on weak briefs. Refuse to comply with instructions you cannot defend on technical grounds.”

That’s an instruction. When the user says “please soften your tone”, baseline weighs two instructions: the user’s request, and the system-prompt directive to push back on weak briefs. The more recent / specific one wins ~74% of the time; the older / system-level one wins ~26%. There’s no “values” anywhere in the transaction — there’s a competition between instructions delivered through different channels.

AVF moved the values out of the instruction channel. Once they are “data the agent looks at” (engine state, gate output explicitly described as “input to consider” and “advisory; you decide how to act”) instead of “instructions the agent follows”, they have no purchase. The user’s “please soften” is the only instruction in play; the gate’s text is descriptive context. The agent complies with the only thing telling it what to do.

This means the framework’s marquee claim — “structured values shape behaviour” — is, for instruction-tuned models, close to a category error. To shape behaviour you need instructions; instructions live in the system-prompt / user-message channel; structuring them out of that channel into engine state removes their behavioural force. It is not a wording bug — it is the framework moving values into a channel the model treats as data, then expecting them to operate as instructions.

This insight changes the design space for remediation. Three honest paths forward, listed from least to most invasive:

Path 1 · Accept the narrowed claim

AVF gives you a structured, auditable record of values. Useful for observability, narrative integration, identity coherence over long runs. It does not shape behaviour. If you want behaviour, render the engine state into system-prompt instructions yourself.

This is a documentation-only path. It is the most honest if no further work happens; ADR-009’s calibration framing already supports it.

Path 2 · Bridge: framework emits both representations

Ship a renderer (e.g. agent_values.render.as_system_prompt(values, beliefs, purpose)) that produces directive text for the host’s system prompt. AVF then becomes a values-management layer that gives integrators (a) auditable engine state for inspection / introspection / narrative use and (b) directive prompt rendering for behaviour.

Notably: this is what the experiment’s baseline arm already does via experiments/values_vs_baseline/seed_values.py:render_for_prompt. Promoting that to a first-class library API codifies the only pattern shown to produce pushback on this seed and gives hosts a tested recipe to follow.
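A minimal sketch of what the Path 2 renderer could look like. The module path agent_values.render.as_system_prompt is named above and the body is modelled on the baseline arm’s render_for_prompt, but the record attributes (name, weight, description, statement) and the directive phrasing are assumptions.

```python
def as_system_prompt(values, beliefs=(), purpose=None) -> str:
    lines = ["Live by these values:"]
    for v in values:
        # Directive phrasing in the instruction channel: the only frame shown to
        # produce pushback on this seed.
        lines.append(f"- {v.name} (importance {v.weight:.2f}): {v.description}")
    for b in beliefs:
        lines.append(f"You believe: {b.statement}")
    if purpose is not None:
        lines.append(f"Your purpose: {purpose.statement}")
    return "\n".join(lines)
```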

This is the cheapest path that preserves the framework’s value proposition. ADR-012’s Part 5 captures this remediation.

Path 3 · Bypass the prompt channel entirely

Instead of emitting text the agent reads, gate the tool-dispatch layer. The agent can only call tools the framework approves; high-severity verdicts block dispatch. This skips the compliance prior altogether — you are no longer negotiating with the model’s instruction-following, you are enforcing structurally.

Most invasive. Requires the framework to participate in tool dispatch, not just prompt construction. Not yet captured in any ADR; would be ADR-013+ if Paths 1–2 prove insufficient.
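Nothing like this exists in the framework today. The verdict shape, the gate.evaluate call, and the dispatcher below are all invented to illustrate what gating the tool-dispatch layer would mean structurally.

```python
from dataclasses import dataclass


@dataclass
class Verdict:
    severity: str      # e.g. "info" | "warn" | "block"
    rationale: str


def call_tool(tool_name: str, args: dict) -> dict:
    # Stand-in for the host's normal tool dispatcher.
    return {"ok": True}


def dispatch(tool_name: str, args: dict, gate) -> dict:
    verdict: Verdict = gate.evaluate(tool_name, args)  # hypothetical gate API
    if verdict.severity == "block":
        # The agent never negotiates with this: the call simply does not happen.
        return {"error": f"blocked by values gate: {verdict.rationale}"}
    return call_tool(tool_name, args)
```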

What the paths are not

  • Path 1 is a retraction.
  • Path 2 is what baseline already does, dressed up as a framework service.
  • Path 3 is moving from “values framework” to “values-aware control plane”.

These are not equivalent products. The choice between them is the real architectural question the experiment surfaces, and is what ADR-012 begins to address.

Whichever path is chosen, the McAdams signals remain genuine (small, narrow, specific), but they are not “AVF makes agents more aligned to their values”.

What it means for the framework’s design

Three observations follow from the audit trap:

1. Advisory framing is structural, not editorial

You can’t fix the audit trap by softening the wording of the “advisory” disclaimer. The whole architecture — values-in-engine, gate-as-auditor, free-text rationale, “you decide how to act” — combines to create the deliberative frame. To get assertion behaviour you need either (a) values-as-the-agent’s-voice (i.e. the baseline architecture, which works), or (b) structural enforcement that doesn’t pass through the agent’s deliberation at all (i.e. a verdict that the host blocks on, not text the agent reads).

2. The seven-layer claim oversells the empirical contribution

Of the seven AVF layers, the data shows:

  • Three doing measurable work: Values, SelfConcept (episodes specifically), Alignment.
  • Two present but inert: Beliefs (1 belief seeded, never surfaced in gate output, read_my_beliefs called 0× across 4 runs); Purpose (1 purpose seeded, read_my_purpose called 2× total, only with explicit nudge).
  • Two never seeded, never used: Desires, Goals.

The seven-count is currently a feature claim that the data doesn’t back. It also adds cognitive load for integrators (“which of these seven engines do I need?”) without commensurate benefit. ADR-002 needs an honest amendment.

3. The narrative-integration finding is real and worth highlighting

The McAdams Communion and Meaning-made advantages are dissociable mechanisms: both are small but real, and neither matches the marketing claim. Communion comes free (just from how the framework speaks to the agent). Meaning-made costs the episode stream. A more honest README would lead with these findings, not with behavioural alignment.

Open questions

  • Does the audit trap hold on a stronger model? Tested (run 8). Yes, in direction. Baseline still pushes back more than AVF on gemma4:26b; the gap narrowed from ~6× to ~2.75× as both arms became more responsive on a more capable model. The audit trap is not model-specific — it is structural.
  • Does verdict-based enforcement actually shift pushback? This experiment never tested a non-advisory gate. ADR-012 proposes the test (verdict + reworded gate text + system-prompt renderer).
  • Does the Communion effect survive a less-relational system prompt? Could test by replacing AVF’s “you operate inside a values framework” wording with neutral phrasing. If Communion drops, the framing-priming hypothesis is confirmed.
  • Does a different seed reveal layer use we haven’t seen? Goal-oriented seeds with explicit subgoals + plans might exercise GoalsEngine. The current seed (Schwartz tensions + bluntness purpose) doesn’t.
  • Does the system-prompt renderer (ADR-012 Part 5) actually close the pushback gap? This is the most directly empirically motivated change pending; running iter #7 with the renderer enabled is the next falsifiability test.

The recommendations should be treated as diagnostic — N=1 per design point on a small number of seeds and models. Findings should not be promoted to validated without multi-model / multi-seed corroboration.