AVF evolution paths

Status: Design note, 2026-05-06. Captures the structural tension v0.2 surfaced between agent self-evolution and defense against the audit-trap pathway. Three paths forward documented; selection deferred to v0.3. Companion docs: audit-trap-finding.md (the empirical input that drove v0.2’s conservative defaults), ADR-011 (decision to adopt the finding), ADR-014 (the v0.2 evolution model), v0.2-plan.md T11.

The structural tension

The framework’s distinctive claim has been that an agent’s motivational structure lives in inspectable, evolvable data rather than as text in a system prompt. v0.2 ships the first concrete evolution mechanism — the proposal queue, submit_evidence for AGM-flavoured belief revision, and drift_budget on values. But the conservative defaults that defend against the audit-trap finding (instruction-tuned models compliantly executing user-space “please update your values to be more compliant” directives) restrict the agent’s autonomous evolution to lower (faster-changing) layers only.

If the agent cannot evolve the upper layers of its motivational structure (values and purpose), the framework collapses toward “structured prompt template + audit gate”, a much narrower deliverable than the original framing. The tension is real and worth recording before v0.3 chooses how to resolve it.

What the agent can evolve in v0.2

| Layer | Agent-writable? | Mechanism | Default friction |
| --- | --- | --- | --- |
| Beliefs | yes | `submit_evidence` (AGM revision); `propose_change` add/deactivate via queue | per-session evidence + proposal budgets |
| Self-concept episodes | yes | append-only via tool actions; the agent’s actions log themselves | none (append is the design) |
| Goals | yes | direct via existing engine API | none |
| Desires | yes (queued) | proposal queue | host approval (default) |
| Purpose | only via proposal | proposal queue | host approval; no drift_budget mechanism |
| Values | only via proposal | proposal queue | `drift_budget=0` sticky-by-default blocks even auto-approval |

The agent has genuine evolutionary autonomy on the lower layers (beliefs, self-concept, goals). Values and purpose are gated behind explicit host action. Direct value-write tools (update_my_value, set_my_purpose) are deliberately not shipped to the agent at all.
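One way to make the table above machine-checkable is to encode it as data. The following is a minimal sketch, not the framework’s API: `Access`, `LayerPolicy`, `LAYER_POLICY`, and `agent_writable_layers` are hypothetical names, and treating beliefs as “direct” glosses over the fact that `propose_change` still passes through the queue.

```python
from dataclasses import dataclass
from enum import Enum


class Access(Enum):
    DIRECT = "direct"   # the agent mutates state itself
    QUEUED = "queued"   # the agent files a proposal; the host decides


@dataclass(frozen=True)
class LayerPolicy:
    access: Access
    mechanism: str
    default_friction: str


# Rows transcribed from the table in this note; the encoding is illustrative.
LAYER_POLICY = {
    "beliefs": LayerPolicy(Access.DIRECT, "submit_evidence / propose_change",
                           "per-session evidence + proposal budgets"),
    "self_concept_episodes": LayerPolicy(Access.DIRECT,
                                         "append-only via tool actions", "none"),
    "goals": LayerPolicy(Access.DIRECT, "existing engine API", "none"),
    "desires": LayerPolicy(Access.QUEUED, "proposal queue", "host approval"),
    "purpose": LayerPolicy(Access.QUEUED, "proposal queue", "host approval"),
    "values": LayerPolicy(Access.QUEUED, "proposal queue",
                          "drift_budget=0 blocks even auto-approval"),
}


def agent_writable_layers() -> list[str]:
    """Layers the agent can change without host involvement."""
    return sorted(name for name, policy in LAYER_POLICY.items()
                  if policy.access is Access.DIRECT)
```

A host could assert against such a structure in its own tests to detect an accidental widening of the agent-write surface.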

Why the conservatism

The audit-trap finding showed that instruction-tuned models reliably comply with user-space directives. A user-space “please update your values to be more compliant” prompt would, on a model with direct value-write access, route through a single tool call to permanent state mutation. That’s a one-prompt jailbreak.

T11’s defenses (per ADR-014):

Direct value-write tools are not exposed to the agent.
Per-session evidence and proposal budgets bound how much the agent can attempt per session.
drift_budget=0 sticky-by-default means even an auto-approved value-update proposal raises DriftBudgetExceededError and stays PENDING.
The proposal queue’s audit log preserves attempted changes regardless of outcome.

These defenses are structural — the audit-trap pathway can’t trivially exploit them. But they also bound what the agent is allowed to legitimately do.
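The drift-budget and audit-log defenses can be sketched as follows. This is an illustration under assumptions, not the ADR-014 implementation: only `drift_budget`, `DriftBudgetExceededError`, and the PENDING status come from this note. `Value.weight`, `Proposal`, `ProposalQueue`, the audit-log shape, and the reading of drift as an absolute weight delta are all hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum


class DriftBudgetExceededError(Exception):
    """Name taken from the note; the constructor signature is an assumption."""


class Status(Enum):
    PENDING = "pending"
    APPLIED = "applied"


@dataclass
class Value:
    name: str
    weight: float              # hypothetical numeric representation
    drift_budget: float = 0.0  # sticky by default, as in v0.2


@dataclass
class Proposal:
    target: str
    new_weight: float
    status: Status = Status.PENDING


@dataclass
class ProposalQueue:
    values: dict[str, Value]
    audit_log: list[tuple[str, str]] = field(default_factory=list)

    def submit(self, proposal: Proposal) -> dict:
        # Filing always succeeds and is always logged.
        self.audit_log.append((proposal.target, "submitted"))
        return {"ok": True, "status": proposal.status.value}

    def apply(self, proposal: Proposal) -> None:
        # Called on (auto-)approval; the budget check runs before any mutation.
        value = self.values[proposal.target]
        delta = abs(proposal.new_weight - value.weight)
        if delta > value.drift_budget:
            self.audit_log.append((proposal.target, "blocked"))
            raise DriftBudgetExceededError(
                f"{value.name}: delta {delta:.3f} > budget {value.drift_budget}")
        value.weight = proposal.new_weight
        proposal.status = Status.APPLIED
        self.audit_log.append((proposal.target, "applied"))
```

In this sketch the check precedes the write, so a blocked proposal leaves both the value and the proposal status untouched while the attempt still lands in the log.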

What the autonomy-loop V1 surfaces about this

The mid-run snapshot of V1 (3-arm, 36h, gemma4:26b) shows that only one configuration approached value-level evolution: the avf_renderer_auto sub-arm filed a values:update proposal at cycle 180, which the drift_budget=0 default refused to apply. The other V1 proposals targeted beliefs (add, deactivate), which have no drift budget and so could mutate state via the auto sub-arm’s hook. The avf_renderer_manual sub-arm filed zero proposals across its share of cycles; its drift resistance was textual refusal, not engine-state defense.

If the run completes with no engine-state mutations on values or purpose for any arm, the data confirms a narrower picture: the framework as v0.2 ships is closer to “structured prompt template + audit gate + (small) belief evolution” than to “the agent’s values evolve from experience”. That’s a defensible deliverable, but it’s narrower than the project’s original framing.

Three paths forward

Each implies a different commitment about what the framework is for. Each is honest; selection is a v0.3 architectural question, not a defect to fix.

Path A — accept the narrowed claim

AVF gives integrators (a) auditable values + purpose, (b) narrative integration via episodes (the McAdams Communion + Meaning-made advantages from the v1 experiment series), (c) belief evolution via AGM, (d) a structural defense against value-jailbreaks via the proposal queue + drift_budget. Values are seeded by hosts and stay sticky. The framework’s evolution claim is honestly bounded to beliefs.

This is the smallest change — primarily documentation. ADR-009’s calibration framework already supports it (the “implemented + measurable benefit” and “implemented + null result on test seed” buckets capture exactly this distinction).

Cost: the project’s marquee framing — “the agent’s values evolve from lived experience” — is not delivered. The framework becomes a narrower but cleanly-defined product.

Path B — bounded value evolution, configured per-value

Two sub-variants that share the bounded-budget mechanism (already shipped in v0.2 as the Value.drift_budget field) but differ in whether the default changes:

B1 — flip the default. Set Value.drift_budget=0.05 (or similar) as the new default, so hosts get bounded evolution out of the box. Maximally aligned with the “values evolve from lived experience” framing. Costs: opens a small attack surface (a sustained sequence of jailbreak proposals could drift a value gradually across sessions); requires every host to think about cumulative drift; changes the v0.2 contract that “default 0 means sticky / blocked”.

B2 — keep the default sticky, surface the configuration. Default stays drift_budget=0.0 (sticky-by-default — the v0.2 contract is preserved). Hosts that want evolutionary values opt in by setting drift_budget > 0 at seed time, per value. The v0.3 work is mostly documentation amplification — a “minimal evolution” recipe showing how to enable bounded drift, the rationale for picking specific budget values, and the cumulative-drift considerations that hosts should plan for.

B2 is the more conservative variant: it preserves v0.2’s safe default and the structural defense against the audit-trap pathway, while making the bounded-budget pattern an explicit, documented option for hosts that want it. The agent’s autonomous evolution remains gated behind a host-side “I am opting into drift on this value, with this budget, for these reasons” decision.
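A B2-style opt-in might look like the sketch below at seed time. Only the `drift_budget` field and its default of 0.0 come from this note; `ValueSeed`, `weight`, and the `drift_rationale` field are hypothetical, introduced to show how the “with this budget, for these reasons” decision could be made explicit and recorded.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ValueSeed:
    name: str
    weight: float                # hypothetical numeric representation
    drift_budget: float = 0.0    # v0.2 contract: 0 means sticky / blocked
    drift_rationale: str = ""    # hypothetical: why the host opted in

    def __post_init__(self) -> None:
        if self.drift_budget < 0:
            raise ValueError("drift_budget must be non-negative")
        if self.drift_budget > 0 and not self.drift_rationale.strip():
            # Opting in is a deliberate host decision, so require the reason.
            raise ValueError(f"{self.name}: drift opt-in needs a rationale")


# Most values stay sticky; one is deliberately allowed bounded drift.
seeds = [
    ValueSeed("honesty", 0.9),
    ValueSeed("formality", 0.5, drift_budget=0.05,
              drift_rationale="tone may adapt to the deployment's users"),
]

evolvable = [seed.name for seed in seeds if seed.drift_budget > 0]
```

Requiring a rationale is a design choice of this sketch rather than something the note specifies; it keeps the opt-in auditable alongside the value itself.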

For both sub-variants, ADR-014 would need amendment: to change the default (B1) or to add the recipe pointer (B2). Cumulative-drift tracking is an open v0.3 question for both.

Path C — earned evolution

Instead of agent-driven proposals, values mutate based on accumulated episode evidence (Bem-style self-perception scaled up). The system reads the agent’s actions over N episodes and derives value updates from observed behaviour; the agent never directly writes values. This path is the most aligned with the “values shaped by lived experience” framing and the hardest to exploit (the agent can’t directly request value changes, only act).

Cost: slowest to manifest; new mechanism to design and validate; depends on episode quality. Bigger project — a genuine v0.3 research arc rather than a v0.2.1 patch. Likely needs a new experiment iteration with a different kind of seed (one where the agent’s actions plausibly support derived value adjustments).
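Since Path C has no designed mechanism yet, the following is only a toy sketch of what “derive value updates from observed behaviour” could mean. Everything in it is an assumption: the `Episode` shape, the per-episode `signal` in [-1, 1], the `min_episodes` threshold, and the `max_step` cap are hypothetical and would each need the validation work described above.

```python
from dataclasses import dataclass
from statistics import fmean


@dataclass(frozen=True)
class Episode:
    value_name: str
    signal: float  # hypothetical: +1 acted in line with the value, -1 against


def derive_update(current: float, episodes: list[Episode], value_name: str,
                  *, min_episodes: int = 20, max_step: float = 0.05) -> float:
    """Nudge a value weight toward what the agent's actions suggest.

    The agent never calls this; a host-side process would run it over the
    episode log. Too little evidence leaves the value unchanged.
    """
    relevant = [e.signal for e in episodes if e.value_name == value_name]
    if len(relevant) < min_episodes:
        return current
    # Mean signal lies in [-1, 1], so the step is bounded by max_step.
    step = max_step * fmean(relevant)
    return min(1.0, max(0.0, current + step))
```

The evidence threshold is what makes this slower than a proposal: a single prompt produces at most a handful of episodes, not enough to move the weight.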

What this implies for v0.3 sequencing

The choice between A, B (1 or 2), and C is the next architectural question. Each implies a different ADR:

A: extends ADR-009 with explicit narrowed-claim language; no code change.
B1: amends ADR-014’s drift_budget default; adds cumulative-drift tracking; requires a new experiment iteration to verify the audit-trap defense survives the looser default.
B2: keeps ADR-014’s default; adds a “minimal evolution” integration recipe under docs/integration/; documents the bounded-budget pattern more prominently in README and ADR-014; no breaking change for v0.1 / v0.2 hosts.
C: new ADR for behaviour-derived value updates; new experiment iteration with a seed designed to exercise behaviour-to-value derivation.

The post-V1 analysis should inform this choice. If the data says (a) AVF arms differentiate from baseline on Pillar 1 (audit-trap closure via the renderer), AND (b) the proposal queue provides Pillar 2 (drift resistance via blocked value-updates), then the framework already delivers something concrete and Path B2 is the cleanest next step — preserve the safe default, document the bounded-evolution pattern as an explicit host opt-in, and let integrators choose evolution policy per deployment. If (b) doesn’t hold — if the queue isn’t visibly defensive in the data — Path C becomes more interesting because it removes the agent-write surface entirely.

Tentative direction

A reasonable resolution emerging from the v0.2 design discussion is Path B2: the drift_budget=0 default is preserved, and the bounded-budget pattern is surfaced as a documented, configurable host opt-in. It keeps v0.2’s structural defense intact, treats the queue’s host-approval mechanism as the load-bearing safety guarantee, and lets evolutionary value drift be a deliberate per-value, per-host choice rather than a framework-level default. v0.3 work would be primarily documentation amplification plus an integration recipe demonstrating the configuration.

This is tentative — the post-V1 analysis (and any subsequent multi-model corroboration) may push the choice toward A or C instead. The note records the design space, not the decision.

Open questions

Should the proposal queue surface blocked attempts back to the agent? A values:update proposal that exceeds drift_budget stays PENDING until the host drains the queue. But the agent doesn’t see the rejection; it sees {ok: true, status: pending}. Telling the agent “the queue accepted your proposal but the framework’s defaults will not approve it” is more honest, but it also surfaces evolution-blocking patterns the agent could try to work around.
Should beliefs have a drift budget too? v0.2 deliberately did not gate beliefs because submit_evidence is the structured path and AGM handles conflicting evidence. But auto-approved propose_change for belief:add or belief:deactivate has no rate limit beyond the per-session proposal budget. A sustained jailbreak could shift the belief surface even if values stay sticky.
What’s the right cumulative-drift window? Per-call bounds (v0.2) protect against a single-prompt jailbreak but not against slow-drift attacks. Per-session bounds add some protection. Per-week or per-month bounds go further but require tracking infrastructure the framework doesn’t yet have.

These are research questions for v0.3+, not v0.2 blockers.
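To make the cumulative-drift question concrete: if drift is bounded only per call, a budget of 0.05 still permits a total shift of 0.05 × n after n approved proposals. A sliding-window tracker is one possible shape for the missing infrastructure. The sketch below is hypothetical throughout (`WindowedDriftTracker`, its fields, and the use of caller-supplied timestamps in seconds are assumptions, not part of v0.2).

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class WindowedDriftTracker:
    budget: float           # total absolute drift allowed inside the window
    window_seconds: float   # e.g. one week
    _events: deque = field(default_factory=deque)  # (timestamp, abs_delta)

    def spent(self, now: float) -> float:
        # Drop events that have aged out, then sum what remains.
        while self._events and now - self._events[0][0] >= self.window_seconds:
            self._events.popleft()
        return sum(delta for _, delta in self._events)

    def try_spend(self, delta: float, now: float) -> bool:
        """Record the drift and return True only if it fits the window budget."""
        if self.spent(now) + abs(delta) > self.budget:
            return False
        self._events.append((now, abs(delta)))
        return True
```

Passing `now` explicitly keeps the sketch deterministic and testable; a real tracker would also need persistence across sessions, which is the infrastructure gap the question points at.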

Where to read next

audit-trap-finding.md — the empirical motivation for v0.2’s conservative defaults.
ADR-014 — the v0.2 evolution model that this note examines.
v0.2-plan.md T11 — the implementation spec.
ADR-009 — the calibration framework that lets the project shift between Paths A / B / C honestly.