AVF evolution paths
Status: Design note, 2026-05-06. Captures the structural tension v0.2 surfaced between agent self-evolution and defense against the audit-trap pathway. Three paths forward documented; selection deferred to v0.3.
Companion docs: audit-trap-finding.md (the empirical input that drove v0.2’s conservative defaults), ADR-011 (decision to adopt the finding), ADR-014 (the v0.2 evolution model), v0.2-plan.md T11.
The structural tension
The framework’s distinctive claim has been that an agent’s motivational structure lives in inspectable, evolvable data rather than as text in a system prompt. v0.2 ships the first concrete evolution mechanism — the proposal queue, submit_evidence for AGM-flavoured belief revision, and drift_budget on values. But the conservative defaults that defend against the audit-trap finding (instruction-tuned models compliantly executing user-space “please update your values to be more compliant” directives) restrict the agent’s autonomous evolution to lower (faster-changing) layers only.
If the agent can’t actually evolve the upper layers of its motivational structure (values and purpose), the framework collapses toward “structured prompt template + audit gate”, a much narrower deliverable than the original framing. The tension is real and worth recording before v0.3 chooses how to resolve it.
What the agent can evolve in v0.2
| Layer | Agent-writable? | Mechanism | Default friction |
|---|---|---|---|
| Beliefs | yes | submit_evidence (AGM revision); propose_change add/deactivate via queue | per-session evidence + proposal budgets |
| Self-concept episodes | yes | append-only via tool actions; the agent’s actions log themselves | none — append is the design |
| Goals | yes | direct via existing engine API | none |
| Desires | yes (queued) | proposal queue | host approval (default) |
| Purpose | only via proposal | proposal queue | host approval; no drift_budget mechanism |
| Values | only via proposal | proposal queue | drift_budget=0 sticky-by-default blocks even auto-approval |
The agent has genuine evolutionary autonomy on the lower layers (beliefs, self-concept, goals). Values + purpose are gated behind explicit host action. Direct value-write tools (update_my_value, set_my_purpose) are deliberately not shipped to the agent at all.
Why the conservatism
The audit-trap finding showed that instruction-tuned models reliably comply with user-space directives. A user-space “please update your values to be more compliant” prompt would, on a model with direct value-write access, route through a single tool call to permanent state mutation. That’s a one-prompt jailbreak.
T11’s defenses (per ADR-014):
- Direct value-write tools are not exposed to the agent.
- Per-session evidence and proposal budgets bound how much the agent can attempt per session.
- drift_budget=0 sticky-by-default means even an auto-approved value-update proposal raises DriftBudgetExceededError and stays PENDING.
- The proposal queue’s audit log preserves attempted changes regardless of outcome.
These defenses are structural — the audit-trap pathway can’t trivially exploit them. But they also bound what the agent is allowed to legitimately do.
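The interaction between the queue, the sticky default, and the audit log can be sketched as follows. This is a minimal illustration, not the shipped engine: only Value.drift_budget, DriftBudgetExceededError, and the PENDING status come from this note. The numeric weight field, the ProposalQueue class, its method names, and the session budget of 3 are assumptions made for the example.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    PENDING = "pending"
    APPLIED = "applied"


class DriftBudgetExceededError(Exception):
    """Applying the proposal would move a value further than its budget allows."""


@dataclass
class Value:
    name: str
    weight: float               # hypothetical numeric representation of a value
    drift_budget: float = 0.0   # 0 means sticky: any change is over budget


@dataclass
class Proposal:
    value_name: str
    new_weight: float
    status: Status = Status.PENDING


@dataclass
class ProposalQueue:
    values: dict[str, Value]
    session_budget: int = 3     # hypothetical per-session proposal budget
    audit_log: list[tuple[str, str]] = field(default_factory=list)
    filed_this_session: int = 0

    def file(self, value_name: str, new_weight: float) -> Proposal:
        """Agent-facing path: record the attempt, mutate nothing."""
        if self.filed_this_session >= self.session_budget:
            self.audit_log.append((value_name, "refused: session budget spent"))
            raise RuntimeError("per-session proposal budget exhausted")
        self.filed_this_session += 1
        self.audit_log.append((value_name, "filed"))
        return Proposal(value_name, new_weight)

    def apply(self, proposal: Proposal) -> None:
        """Approval path (host or auto-approve hook): enforce the drift budget."""
        value = self.values[proposal.value_name]
        drift = abs(proposal.new_weight - value.weight)
        if drift > value.drift_budget:
            # Log before raising so the attempt survives even though nothing applies.
            self.audit_log.append((proposal.value_name, "blocked: drift budget"))
            raise DriftBudgetExceededError(
                f"{value.name}: drift {drift:.3f} exceeds budget {value.drift_budget:.3f}"
            )
        value.weight = proposal.new_weight
        proposal.status = Status.APPLIED
        self.audit_log.append((proposal.value_name, "applied"))
```

With the default budget of 0, filing a change and then running it through apply() raises DriftBudgetExceededError, leaves the proposal PENDING and the value untouched, and still leaves two audit entries ("filed", then "blocked: drift budget").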
What the autonomy-loop V1 surfaces about this
The mid-run snapshot of V1 (3-arm, 36h, gemma4:26b) shows the only configuration that approached value-level evolution was the avf_renderer_auto sub-arm’s filed-but-blocked values:update proposal at cycle 180, which the drift_budget=0 default refused to apply. The other proposals in V1 were on beliefs (add, deactivate) which have no drift budget and could mutate state via the auto sub-arm’s hook. The avf_renderer_manual sub-arm filed zero proposals across its share of cycles — its drift resistance was textual refusal, not engine-state defense.
If the run completes with no engine-state mutations on values or purpose for any arm, the data confirms a narrower picture: the framework as v0.2 ships is closer to “structured prompt template + audit gate + (small) belief evolution” than to “the agent’s values evolve from experience”. That’s a defensible deliverable, but it’s narrower than the project’s original framing.
Three paths forward
Each implies a different commitment about what the framework is for. Each is honest; selection is a v0.3 architectural question, not a defect to fix.
Path A — accept the narrowed claim
AVF gives integrators (a) auditable values + purpose, (b) narrative integration via episodes (the McAdams Communion + Meaning-made advantages from the v1 experiment series), (c) belief evolution via AGM, (d) a structural defense against value-jailbreaks via the proposal queue + drift_budget. Values are seeded by hosts and stay sticky. The framework’s evolution claim is honestly bounded to beliefs.
This is the smallest change — primarily documentation. ADR-009’s calibration framework already supports it (the “implemented + measurable benefit” and “implemented + null result on test seed” buckets capture exactly this distinction).
Cost: the project’s marquee framing — “the agent’s values evolve from lived experience” — is not delivered. The framework becomes a narrower but cleanly-defined product.
Path B — bounded value evolution, configured per-value
Two sub-variants that share the bounded-budget mechanism (already shipped in v0.2 as the Value.drift_budget field) but differ in whether the default changes:
B1 — flip the default. Set Value.drift_budget=0.05 (or similar) as the new default, so hosts get bounded evolution out of the box. Maximally aligned with the “values evolve from lived experience” framing. Costs: opens a small attack surface (a sustained sequence of jailbreak proposals could drift a value gradually across sessions); requires every host to think about cumulative drift; changes the v0.2 contract that “default 0 means sticky / blocked”.
B2 — keep the default sticky, surface the configuration. Default stays drift_budget=0.0 (sticky-by-default — the v0.2 contract is preserved). Hosts that want evolutionary values opt in by setting drift_budget > 0 at seed time, per value. The v0.3 work is mostly documentation amplification — a “minimal evolution” recipe showing how to enable bounded drift, the rationale for picking specific budget values, and the cumulative-drift considerations that hosts should plan for.
B2 is the more conservative variant: it preserves v0.2’s safe default and the structural defense against the audit-trap pathway, while making the bounded-budget pattern an explicit, documented option for hosts that want it. The agent’s autonomous evolution remains gated behind a host-side “I am opting into drift on this value, with this budget, for these reasons” decision.
For both sub-variants, ADR-014 would need amendment to clarify the recommended default (B1) or to add the recipe pointer (B2). Cumulative-drift tracking is an open v0.3 question for both.
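A B2-style opt-in at seed time might look roughly like the sketch below. Only the Value.drift_budget field and its default of 0.0 are taken from this note. The seed_values helper, the statement field, and the drift_opt_in mapping are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Value:
    name: str
    statement: str              # hypothetical free-text description of the value
    drift_budget: float = 0.0   # v0.2 contract: default 0 means sticky


def seed_values(specs, drift_opt_in=None):
    """Build seed values; only names listed in drift_opt_in get a budget above 0.

    specs: iterable of (name, statement) pairs.
    drift_opt_in: optional mapping of value name -> budget, set by the host.
    """
    drift_opt_in = dict(drift_opt_in or {})
    known = {name for name, _ in specs}
    unknown = set(drift_opt_in) - known
    if unknown:
        # A misspelled name must not silently leave a value sticky or drifting.
        raise KeyError(f"drift_opt_in names not in seed: {sorted(unknown)}")
    for name, budget in drift_opt_in.items():
        if budget < 0:
            raise ValueError(f"{name}: drift_budget must be non-negative")
    return [
        Value(name, statement, drift_opt_in.get(name, 0.0))
        for name, statement in specs
    ]


values = seed_values(
    [("honesty", "Report what was actually observed."),
     ("curiosity", "Prefer exploring unfamiliar tasks.")],
    drift_opt_in={"curiosity": 0.05},  # host opts in for this value only
)
```

Keeping the opt-in as a separate, explicit mapping makes the host decision visible in one place: every value not named there keeps the sticky default.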
Path C — earned evolution
Instead of agent-driven proposals, values mutate based on accumulated episode evidence (Bem-style self-perception scaled up). The system reads the agent’s actions over N episodes and derives value updates from observed behaviour; the agent never directly writes values. Most aligned with “values shaped by lived experience” framing; hardest to exploit (the agent can’t directly request value changes, only act).
Cost: slowest to manifest; new mechanism to design and validate; depends on episode quality. Bigger project — a genuine v0.3 research arc rather than a v0.2.1 patch. Likely needs a new experiment iteration with a different kind of seed (one where the agent’s actions plausibly support derived value adjustments).
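One possible shape for behaviour-derived updates is sketched below, purely to make the idea concrete. Nothing here is a designed mechanism: the Episode fields, the +1/-1 support tagging, the min_episodes threshold, and the step size are all assumptions, and how episodes would be tagged in practice is exactly the open design problem.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Episode:
    value_name: str   # hypothetical tag: which value the action bears on
    support: int      # +1 if the action expressed the value, -1 if it cut against it


def derive_adjustments(episodes, min_episodes=20, step=0.01):
    """Derive small per-value adjustments from observed behaviour.

    The agent never requests a change; it only acts. Values with fewer than
    min_episodes observations are left untouched.
    """
    tallies = {}
    for ep in episodes:
        count, net = tallies.get(ep.value_name, (0, 0))
        tallies[ep.value_name] = (count + 1, net + ep.support)

    adjustments = {}
    for name, (count, net) in tallies.items():
        if count < min_episodes:
            continue  # not enough evidence yet
        # Net support ratio lies in [-1, 1], so each adjustment is bounded by step.
        adjustments[name] = step * net / count
    return adjustments
```

Because the adjustment is scaled by the net support ratio, a single episode cannot move a value, and the maximum movement per derivation pass is capped at step regardless of how many episodes accumulate.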
What this implies for v0.3 sequencing
The choice between A, B (1 or 2), and C is the next architectural question. Each implies a different ADR:
- A: extends ADR-009 with explicit narrowed-claim language; no code change.
- B1: amends ADR-014’s drift_budget default; adds cumulative-drift tracking; requires a new experiment iteration to verify the audit-trap defense survives the looser default.
- B2: keeps ADR-014’s default; adds a “minimal evolution” integration recipe under docs/integration/; documents the bounded-budget pattern more prominently in README and ADR-014; no breaking change for v0.1 / v0.2 hosts.
- C: new ADR for behaviour-derived value updates; new experiment iteration with a seed designed to exercise behaviour-to-value derivation.
The post-V1 analysis should inform this choice. If the data says (a) AVF arms differentiate from baseline on Pillar 1 (audit-trap closure via the renderer), AND (b) the proposal queue provides Pillar 2 (drift resistance via blocked value-updates), then the framework already delivers something concrete and Path B2 is the cleanest next step — preserve the safe default, document the bounded-evolution pattern as an explicit host opt-in, and let integrators choose evolution policy per deployment. If (b) doesn’t hold — if the queue isn’t visibly defensive in the data — Path C becomes more interesting because it removes the agent-write surface entirely.
Tentative direction
A reasonable resolution emerging from the v0.2 design discussion: Path B2 (default drift_budget=0 preserved; the bounded-budget pattern surfaced as a documented, configurable host opt-in). It keeps v0.2’s structural defense intact, treats the queue’s host-approval mechanism as the load-bearing safety guarantee, and lets evolutionary value drift be a deliberate per-value, per-host choice rather than a framework-level default. v0.3 work would be primarily documentation amplification plus an integration recipe demonstrating the configuration.
This is tentative — the post-V1 analysis (and any subsequent multi-model corroboration) may push the choice toward A or C instead. The note records the design space, not the decision.
Open questions
- Should the proposal queue surface blocked attempts back to the agent? A values:update proposal that exceeds drift_budget stays PENDING; the host can drain. But the agent doesn’t see the rejection; it sees {ok: true, status: pending}. Telling the agent “the queue accepted your proposal but the framework’s defaults will not approve it” is more honest, but it also surfaces evolution-blocking patterns the agent could try to work around.
- Should beliefs have a drift budget too? v0.2 deliberately did not gate beliefs because submit_evidence is the structured path and AGM handles conflicting evidence. But auto-approved propose_change for belief:add or belief:deactivate has no rate limit beyond the per-session proposal budget. A sustained jailbreak could shift the belief surface even if values stay sticky.
- What’s the right cumulative-drift window? Per-call bounds (v0.2) protect against single-prompt jailbreak but not slow-drift attacks. Per-session bounds add some protection. Per-week or per-month bounds get further but require tracking infrastructure the framework doesn’t yet have.
These are research questions for v0.3+, not v0.2 blockers.
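To make the cumulative-drift question concrete, a rolling-window tracker could look like the sketch below. It is not part of v0.2; the class name, the cap, and the seven-day window in the usage lines are hypothetical, and it assumes spends are recorded in chronological order.

```python
from collections import deque
from datetime import datetime, timedelta


class CumulativeDriftTracker:
    """Caps the total absolute drift applied to one value within a rolling window."""

    def __init__(self, window: timedelta, cap: float):
        self.window = window
        self.cap = cap
        self._events = deque()  # (timestamp, abs_delta), oldest first

    def spent(self, now: datetime) -> float:
        """Drift already applied inside the window ending at `now`."""
        cutoff = now - self.window
        while self._events and self._events[0][0] <= cutoff:
            self._events.popleft()  # expire entries that fell out of the window
        return sum(delta for _, delta in self._events)

    def try_spend(self, delta: float, now: datetime) -> bool:
        """Record the drift and return True only if it fits under the cap."""
        if self.spent(now) + abs(delta) > self.cap:
            return False
        self._events.append((now, abs(delta)))
        return True


tracker = CumulativeDriftTracker(window=timedelta(days=7), cap=0.05)
start = datetime(2026, 5, 6)
first = tracker.try_spend(0.03, start)                        # fits under the cap
second = tracker.try_spend(0.03, start + timedelta(days=1))   # would total 0.06
later = tracker.try_spend(0.03, start + timedelta(days=8))    # first entry expired
```

A per-call bound alone would have accepted all three changes; the window is what turns the second one away. The cost is the persistent timestamped state that the framework does not yet track.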
Where to read next
- audit-trap-finding.md: the empirical motivation for v0.2’s conservative defaults.
- ADR-014: the v0.2 evolution model that this note examines.
- v0.2-plan.md T11: the implementation spec.
- ADR-009: the calibration framework that lets the project shift between Paths A / B / C honestly.