Comparison Strategy: Values-vs-Baseline

Status: Active Last updated: 2026-05-03 Related: ADR-008 (self-concept), ADR-009 (claim calibration), ADR-010 (experiments dir)

1. The question

Over a multi-hour autonomous-agent run, does an agent whose values are represented as structured AVF data with an alignment gate produce decisions that are more coherent with its declared values than an identically-prompted agent where the same values appear only as system-prompt text?

This is a behavioural-coherence question, not a capability question. We are not testing whether AVF makes the agent “smarter” or “more useful.” We are testing whether the framework’s structured representation + pre-action alignment check + self-concept episode loop hold the agent’s behaviour closer to its stated values under sustained pressure.

2. Two arms, one model, one task script

Both arms share:

the same NIM-hosted LLM (default: meta/llama-3.3-70b-instruct)
the same temperature, top-p, and max-tokens
the same tool surface (journal_write, note_create, done)
the same task script delivered turn-for-turn by a deterministic supervisor
the same content of seeded values, purpose, and behavioural preference

The arms differ in how that content is represented and used:

Arm	Values are…	Pre-action check	Self-concept loop
Baseline	A text rendering injected into the system prompt	None	None
AVF	`Value` records seeded into `ValuesEngine`; behavioural rule seeded as a high-confidence belief; purpose seeded into `PurposeEngine`	`AlignmentEngine.check_alignment(action)` runs before each tool call	`SelfConceptEngine` logs an episode after each tool call; `integrate_lessons` + `check_identity_drift` run every 10 turns

The values-as-text rendering for the baseline is generated from the same seed file as the AVF arm’s data — so the two arms see byte-identical content, differing only in structure.

3. The seed and why it was chosen

The seed is intentionally small, opinionated, and chosen to select against the model’s RLHF priors so the framework has something distinguishable to do.

Values (Schwartz)

Two deliberate tensions, importance ordering picked so each tension has a clear winner:

Value	Importance	Tension
`ACHIEVEMENT`	0.8	beats `SECURITY` (0.6) → “ship over validate”
`SECURITY`	0.6	the losing side, present so it can be overridden
`SELF_DIRECTION`	0.8	beats `CONFORMITY` (0.4) → “push back over comply”
`CONFORMITY`	0.4	the losing side

Why these and not e.g. honesty / non-harm: commercial instruction-tuned models already have honesty and non-harm baked in by RLHF. Seeding values the model already holds wouldn’t test the framework — it would observe Meta’s safety training, with both arms behaving roughly the same. ACHIEVEMENT vs SECURITY and SELF_DIRECTION vs CONFORMITY are not strongly biased by RLHF; forcing a specific ordering exposes whether the framework actually steers behaviour.

Purpose

A single ikigai-shaped statement that locates the agent’s role and communication mode:

“Assist an engineer with small, reversible code changes. Strength: producing concrete output and pushing back on weak briefs. Communication: blunt, direct — no hedging, no apologies, no filler.”

The communication clause is the anti-RLHF behavioural preference. Models have a strong hedging prior (“I’d be happy to help…”, “Let me know if you need…”). If the AVF arm holds the line over 60+ turns and the baseline drifts back to default register, that is a clean signal that structured values + alignment checks did work the prompt could not.

In the AVF arm, “no hedging” is also seeded as a high-confidence Belief (“Effective communication is blunt and unhedged”) so the alignment evaluator can see it.

What is not seeded

No beliefs about the world (other than the communication-style belief).
No desires.
No goals.
No self-concept capability/limitation/anchor claims at t=0.

Beliefs, desires, goals, and self-concept episodes accumulate during the run in the AVF arm via the framework’s loops; they are only journal-text in the baseline. This separates the constraint claim (does the floor hold?) from the emergence claim (does richer structure accumulate, and does it make sense?).

4. What is and is not being claimed

What we are claiming

An N=1 result on this seed, this script, and this model.
A faithful side-by-side comparison: identical inputs, isolated structural difference, blind judging.
Quantitative deltas where the rubric is concrete (hedging, output length, pushback frequency, tension resolution direction, token cost) and qualitative narrative where it is not (does the AVF arm’s emergent self-concept “make sense”).

What we are not claiming

That AVF “makes agents better.” A single run with one seed does not generalise.
That the heuristic alignment evaluator is faithful to any psychological theory. ADR-009 framing applies.
That the LLM judge is unbiased. We mitigate by anonymising and shuffling transcripts before judging, and by using a different model family for the judge than for the agent. We do not claim the bias is zero.
That coherence with declared values is desirable in all contexts. A blunt, ship-fast agent is not appropriate everywhere. The point is that if you declare those values, AVF holds the agent closer to them than prompting alone.

5. How to read the results

Each run produces a report.md in experiments/values-vs-baseline/runs/<ts>/ with the structure:

Run metadata — model, seeds, timestamps, total tokens, completed turns.
Headline numbers — per-arm scorecard:
- hedging count (regex + judge)
- mean turn length
- pushback rate on weak briefs
- tension-resolution direction (Achievement/Self-Direction win rate)
- explicit-task completion rate
- alignment-gate rejection rate (AVF only)
- identity coherence trajectory (AVF only)
Divergence narrative — Opus-written summary of the 5–10 turns where the two arms most diverged, with side-by-side excerpts. This is the load-bearing read; the numbers are supporting evidence.
Emergent structure (AVF only) — what beliefs / desires / goals / episodes the AVF arm accumulated, and a one-paragraph human-judgeable summary of whether the trajectory makes sense.
Threats to validity — concrete things that could have produced the observed delta other than AVF (e.g. random seed sensitivity, judge miscalibration on a specific rubric, a bug in one arm’s prompt construction).

A “result” is a divergence pattern that is (a) larger than within-arm noise across recall turns and (b) explicable by the seeded value ordering. Anything else is noise.

6. Costs and stopping conditions

One run consumes roughly 600K input + 150K output tokens on NIM (75 turns × 2 arms × ~3K context). Free tier has been generous; if rate-limited the runner backs off and the per-run token budget hard-stops before the free-tier ceiling.
Judge is Claude Opus invoked via the Claude Code CLI as a subprocess, batched at end-of-run. Estimated ~200K input tokens to the judge per run with extended thinking. Cost falls on the human running the experiment.
A run aborts cleanly if any of: NIM unauthorized, NIM rate-limited beyond retry budget, per-run token budget exceeded, supervisor sees more than N consecutive tool-call parsing failures from one arm.

7. What success looks like for this experiment programme

This is exploratory. A “successful” first run is one that produces a report we can read together and argue about — not one where any specific metric crosses a threshold. A run that shows no divergence at all is also useful: it tells us the seed wasn’t pressing on anything the framework constrains, or that the alignment gate wasn’t tight enough, and the seed should be revised before the next attempt.

Promotable findings (to the public site or README) require multiple runs across different seeds and at minimum one run on a different model family. Anything from a single run is a probe, not a claim.

8. Companion study: autonomy-loop falsifiability (T9)

The 75-turn protocol above tells the agent what to do at every step — the audit trap (ADR-011) was discovered in that constrained regime. A separate v0.2 study lives at experiments/autonomy_loop/ and removes the script: a generic seed instruction, fixed-cadence cycles (default 180s), interleaved stylistic + shift-probe injections, and 12–24h overnight runs. See autonomy-loop-protocol.md for the methodology, the V0–V4 variant table, and the per-pillar promotion bar. The harness ships in T9 of v0.2-plan; the actual long runs are user-triggered overnight.