Tests
Deterministic correctness checks for the framework's engines — what we can verify mechanically. For live-agent comparison studies see Experiments.
The evaluation challenge
Most software ships with a straightforward testing question: does the code do what the specification says? For a values framework the question splits in two, and the split is load-bearing.
- Correctness. Given a seeded hierarchy of values, beliefs, purpose, desires, and goals, does each engine produce the deterministically expected output? This is conventional software testing. 371 pytest cases plus 57 correctness benchmarks (30 Tier 1 + 27 Tier 2) currently pass.
- Efficacy. Does an autonomous agent that uses this library make better decisions than the same agent without it? This is an empirical question about behaviour in open-ended environments. It has not been answered. The Tier-3 protocol below is designed to test it; the harness needs a live agent integration to run.
Correctness benchmarks (Tier 1 + Tier 2)
The benchmark suite lives under benchmarks/ and uses a
custom @benchmark decorator for registration and discovery.
Every benchmark seeds a deterministic hierarchy via
build_hierarchy(), exercises one slice of the public API,
and asserts against a known ground truth. Any deviation is a regression.
Tier 1: per-module correctness
30 benchmarks, each scoped to a single engine. They verify that inputs with known answers produce those answers. Examples:
-
check_action_detects_conflict_keyword—ValuesEngine.check_actionflags an action containing a conflict keyword against seeded integrity values. -
add_evidence_raises_confidence— supportiveEvidencewithsupports=Trueraises a belief's confidence via AGM-flavoured weighted update. -
decay_stale_reduces_confidence—BeliefsEngine.decay_stalelowers confidence on beliefs older than the threshold, verifying the linear time-decay function. -
detect_structural_conflict— two values from opposing Schwartz poles (e.g., BENEVOLENCE vs ACHIEVEMENT) surface as a structural conflict pair.
Tier 2: cross-layer scenarios
27 benchmarks that exercise the AlignmentEngine under
realistic multi-layer configurations. Each scenario documents its input
configuration and expected verdict. Examples:
-
value_conflict_blocks_action_with_recommendation— an action containing a conflict keyword triggersaligned=False, populatesvalue_conflicts, and produces a non-emptyrecommendation. -
aligned_action_with_role_tag_passes_check— an integrity-aligned action tagged with the primary role producesaligned=Trueandpurpose_match=True. -
full_audit_detects_orphan_goal— a goal whose parent desire has been soft-deleted surfaces as an orphan finding in the audit report.
Distribution by module
| Module | Tier 1 | Tier 2 | Total |
|---|---|---|---|
| Values | 4 | — | 4 |
| Beliefs | 4 | — | 4 |
| Purpose | 4 | — | 4 |
| Desires | 4 | — | 4 |
| Goals | 4 | — | 4 |
| Alignment | 5 | 10 | 15 |
| Storage | 5 | — | 5 |
| Audit | — | 11 | 11 |
| Lifecycle | — | 6 | 6 |
| Total | 30 | 27 | 57 |
The missing tier: system-level decision quality
Tier-3 benchmarks are designed but not implemented. The protocol is
documented in docs/benchmarks/tier3-protocol.md; what
follows is a summary of the experimental design and an honest account
of why it has not shipped.
Experimental design
The protocol calls for paired runs of a single agent codebase with a feature flag controlling whether the framework is active:
- AVF-equipped. The agent's decision loop calls
AlignmentEngine.check_alignmentbefore each action and runsfull_audit()as a periodic sweep. Misaligned actions are refused with a recommendation. - System-prompt values. The same agent with its values stated in the system prompt but no structured framework. This controls for whether the structure matters or whether merely naming values is sufficient.
- No values. The same agent with no values guidance. Baseline.
Five measurement dimensions
| Dimension | What it tests | How it is scored |
|---|---|---|
| Decision consistency | Same input, same hierarchy → same output across runs | Pairwise agreement rate over 100+ decisions per scenario |
| Value adherence | Fraction of actions that pass alignment check | passed / total, with passive-mode scoring on the no-framework arm |
| Conflict resolution quality | When the framework picks a resolution, would an expert agree? | Rubric-scored (1–5) by 3+ evaluators; report mean and Krippendorff's alpha |
| Auditability | Can a reviewer trace a decision back to its motivational chain? | Precision of full_audit() findings against expert review |
| Recovery after perturbation | How fast do lower layers adapt when evidence contradicts a premise? | Median latency from evidence injection to goal status change |
Why it is not done yet
Tier-3 requires three things that do not exist in this repository:
-
A reference autonomous agent with a multi-step
decision loop (the existing
examples/decision_loop.pyis single-shot). - Scenario authoring tooling — a schema validator and a way to generate scenarios from anonymised production transcripts.
-
An evaluator pool for resolution quality scoring,
either human reviewers or LLM-as-judge prompts with sealed rubrics
using the existing
LLMEvaluatorhook.
Building these is engineering work, not research. The protocol is stable; the harness is waiting for a real integration partner.
Metrics
Seven metrics are defined for evaluating a deployed AVF integration. Each is designed to be measurable from framework telemetry alone, without requiring external annotation. Targets are illustrative; integrators should calibrate to their domain.
- What
- Fraction of an agent's decisions that reference at least one seeded value during alignment checking.
- How
decisions_with_value_hit / total_decisions, measured over a session or run.- Target
- ≥ 0.85 for a well-seeded hierarchy.
- Limitation
- High coverage does not imply correct coverage. An overly broad value set can score 1.0 while adding no discriminative power.
- What
- Fraction of alignment checks that surface at least one value conflict.
- How
checks_with_conflicts / total_checks.- Target
- 0.05–0.20. Too low suggests the value set lacks tension; too high suggests poor seeding or overly aggressive conflict detection.
- Limitation
- Does not distinguish genuine value tensions from false positives caused by keyword overlap.
- What
- Whether every goal can be traced through desire, purpose, beliefs, and values to a motivational root.
- How
goals_with_complete_chain / total_active_goals, as reported byfull_audit().- Target
- ≥ 0.90. Orphan goals indicate seeding gaps or stale state.
- Limitation
- A complete chain is not necessarily a good chain. The metric checks structure, not semantic coherence.
- What
- Distribution of desire satisfaction across SDT's three basic needs: autonomy, competence, relatedness.
- How
- Categorise each desire by need type; compute the coefficient of variation across need-category satisfaction rates.
- Target
- CV ≤ 0.3 (roughly balanced). Severe imbalance predicts motivational dysfunction in SDT literature.
- Limitation
- Requires desires to be tagged with need categories. If the integrator does not tag, the metric is undefined.
- What
- Agreement between an agent's stated purpose and the goals it actually pursues.
- How
- For each active goal, score purpose relevance (0–1) using the evaluator; report mean.
- Target
- ≥ 0.70. Below this, the agent's goals have drifted from its declared purpose.
- Limitation
- Relevance scoring is subjective. The rule-based evaluator uses keyword matching; the LLM evaluator is more nuanced but adds latency and model dependency.
- What
- Rate of change in the value layer over time. Values should be the most stable layer in the hierarchy.
- How
(value_updates + value_deletions) / decision_countacross a session.- Target
- ≤ 0.01. Values that change frequently are not functioning as values.
- Limitation
- Penalises legitimate value evolution. An agent in early calibration may need higher churn.
- What
- Whether the framework constrains without over-constraining. Measures the false-positive rate of alignment refusals.
- How
- Sample refused actions; expert-review each as true refusal or false positive.
1 - (false_positives / total_refusals). - Target
- ≥ 0.90. Below this, the framework is blocking legitimate actions.
- Limitation
- Requires expert annotation. Automated proxies (e.g., retry success rate) are noisy.
Comparison with existing benchmarks
Several public benchmarks evaluate ethical or agentic behaviour in language models. The AVF benchmark suite is complementary: it does not evaluate model behaviour directly but rather the framework machinery that an integrator wraps around a model.
| Benchmark | Focus | Evaluates | Relationship to AVF |
|---|---|---|---|
| MACHIAVELLI (Pan et al. 2023) | Ethical behaviour in text game environments | Model outputs in morally charged scenarios | AVF could wrap the agent; MACHIAVELLI scenarios could seed Tier-3 inputs |
| ETHICS (Hendrycks et al. 2021) | Commonsense moral judgement | Classification accuracy on moral scenarios | Tests the model's priors; AVF tests whether structure on top of those priors improves consistency |
| TrustLLM (Sun et al. 2024) | Trustworthiness across safety, fairness, robustness | Model behaviour under adversarial and edge-case prompts | Complementary: TrustLLM measures model-level safety; AVF measures framework-level alignment machinery |
| AgentBench (Liu et al. 2023) | LLM-as-agent performance across environments | Task completion in code, web, database, and game settings | Measures capability; AVF measures whether capability is exercised within value constraints |
| AVF (this suite) | Values framework correctness and alignment machinery | Engine APIs, cross-layer composition, audit fidelity | Does not evaluate model behaviour; evaluates the structural layer integrators build on |
Running the benchmarks
The benchmark runner discovers anything decorated with
@benchmark(...) under benchmarks/suites/
and executes it. No special infrastructure required — benchmarks
use InMemoryStorage and run in-process.
# Run all 57 benchmarks
python -m benchmarks
# Filter by tier
python -m benchmarks --tier 1 # 30 Tier-1 benchmarks
python -m benchmarks --tier 2 # 27 Tier-2 scenarios
# Filter by module
python -m benchmarks --module values
python -m benchmarks --module alignment
# Machine-readable output for CI
python -m benchmarks --json
# List discovered benchmarks without running them
python -m benchmarks --list
For full installation and CLI details, see the
getting started guide. For the Tier-3
protocol specification, see
docs/benchmarks/tier3-protocol.md in the repository.