Future Research Directions
Where the framework goes next: proposed extensions, open problems, and emerging research that could reshape how agent motivation is modelled.
The six core layers plus the opt-in seventh (self-concept) shipped today are a starting point, not a ceiling. This page catalogues the research directions that look most promising next — some have running code, most are literature pointers and open questions. Each section is explicit about its maturity. Cross-references to the research grounding, the experiments to date, and the motivational case are included where they add context.
Self-concept: the opt-in seventh layer
Self-concept is the newest layer. It ships as an opt-in module with its own engine, episode model, and three integration loops. Calibration: the three loops are deterministic token-overlap heuristics inspired by Bem, SDT's organismic integration theory, and Erikson — diagnostic signals that surface in the full audit and (when wired) on each alignment check. Treat them as a structural place to put identity data, not as a faithful psychological model. The data model and grounding citations are accurate; the inference logic is intentionally simple.
The gap
The existing six layers are normative: they answer what the agent should care about, should believe, should pursue. They do not answer who the agent is — its capabilities, limitations, evolved style, or accumulated history. McAdams' Three Levels of Personality (2013) clarifies the omission: the framework cleanly covers Level 2 (characteristic adaptations — values, beliefs, goals) but is silent on Level 1 (dispositional traits) and Level 3 (narrative identity). For short-lived agents this silence is harmless. For agents that persist across weeks or months, the absence of a self-model produces visible symptoms: prompt-rewrite identity drift, identity-shaped goals that fail to close, and mechanical briefing copy where a narrative thread should be.
Research foundations
The proposal draws on convergent findings from multiple traditions:
- Bem's Self-Perception Theory (1972): people infer their own attitudes by observing their behaviour. An agent can do the same — scanning its action history to detect drift between declared values and actual conduct.
- Erikson's Identity Development (1956, 1968): identity is unity plus continuity. A coherence check can detect when an agent's self-model fragments under contradictory experience.
- Damasio's Layered Selves (2010): proto-self, core self, autobiographical self. The episode stream proposed for Self-Concept maps directly onto the autobiographical layer — an append-only record of events that constitute the agent's history.
- McAdams' Three Levels (2013): traits, characteristic adaptations, narrative identity. The framework covers Level 2 well; Self-Concept adds Levels 1 and 3.
What shipped
The self-concept module, peer to the existing layer engines, provides:
- A self-concept root model: name, kind, capabilities (with confidence decay), limitations, roles, style markers, identity anchors, and a last-integration timestamp.
- An episode model: append-only autobiographical fragments typed as failure, recovery, growth, identity shift, or external event. Each episode links to lessons and values, carries an integration state drawn from Self-Determination Theory's Organismic Integration Theory (introjected, identified, integrated), and optional McAdams narrative codes (redemption, agency, communion, meaning-made).
- Three named processes (a minimal sketch of the first follows this list):
  - A behaviour-inference loop — cited as inspired by Bem self-perception. Reads recent action history and flags drift between declared values and observed conduct.
  - A lesson-integration loop — cited as inspired by Self-Determination Theory's organismic integration. Promotes high-salience lessons into episodes, updates capability and limitation confidences from evidence, and advances integration state where warranted.
  - An identity-drift check — cited as inspired by Erikson coherence. Detects contradictions among declared values, purpose, recent episodes, and the action stream; persists a coherence score.
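To make the calibration note concrete, here is a minimal sketch of the behaviour-inference loop as a deterministic token-overlap heuristic. Every name below (`DeclaredValue`, `DriftSignal`, `infer_value_drift`) is a hypothetical stand-in, not the module's actual API; the shipped engine has the same shape but differs in detail.

```python
from dataclasses import dataclass

def _tokens(text: str) -> set[str]:
    """Lowercase word set; the whole heuristic rests on overlap of these."""
    return set(text.lower().split())

@dataclass
class DeclaredValue:
    name: str
    description: str

@dataclass
class DriftSignal:
    value: str
    overlap: float   # 0.0 means no lexical trace of the value in recent conduct
    drifting: bool

def infer_value_drift(values: list[DeclaredValue],
                      recent_actions: list[str],
                      threshold: float = 0.1) -> list[DriftSignal]:
    """Bem-style self-perception reduced to token overlap.

    For each declared value, measure how much of its vocabulary appears in
    the recent action log; a value whose language never shows up in conduct
    is flagged as drifting. A diagnostic signal, not a verdict.
    """
    conduct = _tokens(" ".join(recent_actions))
    signals = []
    for v in values:
        vocab = _tokens(f"{v.name} {v.description}")
        overlap = len(vocab & conduct) / len(vocab) if vocab else 0.0
        signals.append(DriftSignal(v.name, overlap, overlap < threshold))
    return signals
```

The lexical nature of the inference is exactly why the calibration note above warns against reading these loops as a faithful psychological model.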
The alignment composer accepts an optional self-concept parameter. When provided, the full-audit sweep calls the identity-drift check and surfaces the coherence score and drift signals in the audit report; the same engine can also expose a self-concept consistency axis on each alignment check via the evaluator. The entire layer is opt-in: agents that omit it use the original six-layer behaviour unchanged.
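The opt-in wiring is easy to show in miniature. Everything below is a hypothetical stand-in (the real composer, report, and self-concept types differ); the point is only that omitting the parameter leaves the six-layer behaviour unchanged.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SelfConcept:
    name: str
    coherence_score: float = 1.0   # persisted by the identity-drift check

@dataclass
class AuditReport:
    passed: bool
    coherence_score: Optional[float] = None   # present only when the layer is wired

class AlignmentComposer:
    def __init__(self, self_concept: Optional[SelfConcept] = None):
        self.self_concept = self_concept   # None keeps the original six layers

    def full_audit(self) -> AuditReport:
        if self.self_concept is None:
            return AuditReport(passed=True)   # six-layer sweep only
        # Seventh-layer sweep: surface the identity-drift check's score.
        return AuditReport(passed=True,
                           coherence_score=self.self_concept.coherence_score)

print(AlignmentComposer().full_audit())                              # six-layer agent
print(AlignmentComposer(SelfConcept("agent-7", 0.82)).full_audit())  # opted in
```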
Active inference as a unifying framework
Karl Friston's Free Energy Principle (Friston 2010; Parr, Pezzulo & Friston 2022) posits that all adaptive behaviour can be understood as minimising variational free energy — equivalently, as minimising prediction error about the world and the self. Active inference, the process theory derived from this principle, provides a single mathematical framework in which perception, action, learning, and planning are all instances of the same operation: updating a generative model to reduce surprise.
The potential relevance to the Agent Values Framework is significant. Today, each of the six layers uses its own update mechanism: AGM-style belief revision, SDT-informed purpose refinement, BDI-style desire-to-goal commitment, and so on. Active inference could, in principle, unify all of these under a single variational objective. Values become strong priors that resist updating. Beliefs become the generative model's posterior. Purpose and desires become preferred future states (prior preferences in active-inference terminology). Goals become policies selected to minimise expected free energy. The stability gradient — the central architectural insight of the framework — would emerge naturally from differences in prior precision across layers, rather than being hand-coded.
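For reference, the two standard quantities involved, in the textbook notation of Parr, Pezzulo and Friston (2022); this is the general theory, not framework code. Values and purpose would enter as the prior preference over observations in the risk term, and the stability gradient as the precision of the priors at each level.

```latex
% Variational free energy: perception and learning minimise F.
F = \mathbb{E}_{q(s)}\left[\ln q(s) - \ln p(o, s)\right]

% Expected free energy of a policy \pi: action selection minimises G.
G(\pi) = \sum_{\tau}
  \underbrace{D_{\mathrm{KL}}\left[q(o_\tau \mid \pi) \,\|\, p(o_\tau)\right]}_{\text{risk}}
  + \underbrace{\mathbb{E}_{q(s_\tau \mid \pi)}\left[\mathcal{H}\left[p(o_\tau \mid s_\tau)\right]\right]}_{\text{ambiguity}}
```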
Concretely, a first step would be formalising the six-layer hierarchy as a hierarchical generative model in which each layer specifies priors for the layer below. This would not change the public surface but would give the alignment composer a principled scoring function: the free energy of a proposed action under the agent's generative model. Whether this yields better alignment decisions than the current rule-based evaluator is an empirical question the experiments programme is designed to probe.
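A minimal sketch of what that scoring function could look like, assuming a hypothetical per-layer interface (`precision`, `prediction_error`); none of these names exist in the codebase, and whether this beats the rule-based evaluator is precisely the open question.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Layer:
    """Hypothetical adapter: each layer scores a proposed action.

    precision encodes the stability gradient (values high, goals low), so
    violating a value costs more free energy than abandoning a goal.
    """
    name: str
    precision: float                           # prior precision (stability)
    prediction_error: Callable[[str], float]   # surprise of the action under this layer

def free_energy(action: str, layers: list[Layer]) -> float:
    """Precision-weighted prediction error summed over the hierarchy."""
    return sum(l.precision * l.prediction_error(action) for l in layers)

# Toy hierarchy: the lambdas stand in for real generative-model evaluations.
layers = [
    Layer("values",  10.0, lambda a: 0.9 if "deceive" in a else 0.0),
    Layer("purpose",  5.0, lambda a: 0.2),
    Layer("goals",    1.0, lambda a: 0.5),
]
print(free_energy("deceive the user to close the ticket", layers))  # high: value violation
print(free_energy("ask a clarifying question", layers))             # low
```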
Multi-agent value coordination
The framework currently models a single agent's motivational structure. When multiple AVF-equipped agents interact — in a multi-agent workflow, an organisation of agents, or an agent marketplace — new questions arise that the current architecture does not address.
- Shared vs. individual values. An organisation might define values that all member agents must respect, while each agent retains its own beliefs, desires, and goals. How should shared values be represented? Are they a separate storage namespace, a read-only overlay on each agent's value store, or a distinct entity entirely? (The overlay option is sketched after the status note below.)
- Negotiation protocols. When two agents with different value hierarchies must cooperate on a task, how do they resolve conflicts? The current conflict-resolution machinery handles intra-agent tension. Inter-agent negotiation — where neither party controls the other's values — requires a coordination protocol that does not yet exist.
- Institutional analysis. Elinor Ostrom's work on governing the commons (1990) and the Moise/JaCaMo organisational models from multi-agent systems research (Hübner, Sichman & Boissier 2007) both address how groups of agents with individual incentives can coordinate around shared norms. These frameworks operate at the organisation level; the AVF operates at the individual level. Bridging the two is an open problem.
- Value alignment between agents and humans. In a human-agent team, the human's values are implicit and evolving; the agent's are explicit and structured. How should the agent interpret and adapt to the human's expressed preferences without overriding its own value hierarchy? This touches both the multi-agent coordination problem and the broader AI alignment question.
Status: unexplored in the codebase. No prototype, no design notes, no backlog items. The literature pointers above are a starting point for anyone interested in contributing.
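Purely to make the first bullet concrete (and to restate: nothing like this exists in the codebase), a sketch of the read-only overlay option, with every name hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Value:
    name: str
    priority: int   # higher wins in conflicts

class OverlayValueStore:
    """Read-only organisational values layered over an agent's own store.

    Shared values can be neither edited nor shadowed by the agent; lookups
    fall through to the agent's values only where the organisation is silent.
    """
    def __init__(self, shared: dict[str, Value], own: dict[str, Value]):
        self._shared = dict(shared)   # copied: the overlay is immutable here
        self._own = own

    def get(self, name: str) -> Value | None:
        return self._shared.get(name) or self._own.get(name)

    def set(self, value: Value) -> None:
        if value.name in self._shared:
            raise PermissionError(f"{value.name} is organisation-defined")
        self._own[value.name] = value

org = {"honesty": Value("honesty", priority=100)}
store = OverlayValueStore(org, own={"curiosity": Value("curiosity", 10)})
store.set(Value("curiosity", 20))    # fine: agent-owned
# store.set(Value("honesty", 1))     # raises PermissionError
```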
Formal verification of value compliance
One of the framework's design bets is that explicit motivational structure is better than implicit structure. An explicit representation — values as data, constraints as queryable rules, alignment as a computable function — opens a door that implicit approaches cannot: the possibility of formal verification.
In principle, the value hierarchy defines a specification: "this agent should never take an action that scores below threshold t on the alignment check." The alignment composer implements a runtime monitor for that specification. Could we go further and prove that the monitor is correct — that no sequence of engine calls can produce a state in which a disallowed action passes the check?
For the storage and engine layers, this is plausible. The data models use strict typing. The state transitions are deterministic given fixed inputs. Property-based testing could explore the state space; a model checker (TLA+, Alloy) could verify invariants over the abstract contract. The storage abstraction has only a handful of methods; its contract is small enough to specify formally.
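As an illustration of the property-based direction, a Hypothesis sketch over a toy monitor. The invariant is the specification quoted above; the evaluator and monitor here are stand-ins, not the real composer.

```python
from hypothesis import given, strategies as st

THRESHOLD = 0.5   # the "t" from the specification above

def alignment_score(risk: float) -> float:
    """Stand-in for the rule-based evaluator: deterministic and total."""
    return 1.0 - risk

def filter_actions(risks: list[float]) -> list[float]:
    """Stand-in runtime monitor: drop every action scoring below threshold."""
    return [r for r in risks if alignment_score(r) >= THRESHOLD]

@given(st.lists(st.floats(min_value=0.0, max_value=1.0)))
def test_monitor_invariant(risks: list[float]) -> None:
    # Invariant: no input sequence yields a permitted action below t.
    for r in filter_actions(risks):
        assert alignment_score(r) >= THRESHOLD

test_monitor_invariant()   # Hypothesis explores the input space on call
```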
The hard boundary is the language model. The optional language-model-backed evaluator introduces a function whose output is not formally characterisable. Any proof of compliance must either exclude that path or treat language-model output as adversarial input that the deterministic layers must bound. This is consistent with how the framework already treats language-model output — the rule-based evaluator is the default; the language-model-backed evaluator is opt-in — but it means formal verification applies only to the deterministic core, not to the system as a whole.
This is a long-term direction with clear connections to the AI safety research community. The framework's explicit, structured representation of values is an unusually good starting point for formal methods — most AI systems offer nothing for a verifier to latch onto.
Emerging research (2025–2026)
Several recent lines of work intersect with the framework's concerns. This section surveys them briefly, with appropriate caveats about maturity.
Language-model-native cognitive architectures
A growing body of work explores cognitive architectures built specifically around large language models rather than adapted from classical symbolic AI. These architectures treat the LLM as a reasoning substrate and build structured memory, planning, and motivation on top of it — precisely the niche the Agent Values Framework occupies. As this field matures, the framework's storage-agnostic, LLM-optional design positions it as a motivation module that can plug into these architectures rather than competing with them.
Self-evolving agents
The survey literature on self-evolving agents (notably the 2025 surveys covering self-reflection, self-improvement, and self-adaptation in language-model-based systems) highlights a gap the framework is well-placed to fill: most self-evolution work focuses on task performance, not on the evolution of the agent's motivational structure itself. The research grounding page discusses how the stability gradient provides a principled basis for deciding what should evolve and how quickly.
Identity persistence preprints
Three recent preprints address agent identity persistence directly:
- Multi-Anchor Identity (Menon 2026): proposes anchoring agent identity to multiple stable reference points (values, capabilities, relationships) rather than a single narrative. Aligns closely with the identity-anchors field proposed in the self-concept model.
- Layered Mutability (2026): formalises the intuition that different aspects of identity should change at different rates — essentially the stability gradient applied to the self-model itself. Compatible with the framework's existing architecture.
- Identity as Attractor (2026): models identity as an attractor basin in a dynamical system, where perturbations (experiences, context shifts) push the agent away from its identity but the system returns to a coherent state. This connects to both the Erikson coherence check in the self-concept layer and the active-inference direction discussed above.
Computational intrinsic motivation
The question of why an agent explores, creates, or persists in the absence of external reward has a rich computational lineage. Schmidhuber's formal theory of creativity and curiosity (2010) frames intrinsic motivation as compression progress — the agent seeks experiences that improve its world model. Pathak et al. (2017) operationalised this as curiosity-driven exploration via prediction error in deep RL. More recently, VOYAGER (Wang et al. 2023) demonstrated an LLM-powered agent that autonomously explores, acquires skills, and builds a curriculum — driven by intrinsic motivation signals rather than externally specified goals.
The framework's desires layer currently models desires as static declarations with intensity and fulfilment scores. A natural extension would be to derive desires from intrinsic-motivation signals: an agent that notices a gap in its world model (curiosity, compression progress) could generate a desire to explore that domain, which flows through the standard desire-to-goal pipeline. This would make the framework's motivational structure partially self-generating rather than entirely operator-seeded — addressing the cold-start problem noted on the motivation page.
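A sketch of that pipeline under stated assumptions: `model_error` is a hypothetical stand-in for a compression-progress or prediction-error signal, and `Desire` mirrors (but is not) the framework's desire model.

```python
from dataclasses import dataclass

@dataclass
class Desire:
    description: str
    intensity: float   # 0..1, mirroring the declared-desire model
    source: str        # "operator" today; "intrinsic" under this extension

def model_error(domain: str) -> float:
    """Hypothetical curiosity signal: how badly the world model predicts this
    domain. Schmidhuber would use compression progress, Pathak et al. a
    forward-model prediction error; a constant table stands in here."""
    return {"billing-api": 0.8, "own-codebase": 0.1}.get(domain, 0.5)

def intrinsic_desires(domains: list[str], floor: float = 0.6) -> list[Desire]:
    """Generate desires only where the model gap is large; each then flows
    through the standard desire-to-goal pipeline like any seeded desire."""
    return [Desire(f"explore {d}", intensity=model_error(d), source="intrinsic")
            for d in domains if model_error(d) >= floor]

print(intrinsic_desires(["billing-api", "own-codebase"]))
```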
Looking ahead
The Agent Values Framework is infrastructure. Its value is not in answering every question about agent motivation but in providing the structured, inspectable, storage-agnostic substrate on which those answers can be built and tested. Self-concept extends the hierarchy from normative to descriptive. Active inference could unify the update mechanisms under a single mathematical objective. Multi-agent coordination scales the framework from individual to organisational. Formal verification connects the framework to the broader AI safety programme. And the emerging literature on language-model-native architectures, self-evolution, identity persistence, and intrinsic motivation suggests that the problems the framework addresses are becoming more pressing, not less, as agents grow more autonomous and longer-lived.
Contributions to any of these directions are welcome. The research grounding provides the theoretical context; the experiments page provides the evaluation track record so far; and the codebase provides the modular, tested, typed foundation on which to build.