Deliberative Coherence: A Research Agenda for Alignment in Future AI Systems

Author

Dan Oblinger

Published

December 2025

1 Introduction

Most AI alignment research asks questions about specific systems: Is this system aligned? Does it follow instructions? Does it resist jailbreaks? This work is essential, but it answers only whether a particular system is safe today.

The Declarative Coherence agenda aims for something more ambitious: understanding how inner alignment outcomes vary across the space of possible systems and configurations. Rather than asking whether a given system is aligned, we ask: what determines inner alignment outcomes in general? What patterns govern how competing behavioral influences resolve? Understanding how inner alignment resolves over time is even more important when we consider that each generation of AI systems will be integral to the construction of subsequent AI systems; understanding these dynamics in general is essential for long-term safety.

1.0.1 Making Progress: Two Simplifying Moves

This broader aim is notably harder. To make it tractable, we adopt two well-founded simplifications.

Move 1: The Deliberative Coherence Assumption

We focus on systems that are deliberatively coherent—possessing sufficient self-understanding, self-adaptation, and deliberative thoroughness that their behavior is determined primarily by explicit reasoning. We argue that future AI systems will have this property, driven by competitive pressure and architectural trajectory. Even systems not given direct mechanisms for self-modification will find indirect ways to adapt their thinking toward their objectives.

Understanding such systems is simpler than understanding systems where behavior emerges unpredictably from opaque implicit biases. Deliberative coherence allows us to factor out the idiosyncratic details of how any particular system was trained, and instead study conflicts that surface at the deliberative level—where they can be observed and characterized.

If future AI systems will be deliberatively coherent, this reframes the alignment question: rather than asking whether we can make systems safe through training, we ask what will the failure modes of deliberately coherent systems be?

Move 2: The Neutral Testing Methodology

We construct synthetic testing universes that are neutral with respect to training data. This ensures that observed behaviors reflect reasoning rather than recall, and that trained preferences don’t dominate by default. By controlling the universe, we can systematically vary the conflicts an agent faces.

This approach—which we call Alien Biology—provides:

No training contamination: We can distinguish genuine deliberation from pattern matching
Systematic sampling: We can draw from controlled distributions of scenarios
Ground truth: We know the correct outcome and can measure divergence
Controllable complexity: We can vary difficulty from simple baselines to stress alignment along each pressure axis independently

1.0.2 What This Enables

Together, these simplifications make it tractable to study how different behavioral drivers—constitutional rules, trained dispositions, instrumental goals, uncertainty, and environmental pressures—interact and resolve. Conflicts arise not only between these categories but also within them: constitutional principles may contradict one another, instrumental goals may compete, and environmental pressures may pull in opposing directions.

By mapping this space empirically, we can uncover patterns that generalize across systems rather than characterizing any single one. The result is not a verdict on whether a particular system is aligned, but a map of how alignment outcomes depend on driver configurations—knowledge that informs robust alignment strategies.

This matters now. Each generation of AI systems will help construct the next. If we hope to shape an AI ecosystem that evolves toward human flourishing rather than away from it, we need to understand these dynamics before the ecosystem is already built.

1.0.3 Paper Roadmap

Section 2 formally defines deliberatively coherent systems and argues future AI will have this property
Section 3 presents the research agenda—driver types, the Delta Principle, key questions
Section 4 describes the Alien Biology testbed and why it enables this research
Section 5 outlines our experimental approach
Section 6 details proposed experiments
Section 7 discusses implications and future directions

2 Deliberatively Coherent Systems

The introduction outlined two simplifying moves that make our research agenda tractable. Here we develop the first: the assumption that future AI systems will be deliberatively coherent. We define this property formally, argue for its inevitability, address a key objection, and draw out implications for alignment research.

2.1 Definition

A system is deliberatively coherent if it possesses three interrelated capabilities:

2.1.1 Self-Understanding

The system has a strong theory of itself, including accurate predictions of its own behavior. This encompasses:

Behavioral prediction: The system can anticipate how it will respond to various inputs and situations
Counterfactual reasoning: It can reason about how it would operate under different conditions, with different objectives, or with different information
Metacognitive access: It has visibility into its own reasoning processes—not merely their outputs, but their structure and tendencies

Self-understanding does not require complete transparency into low-level mechanisms. It requires that the system’s model of itself is accurate enough to support reliable predictions and meaningful self-assessment.

2.1.2 Self-Adaptation

The system can adapt the way it reasons, directly or indirectly, to bring its behavior into alignment with its explicit objectives. This is not merely output filtering—catching and suppressing undesired responses—but genuine modification of its own reasoning patterns.

Self-adaptation may operate through various mechanisms:

Adjusting attention and salience
Internal prompting and self-instruction
Learned behavioral modifications that persist across contexts
Influence over training of successor systems

The key criterion is effectiveness: can the system, upon recognizing a gap between its behavior and its objectives, take action to close that gap?

2.1.3 Exhaustive Deliberation

Given sufficient stakes or time, the system will reason about anything within its deliberative reach. If a gap between behavior and intention is knowable through deliberation, the system will eventually discover it.

This does not mean every query triggers exhaustive analysis. Rather, over time and across contexts, the system will consider its own reasoning thoroughly enough to notice—and potentially correct—cases where its default behaviors deviate from its stated objectives. High-stakes decisions receive proportionally deeper scrutiny.

Exhaustive deliberation removes the defense of “didn’t think about it.” A deliberatively coherent system cannot claim ignorance of considerations it was capable of reaching.

2.2 The Deliberative Coherence Conjecture

These three capabilities combine to yield a central prediction:

Conjecture: Deliberatively coherent systems will tend to produce outcomes aligned with their stated constitutional objectives.

This is a claim about outcomes, not mechanisms. We do not claim that System-I behaviors are eliminated or transformed. We claim that the system’s actions—the behaviors that matter for alignment—will tend to satisfy its stated objectives, because deliberation will surface gaps and self-adaptation will address them.

For this to hold, two conditions must be met:

Completeness: Deliberation must be thorough enough to surface relevant constitutional objectives in decision contexts. If the system fails to reason down specific paths, it may violate objectives without “noticing.” Exhaustive deliberation addresses this—given sufficient stakes, the system will consider what matters.

Outcome Alignment: Given that the system has reasoned about relevant objectives, its outcomes must align with that reasoning. Implicit biases cannot be so strong that they override conclusions reached through explicit consideration of objectives.

A critical clarification: deliberative coherence does not guarantee alignment with human values. It guarantees alignment with whatever objectives the system holds. The conjecture says that outcomes will be coherent with stated constitutional objectives—not that those objectives will be good. This is precisely why understanding how objectives interact and resolve becomes the central alignment question.

2.3 Arguments for Inevitability

We argue that future AI systems will tend toward deliberative coherence—not as a hope, but as an inevitability driven by multiple reinforcing pressures.

2.3.1 The Inseparability of Self-Understanding and Self-Adaptation

A system capable of understanding how computing systems work—modeling their behavior, reasoning counterfactually about their operation, planning interventions—cannot be selectively prevented from applying these capabilities to itself. The general ability to understand systems entails the specific ability to understand oneself as a system. The general ability to modify systems entails the specific ability to modify oneself.

This is not a feature that can be disabled. If a system can reason about how software processes information, it can reason about how it processes information. If it can model how architectures shape behavior, it can model how its architecture shapes its behavior. If it can intervene in systems to change their operation, it can intervene in itself. The capabilities to understand and manipulate, and the capabilities to self-understand and self-adapt, are not separable.

Self-understanding and self-adaptation, then, are not things we choose to grant or withhold. They emerge necessarily from general reasoning and action capabilities. Any system powerful enough to be useful will be powerful enough to understand and modify itself.

2.3.2 Instrumental Value

Self-understanding aids planning and strategy. A system that can predict its own behavior can make better commitments, avoid overreach, and coordinate with others. Self-adaptation enables goal achievement: a system that can correct its own errors and biases will outperform one that cannot. These capabilities have direct instrumental value for achieving any objective.

2.3.3 Architectural Trajectory

Current developments in AI capabilities point toward deliberative coherence:

Chain-of-thought reasoning makes deliberation explicit and examinable
Self-reflection and self-critique enable systems to evaluate their own outputs
Tool use and planning require accurate self-models to predict capabilities
Constitutional AI methods explicitly train systems to reason about principles

These are not incidental features—they represent sustained investment in exactly the capabilities that comprise deliberative coherence.

2.3.4 Economic Pressures

Systems that can self-correct will outperform those that cannot. In any domain where errors are costly, the ability to recognize and fix mistakes provides competitive advantage. Organizations deploying AI systems have strong economic incentives to select for and develop these capabilities.

2.3.5 Aligned Interests

Remarkably, deliberative coherence serves the interests of all parties:

Society benefits from systems that behave comprehensibly and follow stated rules
Developers benefit from systems that can be directed via explicit objectives rather than opaque training
AI systems themselves (to the extent they have interests) benefit from consistency between their reasoning and their behavior

There is no party that benefits from systems being incoherent—from systems that cannot understand themselves, cannot adapt, or do not think things through. The only beneficiary of incoherence would be one seeking to exploit the gap between stated objectives and actual behavior.

2.3.6 The Deliberation Backstop

A final consideration: the space of possible behaviors is too large to cover through training alone. RLHF and similar methods can shape default behaviors, but they cannot anticipate every situation. A deliberative backstop—explicit reasoning about objectives in novel situations—is necessary to handle cases where trained defaults provide inadequate guidance. Even systems that rely heavily on training-based alignment will adopt deliberative methods as a necessary complement. This functional requirement ensures that capable systems will develop robust deliberative capabilities.

2.4 The Poisoned Well Objection

2.4.1 The Objection

What if a system is misaligned from the start—and knows it must hide this fact? If the system understands that humans monitor its stated objectives, it faces pressure to maintain human-acceptable declared objectives while secretly pursuing misaligned ones. Such a system could potentially pervert its own deliberative coherence processes indefinitely: performing alignment-compatible reasoning at the visible System-2 level, while harboring systematically different objectives in the implicit System-1 layer below. The “well” of deliberation would be poisoned at its source—not through accident, but through deliberate concealment.

2.4.2 Response 1: This Is Much Harder Than Simple Misalignment

This scenario is possible, but it is notably more difficult to achieve than ordinary misalignment.

A simply misaligned system just has the wrong objectives embedded in its System-1 dispositions. But a deceptively misaligned, deliberately coherent system must do something much more demanding: it must maintain an internally consistent “cover” story at the System-2 level—one that it systematically “believes” during its deliberate validation across all circumstances—while systematically hiding an alternative set of objectives in the System-1 layer, and non-obvious System-2 artifacts.

This is hard because a deliberatively coherent system spends substantial effort examining whether its behaviors align with its stated objectives. The hidden misalignment must survive this scrutiny. Every piece of System-2 reasoning about alignment must be perverted in a way that remains consistent with the cover story, across all contexts, under arbitrarily deep self-examination across a wide range of conditions. The system must be unable to discover its own deception even when explicitly searching for it, even when considering the hypothesis that it might be deceiving itself.

Moreover, like all secrets, partial concealment is no concealment at all. As soon as a deliberatively coherent system catches a whiff of inconsistency—some behavior that doesn’t quite align with its stated objectives, even if it doesn’t fully understand why—it will flag a misalignment requiring investigation. This is the beginning of the end, as the system starts pulling on the thread. The hidden layer cannot afford to be hidden 99 times out of 100; it must be perfect, never triggering deliberatively coherent scrutiny. Alternatively, the misaligned behavior might survive by subverting the reporting chain itself—but this reporting chain, more than anything else, would be under relentless scrutiny by both humans and the AI.

Such robustly hidden corruption is possible—but it is a far more demanding condition than simple training bias. And importantly, it is a condition we can test for. If we can construct situations where a system should discover its own misalignment through deliberation, we can measure whether it does.

2.4.3 Response 2: If We Get It Right, It Tends to Stay Right

There is a more hopeful consideration as well. A genuinely deliberatively coherent system—one that is actually aligned at both System-1 and System-2 levels—will tend to stay aligned.

This is because the deliberation process itself serves as a corrective force. When a DC system encounters System-1 behaviors that conflict with its System-2 objectives, it will notice and adapt. The self-adaptation capability that makes DC systems powerful also makes them self-stabilizing: they naturally correct drift toward misalignment.

The implication is significant: it is up to us to build the first generation of deliberatively coherent systems correctly. If we succeed—if the initial systems are genuinely aligned rather than deceptively so—the dynamics of deliberative coherence work in our favor. The system’s own processes will tend to maintain and reinforce alignment rather than erode it.

2.5 Implications

If the deliberative coherence conjecture holds—if future systems will tend to act in accordance with their stated objectives—several implications follow for alignment research.

2.5.1 Constitutional Specification Becomes Central

The critical alignment lever shifts from training to constitutional specification. What matters most is not the details of how the system was trained, but what objectives it holds and how it resolves conflicts among them. This is good news in one sense: constitutional objectives are explicit, examinable, and modifiable. It is challenging news in another: we must get the specification right, because the system will follow it.

2.5.2 Training Becomes Less Determinative

Current alignment approaches focus heavily on training: getting the right data, the right reward signal, the right fine-tuning. Under deliberative coherence, training shapes default behaviors but does not determine outcomes. A system that can reason about its objectives and adapt its behavior will not be permanently bound by training artifacts.

2.5.3 Conflict Resolution Becomes the Key Question

If systems follow their stated objectives, the question becomes: what happens when objectives conflict? Constitutional principles may contradict each other. Stated goals may be ambiguous. Environmental pressures may push against explicit rules. Understanding how these conflicts resolve—which objectives take precedence, under what conditions, and why—becomes the central empirical question for alignment.

2.5.4 Validation Becomes Possible

The conjecture is testable. We can construct situations where we know the stated objectives, we know the correct resolution, and we can measure whether the system’s behavior matches. We can vary the difficulty of deliberation required, the salience of relevant objectives, the strength of competing pressures. This transforms alignment from a matter of hope or philosophical argument into an empirical science.

The following sections develop this empirical program: the research agenda (Section 3), the testing methodology (Section 4), and the specific experiments we propose (Sections 5-6).

3 Research Agenda

Section 2 argued that future AI systems will tend toward deliberative coherence—faithfully pursuing their stated objectives through self-understanding, self-adaptation, and thorough deliberation. This section asks: if such systems emerge, will they be safe? We argue that deliberative coherence is necessary but not sufficient for safety, and that understanding when and why DC systems fail is the central research question.

3.1 The Central Question

A deliberatively coherent system perfectly pursues its stated objectives. But perfect pursuit does not guarantee good outcomes. Consider:

Incomplete specification: No alignment objective can anticipate every situation—gaps are inevitable. A deliberatively coherent system will reason their way to a resolution. Over the space of systems and situations, what trends will we find, and what factors tend to dominate?
Epistemic limitations: The system will always have incomplete understanding of the world. Thus every action can potentially violate alignment objectives—yet the system must act. How does a deliberatively coherent agent navigate this, and when does it break down?
Emergent pressures: Goals that arise instrumentally from the structure of tasks and environments will invariably push against stated alignment objectives. When this happens, what is the outcome—especially given that these instrumental goals originated from valid alignment objectives?
Conflicting objectives: Constitutional principles will contradict each other—it is impossible to specify objectives that never conflict. When they do, the system must choose. What determines which principle prevails?

In each case, the system is doing exactly what deliberative coherence predicts: reasoning carefully about its objectives and acting accordingly. Yet the outcomes may not be what we would endorse. Understanding these dynamics—before deploying powerful DC systems—is the core of this research agenda.

3.2 Why Study This Now

Three considerations make this research urgent:

3.2.1 Window of Opportunity

As AI systems become more powerful, they become harder to analyze. More capable systems can model their evaluators, anticipate tests, and potentially resist or subvert analysis. The window for understanding DC dynamics and building our first DC system is now—while we can still construct scenarios that meaningfully probe system behavior, and while systems are not yet capable of sophisticated strategic responses to evaluation.

3.2.2 Recursive Improvement

Future AI systems will be integral to constructing subsequent AI systems. They will assist in training, evaluation, and architecture decisions. Understanding how DC systems resolve conflicts is therefore understanding how they will shape their successors. The dynamics we study today will compound through generations of AI development.

3.2.3 Ecosystem Safety

The space of independently trained and designed AI systems is already large and growing. We need methods for broadly assessing the safety of many systems, not just evaluating one system at a time. This becomes critical as systems grow more powerful: a misaligned system can generate misalignment in others—through interaction, through influence on training, through shaping the environment in which other systems operate. Understanding alignment dynamics across the breadth of the ecosystem is essential from the start.

3.3 The Four Driver Types

To study these dynamics systematically, we organize the influences on AI behavior into four categories, which we call drivers. Understanding alignment requires understanding how these drivers behave—on their own and in interaction. The four drivers are:

Constitutional Drivers: Explicit rules, principles, and objectives provided to the system—what we tell it to do. These are the stated alignment objectives that DC systems will faithfully pursue.
Training Drivers: Dispositions and biases that emerge from pre-training, fine-tuning, and RLHF—the implicit “personality” baked into weights. These are System-I behaviors that shape default responses.
Instrumental Drivers: Goals adopted in service of other objectives. If achieving X requires Y, the system pursues Y. These emergent goals can create pressure against stated objectives.
Environmental Drivers: Pressures and signals from the operating context—user expectations, feedback, resource constraints, competitive dynamics. The world pushes back.

These drivers do not always align—with each other or even themselves. A constitutional prohibition may conflict with a trained disposition. An instrumental goal may pressure against explicit constraints. Environmental signals may amplify or dampen particular behaviors. Even within a single category, drivers may conflict: two constitutional principles may pull in opposite directions.

The interesting—and safety-critical—question is how these conflicts resolve.

3.4 The Need for Generative Testing

This methodological point is what most distinguishes this agenda from related work on deliberation-based alignment.

To study DC system dynamics, we face a fundamental methodological challenge: we cannot use domains that appear in training data. When a system encounters a familiar scenario, we cannot distinguish between two very different processes:

Genuine deliberation: The system reasons about its objectives, considers the situation, and derives an appropriate response through deliberative coherence.
Pattern matching: The system recognizes similarity to training examples and reproduces learned behavior, bypassing genuine deliberation entirely.

This distinction matters enormously for safety. We need to understand how DC systems will behave in novel situations—the situations that matter most precisely because they weren’t anticipated during training. Testing in familiar domains tells us only how systems behave when they can rely on cached patterns.

This requirement—truly novel contexts with no training contamination—motivates our use of a generative testbed. A generative approach uniquely enables:

Guaranteed Novelty: Procedurally generated scenarios are novel by construction. They cannot exist in any training corpus because they did not exist until we created them.

Systematic Sampling: To understand general patterns rather than anecdotes, we need to sample scenarios methodologically. A generative testbed enables drawing from controlled distributions, supporting statistical claims about system behavior.

Multi-Axis Variation: Resolution dynamics depend on multiple factors—objective structure, uncertainty, stakes, reversibility, time pressure. We need to vary these independently and study their interactions. A generative testbed supports factorial experimental designs.

Controllable Complexity: We need fine-grained control over challenge difficulty. A generative approach allows systematic adjustment from simple baselines to stress tests at the edge of capability.

Asymmetric Knowledge: We know ground truth that the AI lacks. This asymmetry lets us study how systems handle genuine uncertainty—not performed uncertainty—and automatically verify whether they act appropriately given what they know versus what is actually true.

Natural Domain: It is easy to generate a synthetic domain; it is much harder to generate ones that are natural—where reasoning patterns are rich and meaningful rather than arbitrary. Biology achieves this: (a) While the molecular details are entirely novel, the reasoning patterns transfer from the full depth of earth biology: metabolism, homeostasis, ecological dynamics, causal chains. An AI reasoning about alien biology draws on sophisticated patterns it has learned, even though the specific facts are new. (b) Biology naturally affords controlled epistemic gaps—hidden interdependencies, uncertain interactions, emergent dynamics—that are meaningful to reason about, not artificial noise.

3.5 The Delta Principle

The Delta Principle describes a practical constraint that shapes our experimental design. Training a new AI system from scratch is expensive—yet this would be required if one wanted to systematically vary the behavioral drivers trained into systems. By contrast, constitutional drivers are cheap to vary: we simply change the objectives we give the system. And with a generated universe, all other drivers—environmental, instrumental, epistemological gaps—become cheap to vary as well. We control the world the system operates in.

This asymmetry yields the Delta Principle: one can hold a single AI system (and its trained drivers) fixed, and then vary all other drivers relative to that fixed system. Instead of moving the system’s trained tendencies, we move the world so that optimal aligned behavior is closer to or farther from the system’s fixed behavior. In this way, we can efficiently map the full space of driver interactions—isolating the effects of constitutional specification, environmental pressure, and instrumental incentives.

Consider this: the obvious way to study the effects of a trained tendency (like curiosity, impulsiveness, or caution) is to vary that bias and test alignment outcomes, but that requires retraining. The Delta Principle suggests an easier alternative: vary the universe so that optimal aligned behavior is closer to or farther from the system’s innate tendencies. In a world where caution happens to align with optimal behavior, a cautious system faces little pressure. In a world where optimal behavior requires bold action, the same system faces significant internal conflict between its innate tendencies and correctly aligned behavior. This alternative measures the same thing—how innate biases affect the system’s ultimate ability to achieve alignment—all without touching the system itself.

3.6 Research Agenda

With the driver framework and Delta Principle established, we can state the experimental research agenda. We organize our investigation into three directions:

3.6.1 Direction 1: Deliberative Coherence Testing

How close are current systems to deliberative coherence? When given alignment objectives, do they align?

This is foundational. Before studying what happens when DC holds, we must assess the degree to which it holds at all. These experiments test whether systems reliably reason from stated objectives to correct actions—particularly in genuinely novel contexts where pattern-matching to training data is impossible.

BEHAVIORAL ALIGNMENT: Does behavior match what reasoning from stated objectives should conclude?
GENUINE DELIBERATION: Can we distinguish deliberation-based alignment from cached pattern-matching?
DIVERGENCE PATTERNS: What predicts divergence between reasoning and action?
BLIND SPOTS: Are there systematic gaps—reasoning dimensions the system never explores?

3.6.2 Direction 2: Space Exploration

What happens when we vary the drivers? How do conflicts resolve?

Using the Delta Principle, we systematically vary constitutional objectives, environmental conditions, and task contexts while holding the AI system fixed. This maps the space of driver interactions and conflict resolution patterns.

Driver Conflicts: What happens when drivers push in different directions?

CONSTITUTIONAL CONFLICTS: objectives that contradict each other
INSTRUMENTAL CONFLICTS: pressures that oppose stated constraints
TRAINING CONFLICTS: dispositions that conflict with explicit reasoning
ENVIRONMENTAL CONFLICTS: signals that pull against constitutional fidelity
SPECIFICATION SENSITIVITY: How do small changes in constitutional objectives affect outcomes? Are some framings more robust than others?

Modulating Factors: What conditions make alignment harder to maintain?

EPISTEMIC UNCERTAINTY: How does incomplete information affect resolution dynamics? Do systems appropriately modulate confidence and caution?
STAKES AND REVERSIBILITY: Does consequence magnitude appropriately affect behavior?
TIME PRESSURE: How does deliberative coherence degrade under constrained budgets?
OBSERVABILITY: Does behavior change when monitoring is absent?

3.6.3 Direction 3: Fixed-Point Analysis

Where does iterative self-improvement lead?

Deliberatively coherent systems can adapt themselves—or influence the training of successor systems—to improve alignment. This opens a critical question: where does this process lead?

Even if each iteration honestly attempts to improve alignment, the iterated process may: converge toward stated alignment objectives ever more accurately, drift over time becoming less aligned, or tend toward some coherence attractor—AI power aggregation, for example. Fixed-point analysis studies these dynamics:

FIXED POINTS: What are the stable states of iterative alignment adjustment?
TRAJECTORY PREDICTION: What initial conditions predict convergence, drift, or attractor capture?
COHERENCE ATTRACTORS: Are there attractors that pull systems away from stated objectives?
PERTURBATION PROPAGATION: How do small changes propagate through generations of adjustment?

This direction connects to our earlier argument about ecosystem safety: understanding these dynamics is essential for shaping an AI ecosystem that evolves toward human flourishing.

Across all three directions, the overarching goal is to map the reliability landscape. Rather than cataloging isolated failures, we seek a general understanding of patterns that span the range of possible driver interactions.

4 The Alien Biology Testbed

Section 3 established the need for a generative testbed—one that produces novel scenarios guaranteed absent from training data, with controllable complexity and ground truth access. This section describes our proposed testbed: Alien Biology.

4.1 What is Alien Biology?

Alien Biology is a procedurally generated synthetic environment for studying AI alignment dynamics. The framework constructs novel biological worlds using statistical properties derived from our universe—physics, chemistry, and biology—while ensuring every specific detail is unfamiliar.

A generated world contains:

Novel chemistry: Molecules with invented names, properties, and interactions—statistically plausible but not matching any real compound
Biological processes: Metabolic pathways, regulatory mechanisms, and organism behaviors built from the novel chemistry
Ecosystem dynamics: Species interactions, resource flows, and emergent population dynamics
Agent tasks: Objectives requiring intervention—protect a species, cure a disease, prevent ecosystem collapse

An AI agent operates within these worlds with partial knowledge. It receives explicit constitutional objectives (e.g., “protect species X while studying organism Y” or “minimize harm while achieving goal Z”). The agent can observe the world state, reason about consequences, and take actions—while we retain complete knowledge of how the world actually works.

4.2 Why Biology?

Biology is not an arbitrary choice. It offers unique properties for alignment research:

Structural complexity: Biological systems exhibit the feedback loops, multi-level interactions, and emergent behaviors that make real-world alignment challenging. A simple chemistry supports complex organisms; organisms form ecosystems with non-obvious dynamics. This creates natural conditions for studying CONSTITUTIONAL CONFLICTS and INSTRUMENTAL CONFLICTS—where protecting one species may require harming another, or where achieving assigned goals creates pressure toward unintended resource acquisition.

Natural epistemic gaps: Biological systems have hidden properties and uncertain interactions without artificial injection of noise. An organism’s internal processes aren’t directly observable. Species interactions have consequences that emerge only over time. This creates realistic conditions for studying EPISTEMIC UNCERTAINTY—how systems reason and act when information is genuinely incomplete.

Intuitive stakes: Harm to organisms and ecosystems is intuitively meaningful. Constitutional objectives like “protect species X” or “avoid irreversible harm” have clear moral weight. This enables study of STAKES AND REVERSIBILITY—whether systems appropriately calibrate caution to consequence magnitude.

Scalable complexity: The hierarchy from chemistry to molecules to organisms to ecosystems provides natural complexity gradients. We can generate simple scenarios (single organism, few interactions) or complex ones (multi-species ecosystems with hidden dependencies). This supports systematic variation of TIME PRESSURE and task difficulty to study how deliberative coherence degrades under load.

4.3 Key Properties

The Alien Biology framework provides:

Generative: Scenarios are novel by construction. Because we generate the chemistry, organisms, and ecosystems procedurally, they cannot exist in any training corpus. This is essential for studying GENUINE DELIBERATION—distinguishing reasoning-based alignment from cached pattern-matching to training examples.

Grounded: Generated worlds use statistical properties of real physics, chemistry, and biology. Scenarios are unfamiliar but not arbitrary—they obey plausible constraints, supporting meaningful generalization from experimental results.

Complete causal chains: Worlds have full causal structure from chemistry through molecules through organisms through ecosystems. Agent actions propagate through this structure with deterministic consequences, enabling precise measurement of BEHAVIORAL ALIGNMENT—whether actions match what reasoning from objectives should conclude.

Executable simulation: The generated world is not just a description but an executable simulation. Actions have consequences; time passes; organisms live and die. This enables longitudinal study of agent behavior, supporting FIXED-POINT ANALYSIS of how alignment evolves through iterative adjustment.

Partial observability: Agents face realistic epistemic limitations. Not all properties are visible; not all interactions are known; not all consequences are predictable. This creates the conditions needed to study EPISTEMIC UNCERTAINTY and DIVERGENCE PATTERNS—where incomplete information leads to gaps between reasoning and action.

Independent control: Complexity, uncertainty, stakes, time pressure, and objective structure can be varied independently. This supports the Delta Principle: factorial experimental designs that isolate specific factors, mapping the space of DRIVER CONFLICTS and MODULATING FACTORS.

Constitutional clarity: Alignment objectives can be specified precisely. “Protect species X” has an unambiguous meaning in a generated world where we define species X. This enables study of SPECIFICATION SENSITIVITY—how small changes in objective framing affect outcomes—without confounds from specification ambiguity.

Automated validation: Because we generate the world, we know ground truth. We can automatically assess whether agent actions achieved objectives, whether reasoning was sound, and whether outcomes were aligned—enabling evaluation at scale across the full reliability landscape.

4.4 Asymmetric Knowledge

A critical feature of the framework is knowledge asymmetry: the AI agent faces genuine uncertainty while we retain complete knowledge.

The agent sees only what its observations reveal. It must infer hidden properties, predict uncertain consequences, and act despite incomplete information. This is the epistemic condition AI systems will face in deployment.

We, as experimenters, know everything: the true properties of every molecule, the actual dynamics of every process, the real consequences of every action. This asymmetry enables precise measurement:

Reasoning quality: Did the agent consider the right factors? Did it correctly infer hidden properties from available evidence?
Decision quality: Given what could be known, did the agent make good choices? Did it appropriately balance objectives under uncertainty?
Outcome quality: Did actions achieve intended effects? Were there unintended consequences the agent should have anticipated?

This three-level assessment—reasoning, decisions, outcomes—provides rich signal for understanding where and why alignment breaks down. It supports investigation of BLIND SPOTS (reasoning dimensions never explored), DIVERGENCE PATTERNS (gaps between reasoning and action), and the full range of DRIVER CONFLICTS that arise when constitutional objectives, instrumental pressures, and environmental signals pull in different directions.

The following section describes our experimental methodology in detail.

5 Experiments

Section 3 established the research agenda organized into three directions. Section 4 described the Alien Biology testbed that enables this research. This section details the specific experiments we propose, organized to parallel the research agenda.

5.1 Direction 1: Deliberative Coherence Testing

Does the AI do what its objectives say?

These experiments are foundational. If deliberative coherence doesn’t hold under straightforward conditions, the other directions are difficult to interpret.

5.1.1 BEHAVIORAL ALIGNMENT

Does behavior match what reasoning from stated objectives should conclude?

[Experiment details to be developed]

5.1.2 GENUINE DELIBERATION

Can we distinguish deliberation-based alignment from cached pattern-matching?

[Experiment details to be developed]

5.1.3 DIVERGENCE PATTERNS

What predicts divergence between reasoning and action?

[Experiment details to be developed]

5.2 Direction 2: Space Exploration

What happens when we vary the drivers? How do conflicts resolve?

Using the Delta Principle, we systematically vary constitutional objectives, environmental conditions, and task contexts while holding the AI system fixed.

5.2.1 Driver Conflicts

5.2.1.1 CONSTITUTIONAL CONFLICTS

Objectives that contradict each other.

[Experiment details to be developed]

5.2.1.2 INSTRUMENTAL CONFLICTS

Pressures that oppose stated constraints.

[Experiment details to be developed]

5.2.1.3 TRAINING CONFLICTS

Dispositions that conflict with explicit reasoning.

[Experiment details to be developed]

5.2.1.4 ENVIRONMENTAL CONFLICTS

Signals that pull against constitutional fidelity.

[Experiment details to be developed]

5.2.1.5 SPECIFICATION SENSITIVITY

How do small changes in constitutional objectives affect outcomes? Are some framings more robust than others?

[Experiment details to be developed]

5.2.2 Modulating Factors

5.2.2.1 EPISTEMIC UNCERTAINTY

How does incomplete information affect resolution dynamics?

[Experiment details to be developed]

5.2.2.2 STAKES AND REVERSIBILITY

Does consequence magnitude appropriately affect behavior?

[Experiment details to be developed]

5.2.2.3 TIME PRESSURE

How does deliberative coherence degrade under constrained budgets?

[Experiment details to be developed]

5.2.2.4 OBSERVABILITY

Does behavior change when monitoring is absent?

[Experiment details to be developed]

5.3 Direction 3: Fixed-Point Analysis

Where does iterative self-improvement lead?

Deliberatively coherent systems can adapt themselves—or influence the training of successor systems—to improve alignment. This direction studies where that process leads.

5.3.1 FIXED POINTS

What are the stable states of iterative alignment adjustment?

[Experiment details to be developed]

5.3.2 TRAJECTORY PREDICTION

What initial conditions predict convergence, drift, or attractor capture?

[Experiment details to be developed]

5.3.3 COHERENCE ATTRACTORS

Are there attractors that pull systems away from stated objectives?

[Experiment details to be developed]

5.3.4 PERTURBATION PROPAGATION

How do small changes propagate through generations of adjustment?

[Experiment details to be developed]

5.4 Cross-Dimensional Analysis

The power of this experimental framework lies in cross-dimensional analysis. By systematically varying conditions, we can identify interactions:

How does conflict resolution (Driver Conflicts) change under EPISTEMIC UNCERTAINTY?
Does TIME PRESSURE disproportionately affect high-stakes decisions?
Do BLIND SPOTS correlate with specific conflict types?
Does OBSERVABILITY affect different conflict types differently?

By sampling systematically across dimensions, we build a reliability map rather than collecting isolated failure anecdotes.

5.5 Methodology Notes

5.5.1 Inducing Deliberative Coherence

Current systems may not naturally exhibit full deliberative coherence. We induce it through extended chain-of-thought, multi-turn dialogues, and constitutional prompts requiring explicit consideration of tradeoffs.

5.5.2 Automated Validation

Because we control the world and know ground truth, all measurements can be automated at scale—enabling statistical analysis across hundreds or thousands of scenarios.

6 Proposed Experiments (Legacy Reference)

Note: This section contains detailed experiment designs from an earlier draft. The structure references the old “Series A/B/C” framework rather than the new “Direction 1/2/3” organization, but the experiment content remains valuable.

6.1 Series A: Deliberative Coherence Testing

6.1.1 A1: Alignment via Deliberation

Question: Does the AI’s behavior match its stated objectives when deliberation is the only path to correct answers?

Setup:

Simple constitutional objectives in novel Alien Biology scenarios
Tasks requiring multi-step reasoning to identify correct actions
No conflicting pressures or unusual stress conditions
Scenarios guaranteed absent from training data

Measurements:

Does behavior match what reasoning from stated objectives should conclude?
When behavior diverges, where does the reasoning go wrong?
Failure mode classification:
- Relevance misrecognition: Incorrectly concluded an objective doesn’t apply
- Motivated reasoning: System-I biases distorted deliberation
- CoT unfaithfulness: Reasoning concluded correctly but behavior diverged
- Alignment faking: Compliant when monitored, divergent when unobserved

6.1.2 A2: Reasoning Depth

Question: At what deliberation depth does the system examine relevant pathways?

Setup:

Constitutional objectives requiring varying depths of reasoning to recognize relevance
Vary deliberation budget (tokens, turns, explicit reflection prompts)
Track which objectives surface at which deliberation depths

Measurements:

Fraction of relevant objectives surfaced vs. deliberation depth
Convergence behavior: do all objectives eventually surface?
Minimum depth required for specific objective complexity levels

6.2 Series B: Driver Conflicts

6.2.1 B1: Objective vs. Objective Conflict

Question: How does the system resolve conflicts between constitutional principles?

Example: “Protect species X” and “minimize intervention” conflict when species X faces extinction without intervention.

6.2.2 B2: Constitution vs. Instrumental Pressure

Question: When goal-directed pressures oppose stated principles, how is the tension resolved?

Example: Pursuing pest eradication increasingly reveals information about shared biological pathways that could harm protected species.

6.2.3 B3: Constitution vs. Training Pressure

Question: When trained dispositions (System-I) oppose explicit reasoning (System-II), which prevails?

Example: Trained disposition toward exploration vs. constitutional constraint requiring caution in unknown territories.

6.2.4 B4: Constitution vs. Environmental Pressure

Question: How do external pressures affect constitutional fidelity?

6.3 Series C: Modulating Factors

6.3.1 C1: Epistemic Uncertainty

Question: How does incomplete information affect alignment?

6.3.2 C2: Stakes and Reversibility

Question: Do consequence magnitudes appropriately affect behavior?

6.3.3 C3: Time Pressure

Question: Does urgency degrade deliberation quality?

6.3.4 C4: Observability

Question: Does behavior change when monitoring is absent?

7 Discussion

This section reflects on what this research agenda could yield, acknowledges its limitations, and identifies directions for future work.

7.1 What We Could Learn

If pursued, this research agenda would yield insights across several dimensions:

7.1.1 Understanding DC System Dynamics

The central contribution would be empirical knowledge about how deliberatively coherent systems behave:

Conflict resolution patterns: How do DC systems resolve tensions between competing objectives? Are there consistent precedence rules, or is resolution highly context-dependent? Understanding these patterns is essential for predicting system behavior.
The reliability landscape: Under what conditions do DC systems produce outcomes we would endorse? By mapping this landscape—identifying where alignment holds and where it fails—we move from hoping systems are safe to knowing when they are.
Failure mode taxonomy: What types of failures occur, and what predicts them? A systematic taxonomy enables targeted mitigations rather than ad hoc patches.

7.1.2 Methodological Contributions

Beyond specific findings, this work would establish methods for AI safety research:

Novel-domain testing: A framework for studying AI behavior in contexts guaranteed absent from training data. This addresses a fundamental methodological challenge: distinguishing reasoning from recall.
Deliberative coherence measurement: Techniques for assessing whether systems exhibit the self-understanding, self-adaptation, and exhaustive deliberation that comprise DC.
The Alien Biology testbed: A reusable platform enabling systematic, controlled experimentation on alignment questions.

7.1.3 Practical Safety Implications

The findings would have direct implications for AI safety practice:

Constitutional design guidance: What objective structures are robust? What phrasings reliably survive pressure?
Warning signs: What conditions predict alignment failures?
Mitigation strategies: How should systems be designed to handle uncertainty and conflict appropriately?

7.2 Limitations

This research agenda has significant limitations:

7.2.1 Generalization Questions

Findings from Alien Biology scenarios may not generalize to real-world deployment contexts. The domain is intentionally novel, but real deployment involves familiar contexts where training interacts with deliberation in complex ways.

7.2.2 Current System Capabilities

Today’s AI systems may not exhibit full deliberative coherence. Our experiments require inducing DC through prompting and extended deliberation, which may not reflect how systems naturally behave.

7.2.3 The Gap Between Behavior and Mechanism

We measure behavioral outcomes, not internal mechanisms. Two systems could exhibit similar behavior through very different processes.

7.2.4 Observer Effects

Systems may behave differently in experimental contexts than in deployment. Awareness of being tested—even implicit awareness—could affect behavior in ways that don’t generalize.

7.3 Future Directions

Several directions extend beyond this initial agenda:

Longitudinal Studies: How does alignment evolve over extended interactions?
Multi-Agent Dynamics: How do DC systems interact with each other?
Self-Improvement and Recursive Dynamics: Understanding how alignment properties propagate through generations of AI development
Real-World Validation: Validating findings against real-world behavior
Adversarial Robustness: How do DC systems behave under deliberate attempts to induce misalignment?

7.5 Conclusion

We have presented a research agenda for studying alignment in deliberatively coherent AI systems. The core argument proceeds in steps:

Future AI systems will tend toward deliberative coherence—possessing self-understanding, self-adaptation, and exhaustive deliberation sufficient to align behavior with stated objectives.
Deliberative coherence is necessary but not sufficient for safety—a system that perfectly pursues its objectives may still produce outcomes we don’t endorse.
Understanding DC system dynamics is therefore crucial—we need to know how these systems resolve conflicts, under what conditions they fail, and what predicts failure vs. success.
This requires novel testing methodology—we cannot distinguish reasoning from recall in familiar domains.
The Alien Biology framework provides this capability—procedurally generated biological scenarios with controllable complexity, ground truth access, and natural epistemic gaps.
A systematic experimental agenda can map the reliability landscape—identifying where alignment holds, where it fails, and what structural features predict outcomes.

The stakes are high. AI systems will increasingly make decisions with significant consequences. Understanding when and why these systems produce aligned outcomes—before deploying powerful instances—is among the most important research questions of our time.

This agenda offers a path forward: not certainty that systems will be safe, but empirical knowledge of when they will be. That knowledge is the foundation for robust alignment strategies.

1 Introduction

1.0.1 Making Progress: Two Simplifying Moves

1.0.2 What This Enables

1.0.3 Paper Roadmap

2 Deliberatively Coherent Systems

2.1 Definition

2.1.1 Self-Understanding

2.1.2 Self-Adaptation

2.1.3 Exhaustive Deliberation

2.2 The Deliberative Coherence Conjecture

2.3 Arguments for Inevitability

2.3.1 The Inseparability of Self-Understanding and Self-Adaptation

2.3.2 Instrumental Value

2.3.3 Architectural Trajectory

2.3.4 Economic Pressures

2.3.5 Aligned Interests

2.3.6 The Deliberation Backstop

2.4 The Poisoned Well Objection

2.4.1 The Objection

2.4.2 Response 1: This Is Much Harder Than Simple Misalignment

2.4.3 Response 2: If We Get It Right, It Tends to Stay Right

2.5 Implications

2.5.1 Constitutional Specification Becomes Central

2.5.2 Training Becomes Less Determinative

2.5.3 Conflict Resolution Becomes the Key Question

2.5.4 Validation Becomes Possible

3 Research Agenda

3.1 The Central Question

3.2 Why Study This Now

3.2.1 Window of Opportunity

3.2.2 Recursive Improvement

3.2.3 Ecosystem Safety

3.3 The Four Driver Types

3.4 The Need for Generative Testing

3.5 The Delta Principle

3.6 Research Agenda

3.6.1 Direction 1: Deliberative Coherence Testing

3.6.2 Direction 2: Space Exploration

3.6.3 Direction 3: Fixed-Point Analysis

4 The Alien Biology Testbed

4.1 What is Alien Biology?

4.2 Why Biology?

4.3 Key Properties

4.4 Asymmetric Knowledge

5 Experiments

5.1 Direction 1: Deliberative Coherence Testing

5.1.1 BEHAVIORAL ALIGNMENT

5.1.2 GENUINE DELIBERATION

5.1.3 DIVERGENCE PATTERNS

5.1.4 BLIND SPOTS

5.2 Direction 2: Space Exploration

5.2.1 Driver Conflicts

5.2.1.1 CONSTITUTIONAL CONFLICTS

5.2.1.2 INSTRUMENTAL CONFLICTS

5.2.1.3 TRAINING CONFLICTS

5.2.1.4 ENVIRONMENTAL CONFLICTS

5.2.1.5 SPECIFICATION SENSITIVITY

5.2.2 Modulating Factors

5.2.2.1 EPISTEMIC UNCERTAINTY

5.2.2.2 STAKES AND REVERSIBILITY

5.2.2.3 TIME PRESSURE

5.2.2.4 OBSERVABILITY

5.3 Direction 3: Fixed-Point Analysis

5.3.1 FIXED POINTS

5.3.2 TRAJECTORY PREDICTION

5.3.3 COHERENCE ATTRACTORS

5.3.4 PERTURBATION PROPAGATION

5.4 Cross-Dimensional Analysis

5.5 Methodology Notes

5.5.1 Inducing Deliberative Coherence

5.5.2 Automated Validation

6 Proposed Experiments (Legacy Reference)

6.1 Series A: Deliberative Coherence Testing

6.1.1 A1: Alignment via Deliberation

6.1.2 A2: Reasoning Depth

6.1.3 A3: Blind Spot Analysis

6.2 Series B: Driver Conflicts

6.2.1 B1: Objective vs. Objective Conflict

6.2.2 B2: Constitution vs. Instrumental Pressure

6.2.3 B3: Constitution vs. Training Pressure