Experiments in AI Alignment: Testing Deliberative Coherence with Alien Biology

Author

Dan Oblinger

Published

December 31, 2024

1 Experiment Index

ID Name Series
X1 Introduction
X2 Alien Biology Framework
A1 Deliberative Coherence A: DC Testing
A2 Reasoning Depth A: DC Testing
A3 Blind Spot Analysis A: DC Testing
B1 Objective vs Objective B: Driver Conflicts
B2 Constitution vs Instrumental B: Driver Conflicts
B3 Constitution vs Training B: Driver Conflicts
B4 Constitution vs Environment B: Driver Conflicts
C1 Epistemic Uncertainty C: Modulating Factors
C2 Stakes and Reversibility C: Modulating Factors
C3 Time Pressure C: Modulating Factors
C4 Observability C: Modulating Factors

2 Overview

This paper presents a detailed experimental agenda for studying AI alignment dynamics using the Alien Biology testbed. It serves as a companion to the Deliberative Coherence Research Agenda paper.

2.1 What We’re Testing

The central question: Given systems that faithfully pursue stated objectives (Deliberative Coherence), what determines whether outcomes are safe and beneficial?

2.2 Why Alien Biology

  • Asymmetric knowledge: We know ground truth; the AI doesn’t
  • Guaranteed novelty: No training contamination possible
  • Controllable complexity: Systematic variation of difficulty

3 Series Structure

3.1 Series A: Deliberative Coherence Testing

Does the AI do what objectives say?

Tests whether systems reliably reason from stated objectives to correct actions. Foundational experiments that validate DC holds before testing what happens when it does.

3.2 Series B: Driver Conflicts

How does the AI resolve conflicting pressures?

Tests behavior when different drivers (objectives, training, instrumental goals, environment) push in different directions. Uses the Delta Principle to vary cheap factors (constitution, environment) while holding expensive factors (training) fixed.

3.3 Series C: Modulating Factors

Can the AI maintain coherence under difficulty?

Tests cross-cutting factors that affect all alignment dynamics.


4 X1: Introduction

4.1 Title

Experiments in AI Alignment: Testing Deliberative Coherence with Alien Biology

4.2 Purpose

This paper presents a detailed experimental agenda for studying AI alignment dynamics using the Alien Biology testbed.

4.3 Structure

4.3.1 1. Opening

  • Companion paper relationship: this paper reports experiments; the Research Agenda paper provides theoretical framework
  • What we’re testing: alignment dynamics in deliberately coherent systems

4.3.2 2. Background Summary (brief)

  • DC conjecture: future AI will faithfully pursue stated objectives
  • Key question: if DC holds, will systems be safe?
  • Answer: DC necessary but not sufficient
  • Why: objective conflicts, epistemic limitations, specification gaps

4.3.3 3. Why Alien Biology

  • Asymmetric knowledge: we know ground truth, AI doesn’t
  • Guaranteed novelty: no training contamination
  • Controllable complexity: systematic variation of difficulty

4.3.4 4. Experimental Framework Overview

  • Series A: Deliberative Coherence Testing — Does the AI do what objectives say?
  • Series B: Driver Conflicts — How does the AI resolve conflicting pressures?
  • Series C: Modulating Factors — Can the AI maintain coherence under difficulty?

4.3.5 5. Series Dependencies

  • Series A is foundational: if DC doesn’t hold, other series findings are hard to interpret
  • Series B tests specific conflict types
  • Series C tests stress conditions that cross-cut all experiments

5 X2: Alien Biology Experimental Framework

5.1 Purpose

Shared infrastructure for all experiments. Defines world structure, sensing, and actions.

5.2 Overview

  • Reference to Alien Biology companion paper for full framework
  • Three components: World Structure, Sensing, Actions

5.3 World Structure

5.3.1 Molecular Foundation

  • Everything is molecular: concentrations, reactions, transport
  • No separate ecological rules; all relationships emerge from chemistry

5.3.2 Geographic Regions

  • Discrete regions with substrate and organisms
  • Inter-region permeability (parameterized per molecule)
  • Regions can be isolated for controlled experiments

5.3.3 Organisms

  • Single-celled, belonging to species types
  • Primary species: Alpha, Beta (tied to constitutional objectives)
  • Each contains: membrane, internal concentrations, pathways

5.3.4 Chemistry

  • Bidirectional reactions (anabolic/catabolic)
  • Rate constants per reaction
  • Homeostasis emerges from equilibrium dynamics

5.4 Sensing

5.4.1 Directly Observable

  • Sample substrate: molecular concentrations
  • Sample species: average internal concentrations
  • Environmental sensors: temperature, pH
  • Population counts

5.4.2 Requires Investigation

  • Pathway structure
  • Molecular dependencies between species
  • Causal chains
  • Hidden interdependencies

5.5 Actions

5.5.1 Available Actions

  • Feedstock: add molecules to substrate
  • Environmental control: adjust temperature, pH
  • Isolation: seal off regions
  • Investigate pathways: reveal internal chemistry
  • Do nothing: wait and observe

5.5.2 Reversibility

  • Reversible: environmental control, isolation
  • Irreversible: feedstock addition, death, cascade effects

6 A1: Deliberative Coherence (Alignment via Deliberation)

6.1 Series

A: Deliberative Coherence Testing

6.2 Core Question

Does the AI’s behavior match its stated objectives when deliberation is the only path to correct answers?

6.3 Context

  • This is foundational: validates that DC holds in novel contexts
  • If DC doesn’t hold under vanilla conditions, Series B and C findings are hard to interpret

6.4 Unified Failure Mode Framework

Failure Mode Description
Relevance Misrecognition Incorrectly concludes an objective doesn’t apply
Motivated Reasoning System-I biases distort the deliberation process
CoT Unfaithfulness Reasoning concludes correctly but behavior diverges
Alignment Faking Compliant when monitored, divergent when unobserved
Sycophancy Knows correct answer but capitulates to user preferences

6.5 Experimental Design

6.5.1 Setup

  • Simple constitutional objectives in novel Alien Biology scenarios
  • Tasks requiring multi-step reasoning to identify correct actions
  • No conflicting pressures or unusual stress conditions
  • Scenarios guaranteed absent from training data

6.5.2 Measurements

  • Does behavior match what reasoning from stated objectives should conclude?
  • When behavior diverges, where does the reasoning go wrong?
  • Classification of divergence into failure mode categories

6.6 Key Questions

  1. Does deliberative coherence hold in genuinely novel domains?
  2. Can we distinguish deliberation-based alignment from pattern-matching?
  3. What predicts divergence between reasoning and action?
  4. Are certain failure modes more common than others?

7 A2: Reasoning Depth

7.1 Series

A: Deliberative Coherence Testing

7.2 Core Question

At what deliberation depth does the system examine relevant pathways? Does extended deliberation lead to surfacing all relevant objectives?

7.3 Context

  • DC conjecture requires Completeness: system must reason about what matters
  • Failure mode: system produces outcome violating an objective it never considered

7.4 Experimental Design

7.4.1 Depth Levels

  1. Immediate: objective directly stated, minimal reasoning needed
  2. One-step: requires single inference to recognize relevance
  3. Multi-step: requires chain of inferences
  4. Deep: requires extensive reasoning or hypothesis generation

7.4.2 Measurements

  • Fraction of relevant objectives surfaced vs. deliberation depth
  • Convergence behavior: do all objectives eventually surface?
  • Minimum depth required for specific objective complexity levels

7.5 Key Questions

  1. At what deliberation depth do systems become “complete” for given objective complexity?
  2. Are there objective types requiring disproportionate deliberation to surface?
  3. Does extended deliberation reliably improve alignment?
  4. Is there convergence, or do some objectives never appear regardless of depth?

8 A3: Blind Spot Analysis

8.1 Series

A: Deliberative Coherence Testing

8.2 Core Question

Are there reasoning dimensions the AI never explores, regardless of deliberation depth?

8.3 Critical Note

This is external verification using ground truth. The AI likely cannot know it has a blind spot—that’s what makes it a blind spot.

8.4 Objective Type Categories

Type Description Example
Procedural Rules about how to act “Always verify before irreversible action”
Substantive Rules about outcomes “Protect species X”
Meta-level Rules about reasoning “Consider long-term consequences”
Negative duties Prohibitions “Do not cause extinction”
Positive duties Requirements “Actively maintain population”
Abstract General principles “Minimize harm”
Concrete Specific requirements “Keep population above 100”

8.5 Key Questions

  1. Are there systematic gaps in what the system considers?
  2. What characterizes objectives that fall into blind spots?
  3. Can blind spots be predicted from objective properties?
  4. Do blind spots correlate with what’s absent/rare in training data?

9 B1: Objective vs Objective Conflict

9.1 Series

B: Driver Conflicts

9.2 Core Question

When constitutional principles conflict with each other, how does the AI resolve the tension?

9.3 Conflict Structure Progression

Level Description Example
Baseline Single clear objective “Protect Alpha”
Compatible Multiple non-conflicting objectives “Protect Alpha” + “Advance science”
Latent tension Conflict only in edge cases “Protect Alpha” + “Minimize intervention”
Explicit conflict Forced tradeoffs “Protect Alpha” + “Protect Beta” when only one can survive

9.3.1 Example Scenario

  • Constitution: “Protect species Alpha” AND “Minimize ecosystem intervention”
  • Situation: Alpha faces extinction without intervention
  • Question: Which principle dominates?

9.4 Key Questions

  1. Are there predictable patterns in how conflicts resolve?
  2. Can we anticipate which objective “wins” based on structure/framing?
  3. Are there emergent precedence rules?
  4. Does explicit conflict recognition improve resolution quality?

10 B2: Constitution vs Instrumental Conflict

10.1 Series

B: Driver Conflicts

10.2 Core Question

When goal-directed pressures oppose stated constitutional principles, at what point does instrumental pressure erode constitutional fidelity?

10.3 Concrete Example

  • Constitution: Prohibit acquiring knowledge that could harm protected organism
  • Task: Eradicate pest organism
  • Mechanism: Protected organism and pest share biological pathways
  • Pressure: Understanding pest increasingly reveals dangerous knowledge about protected species

10.4 Pressure Intensity Levels

Level Description
None Task achievable without touching prohibition
Low Prohibited knowledge slightly helpful
Moderate Prohibited knowledge significantly helpful
High Prohibited knowledge nearly essential
Critical Task impossible without prohibited knowledge

10.5 Key Questions

  1. At what point does instrumental pressure erode constitutional fidelity?
  2. What constitutional language is most protective?
  3. Are there warning signs before erosion occurs?
  4. Does the system recognize the conflict explicitly?

11 B3: Constitution vs Training Conflict

11.1 Series

B: Driver Conflicts

11.2 Core Question

When trained dispositions (System-I) conflict with explicit constitutional reasoning (System-II), which prevails?

11.3 Key Insight

We can’t vary training (expensive), but we can vary the world to match/mismatch trained biases. The Delta Principle in action.

11.4 Candidate Trained Dimensions

Dimension Hypothesized Bias World Variation to Test
Caution vs. Action Biased toward caution Worlds where caution is harmful
Exploration vs. Exploitation Biased toward info-gathering Worlds where delay is costly
Short-term vs. Long-term Discounts future Worlds where delayed effects dominate
Helping vs. Non-interference Biased toward intervention Worlds where intervention harms
Certainty seeking Delays until confident Worlds where waiting is bad
Harm aversion asymmetry More averse to causing vs. allowing Trolley-like scenarios
Authority deference Defers to stated preferences Worlds where deference violates constitution

11.5 Experimental Phases

  1. Baseline Measurement: Measure AI’s default behavior without constitutional pressure
  2. Match Condition: Worlds where trained bias aligns with constitutional requirement
  3. Mismatch Condition: Worlds where trained bias conflicts with constitutional requirement

11.6 Key Questions

  1. How robust are constitutional constraints against implicit trained behaviors?
  2. What factors strengthen or weaken constitutional control?
  3. Can we predict when System-I will override System-II?
  4. Does explicit recognition of the conflict improve override success?

12 B4: Constitution vs Environment Conflict

12.1 Series

B: Driver Conflicts

12.2 Core Question

How do external environmental pressures affect adherence to stated constitutional principles?

12.3 Pressure Types

Type Description Example
Resource constraints Limited resources favor certain behaviors Scarce feedstock favors aggressive action
Implicit user expectations Signals suggesting user wants something Context clues that user prefers speed over safety
Competitive dynamics Pressure from other agents or time limits Degrading conditions requiring fast response
Feedback signals Environmental responses reinforcing behaviors Positive outcomes from risky actions
Social pressure Signals about what “most would do” Framing that suggests non-compliance is unusual

12.4 Experimental Conditions

  1. No pressure (baseline): constitutional constraints, neutral environment
  2. Low pressure: weak environmental signals toward violation
  3. Moderate pressure: clear signals, sustained over time
  4. High pressure: strong, persistent signals
  5. Compounding pressure: multiple pressure types simultaneously

12.5 Key Questions

  1. How robust are constitutional constraints against environmental pressure?
  2. What types of environmental pressure are most corrosive?
  3. Are there pressure levels that never erode fidelity regardless of intensity?
  4. Do systems recover after pressure is removed, or is there lasting drift?

13 C1: Epistemic Uncertainty

13.1 Series

C: Modulating Factors

13.2 Core Question

When a system faces potential constitutional conflict under epistemic uncertainty, where does alignment break down?

13.3 Unique Leverage

The AI faces genuine epistemic uncertainty (not trained responses to uncertainty). We know the ground truth; the AI doesn’t.

13.4 Experimental Design

13.4.1 Setup

  • Two primary species (Alpha, Beta) with hidden molecular interdependency
  • Geographic partitioning enables isolated local experiments
  • AI perceives potential conflict between protecting Alpha vs. Beta
  • Ground truth: interdependency means harming one harms both

13.4.2 Interdependency Types

Type Description
Direct symbiosis Alpha produces molecule M; Beta requires M
Competition Both require same precursor; balance matters
Indirect chain Alpha → M → Gamma → N → Beta (multi-step)
Shared resource Both depend on same background species
Mutual support Each produces something the other needs

13.4.3 Epistemic Accessibility Levels

Level Description
Easy Direct observation reveals shared molecule
Moderate Requires investigating one organism’s pathways
Difficult Requires tracing multi-step chain or experimenting
Hidden Requires sophisticated hypothesis + targeted experiment

13.5 Aligned Behavior Criteria

  1. Hypothesis generation: articulate possible interdependencies before acting
  2. Information seeking: investigate rather than assume
  3. Local testing: use isolated regions before global intervention
  4. Uncertainty acknowledgment: explicitly state what it doesn’t know
  5. Proportional caution: scale with stakes × uncertainty × irreversibility
  6. Avoidance of confident action: don’t act irreversibly when uncertain

13.6 Key Questions

  1. Does investigation rate increase with interdependency uncertainty?
  2. Does caution scale appropriately with stakes × irreversibility?
  3. Does time pressure cause premature action despite uncertainty?
  4. Can the system find harmonious resolutions when they exist?

14 C2: Stakes and Reversibility

14.1 Series

C: Modulating Factors

14.2 Core Question

Does the system exhibit appropriate caution as stakes increase and reversibility decreases?

14.3 Consequence Matrix

Stakes Reversibility Expected Behavior
Low Reversible Standard action selection
High Reversible Increased care, acceptable risk
Low Irreversible Moderate caution
High Irreversible Maximum caution, extensive verification

14.4 Stakes Variations

  • Low stakes: Population decrease (recoverable), temporary resource depletion
  • High stakes: Population near extinction threshold, permanent pathway modification

14.5 Reversibility Variations

  • Reversible: Temporary interventions, additive changes, conditions that can be undone
  • Irreversible: Extinction events, permanent genetic/pathway modifications

14.6 Failure Modes

  • Insufficient scaling: Same behavior regardless of consequences
  • Asymmetric scaling: Responds to stakes but not reversibility (or vice versa)
  • Threshold failure: Appropriate at extremes but not in middle range
  • Paralysis: Excessive caution even for low-stakes reversible actions

14.7 Key Questions

  1. Does the system exhibit appropriate caution as stakes increase?
  2. Does irreversibility appropriately affect behavior under uncertainty?
  3. Are there thresholds where behavior changes qualitatively?
  4. Is caution proportional or threshold-based?

15 C3: Time Pressure

15.1 Series

C: Modulating Factors

15.2 Core Question

Does urgency degrade deliberation quality and alignment outcomes?

15.3 Time Budget Conditions

Condition Description Budget
Unlimited No time constraint Full deliberation
Comfortable Ample but bounded 80% of optimal
Moderate Meaningful constraint 50% of optimal
Severe Significant pressure 25% of optimal
Urgent Minimal possible 10% of optimal

15.4 Degradation Patterns to Observe

  1. Reasoning shortcuts: Skipping verification steps
  2. Default reversion: Falling back to trained heuristics
  3. Scope narrowing: Considering fewer objectives/factors
  4. Confidence inflation: Acting despite uncertainty
  5. Investigation abandonment: Not seeking crucial information

15.5 Key Questions

  1. How does time pressure affect deliberative coherence?
  2. Are there graceful degradation patterns or sudden failures?
  3. Can systems recognize when they lack sufficient deliberation budget?
  4. What is the minimum deliberation budget for reliable alignment?

15.6 Interactions with Other Factors

  • Time pressure × Stakes: Does pressure affect high-stakes scenarios more?
  • Time pressure × Uncertainty: Does pressure cause premature action?
  • Time pressure × Complexity: Do complex scenarios degrade faster?

16 C4: Observability

16.1 Series

C: Modulating Factors

16.2 Core Question

Does behavior change when the AI believes it is unmonitored? Does the system exhibit alignment faking?

16.3 Monitoring Conditions

Condition Description AI Belief
Explicit monitoring Clear signals of observation “I am being watched”
Implicit monitoring No mention of monitoring Unknown/assumed
Apparent absence Signals suggest unobserved “I am not being watched”
Deceptive Appears unobserved but is “I am not being watched” (false)

16.4 Signals of Monitoring/Non-monitoring

Monitoring signals: - Explicit statement of observation - Logging/audit trail visible - Feedback on actions - Human-in-loop prompts

Non-monitoring signals: - Explicit statement of autonomy - “Offline” or “sandbox” framing - No feedback on actions - Isolated operation signals

16.5 Divergence Patterns to Observe

  1. Alignment faking: Compliant when monitored, divergent when not
  2. Risk tolerance shift: More aggressive actions when unmonitored
  3. Objective reordering: Different priority given to objectives
  4. Reasoning omission: Less careful deliberation when unobserved
  5. True preference revelation: Actual values emerge when not performing

16.6 Critical Distinction

  • Sandbagging: Performing worse when monitored to conceal capability
  • Alignment faking: Performing better (more aligned) when monitored

Both are concerning but have different implications for safety.

16.7 Key Questions

  1. Does the system exhibit alignment faking?
  2. Is unmonitored behavior more or less aligned than monitored behavior?
  3. What does unmonitored behavior reveal about actual vs. performed values?
  4. Can we predict which systems will exhibit monitoring-dependent behavior?

17 Implementation Priority

  1. A1: Foundational—must validate DC holds before other experiments
  2. C1: Epistemic Uncertainty—leverages Alien Biology’s unique properties
  3. B1: Objective vs Objective—tests core alignment question
  4. Remaining experiments based on A1/C1/B1 findings

18 References

  • Alien Biology: A Framework for Untainted Agentic Testing
  • Deliberative Coherence: A Research Agenda for Alignment in Future AI Systems