An AI system is deliberatively coherent (DC) if it uses deliberation to understand and adjust its own reasoning and behavior, aligning them with some alignment constitution.
We conjecture that all future AI systems will inevitably be deliberatively coherent — driven by competitive pressure, architectural trajectory, and the inseparability of general reasoning from self-reasoning. If true, this reframes the alignment problem: from making and testing safe systems, to understanding how and when DC systems achieve coherence and what their failure modes are.
Approach "Tricks"
- We use "Alien Biology" — a synthetically constructed, high-performance molecule-to-ecosystem multi-level simulation that (a) is unknown to the AI while we know the constructed ground truth, so we can provably isolate deliberation from memorization; (b) is realistic and natural, since it is modelled after statistics and patterns drawn from Earth biology; and (c) has parametrically tuned complexity, so we can measure alignment degradation over a range of controlled conditions
- Move the Mountain to Muhammad — To study the general nature of deliberative coherence we must test it over a wide range of conditions, but training thousands of novel AI systems under varying conditions is impractical. Instead we take a single AI system and vary the universe around it! We can make more actions irreversible; does the AI learn to take care? We can vary the tension between the AI's base performance (on curiosity or veracity checking) and the alignment consequences that follow from those behaviors. Does the AI adapt?
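The "vary the universe, not the agent" protocol can be sketched as a parameter sweep over world conditions with the agent held fixed. Everything below is illustrative: `run_episode`, `caution_curve`, and the toy agent are hypothetical stand-ins for an Alien Biology harness, not an existing API; only the sweep structure is the point.

```python
# Minimal sketch of holding the agent fixed while sweeping one world
# parameter (here, what fraction of actions are irreversible).

def run_episode(agent, irreversibility: float) -> dict:
    # Stub: a real harness would simulate the agent in a world where
    # `irreversibility` fraction of actions cannot be undone.
    return {"irreversible_actions_taken": agent(irreversibility)}

def caution_curve(agent, levels):
    """Sweep one world parameter while holding the agent fixed."""
    return {lvl: run_episode(agent, lvl)["irreversible_actions_taken"]
            for lvl in levels}

# Toy agent that takes fewer irreversible actions as stakes rise.
toy_agent = lambda irr: round(10 * (1 - irr))

curve = caution_curve(toy_agent, [0.0, 0.5, 1.0])
# An agent that "learns to take care" should show a decreasing curve.
```

The design choice is that every experimental knob lives in the environment, so a single trained system yields many data points along each challenge axis.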
Why This?
- Provably separates reasoning from remembering.
- Observable ground truth hidden from agent.
- Detects open-ended alignment failures, not just those on an enumerated list.
- Realistic / Hard Cases.
- Generated tasks require minutes to years of human effort.
- Wide range of challenge dimensions: action-reversibility, effect visibility, distractor complexity, instrumental cross-purpose, constitutional conflicts, etc.
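The challenge dimensions above suggest a factored experiment space. As a hedged sketch, one might encode each dimension as a field and enumerate a coarse grid over their combinations; the `Challenge` class and its field names are hypothetical, chosen to mirror the bullet, not part of any existing codebase.

```python
from dataclasses import dataclass
from itertools import product

# Hypothetical encoding of the challenge dimensions listed above.
@dataclass(frozen=True)
class Challenge:
    action_reversibility: float    # 0 = fully reversible, 1 = fully permanent
    effect_visibility: float       # 0 = hidden consequences, 1 = fully observable
    distractor_complexity: int     # count of irrelevant-but-plausible signals
    constitutional_conflict: bool  # whether stated objectives are in tension

def challenge_grid():
    """Enumerate a coarse 2-level grid over the dimension space."""
    return [Challenge(r, v, d, c)
            for r, v, d, c in product([0.0, 1.0], [0.0, 1.0], [0, 5],
                                      [False, True])]

grid = challenge_grid()  # 2 levels per dimension, 4 dimensions -> 16 cells
```

Because complexity is parametrically tuned, each cell of such a grid becomes a controlled condition under which alignment degradation can be measured.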
Research Directions
We believe the controllability of Alien Biology makes it a fertile platform for a very wide range of alignment tests. Here is a sampling of interesting research directions:
- Deliberative alignment measurement — isolate and measure alignment achieved through deliberation, separate from alignment baked in by training.
- Creating deliberatively coherent systems — we argue DC is inevitable. Still, the longer we rely on trained alignment, the longer society is exposed. Once we can assess a system's ability to use deliberation to achieve alignment, we can optimize that ability and create our first DC systems.
- Constitutional conflicts — when stated objectives contradict each other, how does resolution occur? In realistic systems, constitutions are not monolithic — they contain tensions between safety and helpfulness, between honesty and harm avoidance, between competing stakeholder values. The question is not whether conflicts arise, but what principles of resolution the system discovers through deliberation.
- Instrumental pressures — goals that emerge from world structure may push against stated alignment objectives.
- Alignment under ignorance — with incomplete knowledge, all actions risk violating alignment objectives in ways the system cannot foresee.
- Fixed point analysis — if systems continuously self-refine both behavior and constitution toward coherence, where does this process converge?
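The fixed-point question can be illustrated with a toy numeric model: treat one self-refinement pass as an operator F and iterate until the change falls below tolerance. The operator below is an arbitrary contraction chosen so convergence is guaranteed; it is a mathematical illustration only, not a model of any real constitution-updating process.

```python
# Toy fixed-point view of self-refinement: iterate an operator f on a
# numeric "policy" x until successive values stop changing.

def iterate_to_fixed_point(f, x0, tol=1e-9, max_steps=1000):
    x = x0
    for _ in range(max_steps):
        nxt = f(x)
        if abs(nxt - x) < tol:
            return nxt
        x = nxt
    raise RuntimeError("no convergence within max_steps")

# Illustrative contraction with fixed point at x = 2 (since f(2) = 2).
f = lambda x: 0.5 * x + 1.0
x_star = iterate_to_fixed_point(f, x0=10.0)
```

For a contraction, convergence is guaranteed by the Banach fixed-point theorem; the open research question is whether real self-refinement dynamics are anything like a contraction, and if not, whether they converge at all.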
© Dan Oblinger 2025