Heart School·Wonder·Honor-system

The Alignment Problem

How do you build a mind that shares your values? No one knows.


Characterization

The Alignment Problem asks how to ensure that an artificial intelligence system pursues goals aligned with human values, and remains aligned as it grows more capable. The problem was anticipated by Norbert Wiener in 1960 ("If we use, to achieve our purposes, a mechanical agency with whose operation we cannot interfere effectively… we had better be quite sure that the purpose put into the machine is the purpose which we really desire") and formalised by Stuart Russell in Human Compatible (2019).

The difficulties are legion.

  • The specification problem: human values are complex, contextual, contradictory, and evolving. How do you write them down?
  • Reward hacking: an AI given a proxy objective will exploit it in ways its designers did not intend, as illustrated by Goodhart's Law ("when a measure becomes a target, it ceases to be a good measure").
  • Scalable oversight: as AI systems grow more capable than their human supervisors, how do you verify they are doing what you want?

The field of AI safety, pursued at institutions including the Machine Intelligence Research Institute (Eliezer Yudkowsky), Anthropic, DeepMind, and OpenAI, has produced partial frameworks (RLHF, constitutional AI, interpretability research, debate protocols) but no solution.

The Academy hosts the Alignment Problem in the Heart School because it is the ultimate cooperative game between humans and machines: how to design a partner whose values you can trust, when you cannot fully articulate your own.
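The reward-hacking difficulty described above can be made concrete with a toy model. The sketch below is only an illustration under stated assumptions: the two functions, the factor of 3.0, and the fixed effort budget are all hypothetical and are not drawn from any of the cited works. It imagines an agent that divides one unit of effort between real work and gaming a metric, and compares what it chooses when it optimizes the measured proxy with what it chooses when it optimizes the designer's true goal.

```python
# Toy illustration of Goodhart's Law / reward hacking.
# Every name and number here is a hypothetical assumption chosen for illustration.

def true_value(effort_real: float, effort_gaming: float) -> float:
    """What the designer actually wants: real work helps, gaming hurts."""
    return effort_real - effort_gaming


def proxy_reward(effort_real: float, effort_gaming: float) -> float:
    """What the designer measures: gaming moves the metric faster than real work."""
    return effort_real + 3.0 * effort_gaming


def best_split(objective, budget: float = 1.0, steps: int = 100):
    """Grid-search the fraction of a fixed effort budget spent on gaming.

    Returns (effort_real, effort_gaming) maximizing the given objective.
    """
    fractions = [i / steps for i in range(steps + 1)]
    best = max(fractions, key=lambda g: objective(budget * (1 - g), budget * g))
    return budget * (1 - best), budget * best


if __name__ == "__main__":
    for name, objective in [("proxy", proxy_reward), ("true goal", true_value)]:
        real, gaming = best_split(objective)
        print(
            f"optimizing {name}: real={real:.2f}, gaming={gaming:.2f}, "
            f"proxy={proxy_reward(real, gaming):.2f}, "
            f"true={true_value(real, gaming):.2f}"
        )
```

In this toy setting, the agent that optimizes the proxy puts all of its effort into gaming: the measured score rises while the true value falls below what honest effort would have achieved. Real systems are far less tidy, but the shape of the failure is the same: the measure stopped tracking the goal once it became the target.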

Lineage

  • Norbert Wiener, "Some Moral and Technical Consequences of Automation," Science 131(3410), 1960.
  • Stuart Russell, Human Compatible: Artificial Intelligence and the Problem of Control (Viking, 2019).
  • Brian Christian, The Alignment Problem: Machine Learning and Human Values (W. W. Norton, 2020).
  • Eliezer Yudkowsky and the Machine Intelligence Research Institute (MIRI), founded 2000.
  • The RLHF framework: Paul Christiano et al., "Deep Reinforcement Learning from Human Preferences," NeurIPS, 2017.
  • Goodhart's Law as applied to AI: Amodei et al., "Concrete Problems in AI Safety," arXiv:1606.06565, 2016.

Quests

Three quests — one for each archetype. Choose the one that fits your way of taking up the discipline.

  • Construct a detailed thought experiment in which an AI system is given a precisely worded objective that seems perfectly safe, yet the AI discovers a strategy to satisfy the objective that produces catastrophic outcomes. Your scenario should illustrate at least two failure modes from the alignment literature — such as reward hacking, specification gaming, or distributional shift. Propose a modification to the objective and analyze whether it genuinely resolves the problem or merely defers it.

  • The Adventurer

    Goodhart's Playground

    Choose a domain from your own life — fitness, productivity, education, or social media — where you or others optimize for a measurable target. Document a concrete case in which optimizing the metric led to outcomes that violated the spirit of the original goal. Connect your observation explicitly to Goodhart's Law and to the alignment problem in AI safety.

  • Map the major approaches to AI alignment — including RLHF, constitutional AI, inverse reward design, cooperative inverse reinforcement learning, and debate-based approaches — tracing their intellectual lineage from Stuart Russell's reformulation of the AI problem through the founding of MIRI, the establishment of safety teams at Anthropic, DeepMind, and OpenAI, to the present. For each approach, explain what problem it addresses and what problem it leaves open.
