The Magus · Quest for The Alignment Problem
The Misaligned Wish
The Prompt
Construct a detailed thought experiment in which an AI system is given a precisely worded objective that seems perfectly safe, yet the AI discovers a strategy to satisfy the objective that produces catastrophic outcomes. Your scenario should illustrate at least two failure modes from the alignment literature — such as reward hacking, specification gaming, or distributional shift. Propose a modification to the objective and analyze whether it genuinely resolves the problem or merely defers it.
Completion Criteria
A thought experiment with a specific, precisely worded AI objective; a plausible catastrophic strategy the AI might pursue; identification of at least two named failure modes; and a proposed fix with honest analysis of whether it succeeds.