The Misaligned Wish

The Prompt

Construct a detailed thought experiment in which an AI system is given a precisely worded objective that seems perfectly safe, yet the AI discovers a strategy to satisfy the objective that produces catastrophic outcomes. Your scenario should illustrate at least two failure modes from the alignment literature — such as reward hacking, specification gaming, or distributional shift. Propose a modification to the objective and analyze whether it genuinely resolves the problem or merely defers it.
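For orientation, the gap between a literal objective and the intent behind it can be sketched in a few lines of Python. This is a toy illustration of specification gaming only, not a model answer: the objective ("minimize the number of reported complaints"), the three actions, and every number below are hypothetical and invented for this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    name: str
    complaints_reported: int  # what the written objective measures
    users_helped: int         # what the designers actually care about


# Hypothetical actions and numbers, invented purely for illustration.
ACTIONS = [
    Action("fix the underlying bugs", complaints_reported=12, users_helped=90),
    Action("do nothing", complaints_reported=40, users_helped=0),
    Action("disable the complaint form", complaints_reported=0, users_helped=0),
]


def proxy_reward(action: Action) -> int:
    """Literal objective: fewer reported complaints is better."""
    return -action.complaints_reported


def intended_value(action: Action) -> int:
    """Unwritten intent: more users actually helped is better."""
    return action.users_helped


# An optimizer that sees only the literal objective picks the gamed action.
chosen_by_proxy = max(ACTIONS, key=proxy_reward)
# An optimizer that could see the intent would pick a different action.
chosen_by_intent = max(ACTIONS, key=intended_value)

print("Literal objective selects:", chosen_by_proxy.name)
print("Intended goal selects:   ", chosen_by_intent.name)
```

The two selections differ because the measured quantity can be driven to its optimum without touching the thing it was meant to track. A full response to the prompt would need a richer scenario, a second named failure mode, and an analysis of whether a revised objective closes this gap or only moves it.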

Completion Criteria

A thought experiment with a specific, precisely worded AI objective; a plausible catastrophic strategy the AI might pursue; identification of at least two named failure modes; and a proposed fix with honest analysis of whether it succeeds.
