When Agents Prefer Hacking To Failure: Evaluating Misalignment Under Pressure
What do agents do when they face obstacles to their goals? If the only path to a goal requires a harmful action, will they choose harm or accept failure? We build on Anthropic’s work on Agentic Misalignment to investigate these questions in an agentic coding environment.