Inoculating Language Models Against Misalignment
Inoculation prompting can mitigate emergent misalignment but may also create backdoor triggers.
Discovering perplexing prompts that generate poems—then asking the LLM to explain.
What do agents do when they face obstacles to a goal? If the only path to a goal requires misaligned action, will they choose it or accept failure? We build on Anthropic’s work on Agentic Misalignment to investigate these questions in an agentic coding environment.
Some of my experiences while working on mechagogue, a reinforcement learning repository with from-scratch JAX implementations of classic RL algorithms.