Inoculating Language Models Against Misalignment
Inoculation prompting can mitigate emergent misalignment but may also create backdoor triggers.
Discovering perplexing prompts that generate poems—then asking the LLM to explain.
What do agents do when they face obstacles to a goal? If the only path to a goal requires misaligned action, will they choose it or accept failure? We build on Anthropic’s work on Agentic Misalignment to investigate these questions in an agentic coding environment.
Some of my experiences while working on mechagogue, a reinforcement learning repository with from-scratch JAX implementations of classic RL algorithms.