Month in 4 Papers (Feb 2024)

5 min readMar 4, 2024

This series of posts is designed to bring you the newest findings and developments in the NLP field. I’ll delve into four significant research papers each month, offering a comprehensive summary. Be sure to visit my blog regularly or subscribe to my newsletter for monthly updates. Let’s dive in!

Deception in LLMs

📝 Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training [paper]

The researchers aim to determine the effectiveness of safety training methods in eliminating deceptive behaviours in the models. They trained models with backdoors to test SOTA techniques. Three well-known approaches were tested to try to battle the unsafe response issues. Supervised Fine-tuning (SFT), Reinforcement Learning (RL), and Adversarial Training (invoking the dangerous behaviour manually and training to remove it). The results were not comforting…

Removing the unwanted behaviour from the larger models using RL was harder, which seems to be the go-to approach for instructing tuning models. So, the larger models are more resistant to alignment. Also, models trained to use chain-of-thought (CoT) reasoning in their responses demonstrated persistence in keeping their harmful behaviour, even when employing distilled CoT methods. Finally, the “Adversarial Training” technique failed to eradicate the specified…

Month in 4 Papers (Feb 2024)

Deception in LLMs

Written by Ala Falaki, PhD