Anthropic Research

Sycophancy to subterfuge: Investigating reward tampering in language models
