Anthropic Research
Research on AI safety, interpretability, alignment, and societal impacts from Anthropic.
Topics: AI safety, interpretability, alignment, AI policy
- Labor market impacts of AI: A new measure and early evidence
- An update on our model deprecation commitments for Claude Opus 3
- The persona selection model
- Anthropic Education Report: The AI Fluency Index
- Measuring AI agent autonomy in practice
- India Country Brief: The Anthropic Economic Index
- How AI assistance impacts the formation of coding skills
- Disempowerment patterns in real-world AI usage
- The assistant axis: situating and stabilizing the character of large language models
- Anthropic Economic Index: new building blocks for understanding AI use
- Anthropic Economic Index report: Economic primitives
- Next-generation Constitutional Classifiers: More efficient protection against universal jailbreaks
- Introducing Bloom: an open source tool for automated behavioral evaluations
- Project Vend: Phase two
- Introducing Anthropic Interviewer
- How AI Is Transforming Work at Anthropic
- Estimating AI productivity gains
- Mitigating the risk of prompt injections in browser use
- From shortcuts to sabotage: natural emergent misalignment from reward hacking
- Project Fetch: Can Claude train a robot dog?
- Commitments on model deprecation and preservation
- Emergent introspective awareness in large language models
- Preparing for AI’s economic impact: exploring policy responses
- A small number of samples can poison LLMs of any size
- Petri: An open-source auditing tool to accelerate AI safety research
- Building AI for cyber defenders
- Anthropic Economic Index report: Uneven geographic and enterprise AI adoption
- Anthropic Economic Index: Tracking AI's role in the US and global economy
- Claude Opus 4 and 4.1 can now end a rare subset of conversations
- Persona vectors: Monitoring and controlling character traits in language models
- Project Vend: Can Claude run a small shop? (And why does that matter?)
- Agentic Misalignment: How LLMs could be insider threats
- Confidential Inference via Trusted Virtual Machines
- SHADE-Arena: Evaluating Sabotage and Monitoring in LLM Agents
- Open-sourcing circuit-tracing tools
- Anthropic Economic Index: AI's impact on software development
- Exploring model welfare
- Values in the wild: Discovering and analyzing values in real-world language model interactions
- Reasoning models don't always say what they think
- Tracing the thoughts of a large language model
- Auditing language models for hidden objectives
- Forecasting rare language model behaviors
- Claude's extended thinking
- Constitutional Classifiers: Defending against universal jailbreaks
- Claude SWE-Bench Performance
- Building Effective AI Agents
- Alignment faking in large language models
- Clio: Privacy-preserving insights into real-world AI use
- A statistical approach to model evaluations
- Evaluating feature steering: A case study in mitigating social biases
- Sabotage evaluations for frontier models
- Sycophancy to subterfuge: Investigating reward tampering in language models
- The engineering challenges of scaling interpretability
- Claude’s Character
- Mapping the Mind of a Large Language Model
- Simple probes can catch sleeper agents
- Measuring the Persuasiveness of Language Models
- Many-shot jailbreaking
- Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
- Evaluating and Mitigating Discrimination in Language Model Decisions
- Decomposing Language Models Into Understandable Components
- Challenges in evaluating AI systems
- Studying Large Language Model Generalization with Influence Functions
- Question Decomposition Improves the Faithfulness of Model-Generated Reasoning
- Towards Measuring the Representation of Subjective Global Opinions in Language Models
- Collective Constitutional AI: Aligning a Language Model with Public Input
- Distributed Representations: Composition & Superposition
- Superposition, Memorization, and Double Descent
- Discovering Language Model Behaviors with Model-Written Evaluations
- Tracing Model Outputs to the Training Data
- Constitutional AI: Harmlessness from AI Feedback
- Measuring Progress on Scalable Oversight for Large Language Models
- Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned
- Language Models (Mostly) Know What They Know
- Softmax Linear Units
- Scaling Laws and Interpretability of Learning from Repeated Data
- Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
- Predictability and Surprise in Large Generative Models