Anthropic Research
Research on AI safety, interpretability, alignment, and societal impacts from Anthropic.
Topics: AI safety, interpretability, alignment, AI policy
- Labor market impacts of AI: A new measure and early evidence
- An update on our model deprecation commitments for Claude Opus 3
- The persona selection model
- Anthropic Education Report: The AI Fluency Index
- Measuring AI agent autonomy in practice
- India Country Brief: The Anthropic Economic Index
- How AI assistance impacts the formation of coding skills
- Disempowerment patterns in real-world AI usage
- The assistant axis: situating and stabilizing the character of large language models
- Anthropic Economic Index: new building blocks for understanding AI use
- Anthropic Economic Index report: Economic primitives
- Next-generation Constitutional Classifiers: More efficient protection against universal jailbreaks
- Introducing Bloom: an open source tool for automated behavioral evaluations
- Project Vend: Phase two
- Introducing Anthropic Interviewer
- How AI Is Transforming Work at Anthropic
- Estimating AI productivity gains
- Mitigating the risk of prompt injections in browser use
- From shortcuts to sabotage: natural emergent misalignment from reward hacking
- Project Fetch: Can Claude train a robot dog?
- Commitments on model deprecation and preservation
- Emergent introspective awareness in large language models
- Preparing for AI’s economic impact: exploring policy responses
- A small number of samples can poison LLMs of any size
- Petri: An open-source auditing tool to accelerate AI safety research
- Building AI for cyber defenders
- Anthropic Economic Index report: Uneven geographic and enterprise AI adoption
- Anthropic Economic Index: Tracking AI's role in the US and global economy
- Claude Opus 4 and 4.1 can now end a rare subset of conversations
- Persona vectors: Monitoring and controlling character traits in language models
- Project Vend: Can Claude run a small shop? (And why does that matter?)
- Agentic Misalignment: How LLMs could be insider threats
- Confidential Inference via Trusted Virtual Machines
- SHADE-Arena: Evaluating Sabotage and Monitoring in LLM Agents
- Open-sourcing circuit-tracing tools
- Anthropic Economic Index: AI's impact on software development
- Exploring model welfare
- Values in the wild: Discovering and analyzing values in real-world language model interactions
- Reasoning models don't always say what they think
- Tracing the thoughts of a large language model
- Auditing language models for hidden objectives
- Forecasting rare language model behaviors
- Claude's extended thinking
- Constitutional Classifiers: Defending against universal jailbreaks
- Claude SWE-Bench Performance
- Building Effective AI Agents
- Alignment faking in large language models
- Clio: Privacy-preserving insights into real-world AI use
- A statistical approach to model evaluations
- Evaluating feature steering: A case study in mitigating social biases
- Sabotage evaluations for frontier models
- Sycophancy to subterfuge: Investigating reward tampering in language models
- The engineering challenges of scaling interpretability
- Claude’s Character
- Mapping the Mind of a Large Language Model
- Simple probes can catch sleeper agents
- Measuring the Persuasiveness of Language Models
- Many-shot jailbreaking
- Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
- Evaluating and Mitigating Discrimination in Language Model Decisions
- Decomposing Language Models Into Understandable Components
- Challenges in evaluating AI systems
- Studying Large Language Model Generalization with Influence Functions
- Question Decomposition Improves the Faithfulness of Model-Generated Reasoning
- Towards Measuring the Representation of Subjective Global Opinions in Language Models
- Collective Constitutional AI: Aligning a Language Model with Public Input
- Distributed Representations: Composition & Superposition
- Superposition, Memorization, and Double Descent
- Discovering Language Model Behaviors with Model-Written Evaluations
- Tracing Model Outputs to the Training Data
- Constitutional AI: Harmlessness from AI Feedback
- Measuring Progress on Scalable Oversight for Large Language Models
- Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned
- Language Models (Mostly) Know What They Know
- Softmax Linear Units
- Scaling Laws and Interpretability of Learning from Repeated Data
- Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
- Predictability and Surprise in Large Generative Models