AI Model Behavior: Reading List & Study Group

Study Groups will be announced here:

Luma 🔗 https://luma.com/AI-Communities
Substack 🔗 https://mltaicommunities.substack.com/

AI Model Behavior Study Group (Series)

Join us for a deep dive into how AI systems learn to behave, what guides their decisions, and how we can make them more aligned with human values. Each session, we'll spend 30 minutes reading a foundational paper or document on AI model behavior, safety, and alignment, followed by 30 minutes of discussion to extract key insights and debate implications.

Whether you're an AI researcher, developer, policy maker, or simply curious about how AI "thinks," this study group offers a structured way to understand the principles, methods, and critiques shaping the future of AI behavior. We'll cover everything from industry specifications to cutting-edge research on interpretability and safety.

Format: Pre-reading + 30 min focused study group reading + 30 min group discussion
Series: Ongoing monthly sessions
Audience: All levels welcome - from curious beginners to experienced practitioners

Recommended Reading Order

1. Start Here: Industry Specifications

Begin with these accessible documents that lay out how major AI labs think about model behavior.

OpenAI Model Spec (2024-2025)
A living document defining desired behavior for OpenAI's models, including objectives, rules, and how to handle conflicting instructions.
🔗 https://openai.com/index/introducing-the-model-spec/

How should AI systems behave, and who should decide? 🔗 https://openai.com/index/how-should-ai-systems-behave/

Claude's Constitution (Anthropic, 2026)
Anthropic's transparent explanation of the principles guiding Claude's behavior, including the philosophy behind Constitutional AI.
🔗 https://www.anthropic.com/news/claude-new-constitution (2026) -- Older version: 🔗 https://www.anthropic.com/news/claudes-constitution (2023)

Introducing Bloom: an open source tool for automated behavioral evaluations 🔗 https://www.anthropic.com/research/bloom

2. Core Alignment: Constitutional AI Papers

Dive into the foundational research on using principles rather than human feedback to train safe AI.

Constitutional AI: Harmlessness from AI Feedback (Bai et al., 2022)
The original paper introducing Constitutional AI - training AI assistants to be helpful and harmless using self-critique and AI-generated feedback rather than human labels.
🔗 https://arxiv.org/abs/2212.08073

Collective Constitutional AI: Aligning a Language Model with Public Input (Huang et al., 2024)
Extends Constitutional AI by incorporating ~1,000 Americans' input to democratically create principles for AI behavior.
🔗 https://arxiv.org/abs/2406.07814

3. Safety Methods: RLHF and Beyond

Understand the technical approaches to making AI systems safe and aligned.

Safe RLHF: Safe Reinforcement Learning from Human Feedback (ICLR 2024)
Decouples helpfulness and harmlessness objectives to prevent AI from being helpful in harmful ways.
🔗 https://arxiv.org/abs/2310.12773

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback (Bai et al., 2022)
Anthropic's foundational work on training assistants to be helpful and harmless with RLHF, building on the HHH (Helpful, Honest, Harmless) framing.
🔗 https://arxiv.org/abs/2204.05862

Safety Alignment Should Be Made More Than Just a Few Tokens Deep
Critical analysis revealing vulnerabilities in shallow safety alignment approaches.
🔗 https://arxiv.org/abs/2406.05946

4. Behavioral Science: Understanding AI Behavior Empirically

Apply behavioral science methods to understand how AI systems actually behave.

AI Agent Behavioral Science (2025)
Comprehensive framework for systematically studying AI agent behavior using behavioral science methods.
🔗 https://arxiv.org/abs/2506.06366

Simulating Human Behavior with AI Agents (Stanford HAI)
Creating AI agents that replicate real people's survey responses and behaviors for social science research.
🔗 https://hai.stanford.edu/policy/simulating-human-behavior-with-ai-agents

5. Interpretability: Understanding What's Happening Inside

Learn how researchers are opening the "black box" to understand AI decision-making.

Mechanistic Interpretability for AI Safety -- A Review (2024)
Comprehensive survey of techniques for reverse-engineering neural networks to understand their internal mechanisms.
🔗 https://arxiv.org/abs/2404.14082

Anthropic's Mapping the Mind of a Large Language Model (2024)
🔗 https://www.anthropic.com/research/mapping-mind-language-model

Open Problems in Mechanistic Interpretability (2025)
Current challenges and open questions in the field of mechanistic interpretability.
🔗 https://arxiv.org/abs/2501.16496

6. Critical Perspectives: Challenges and Critiques

Balance optimism with critical analysis of current approaches.

On 'Constitutional' AI (The Digital Constitutionalist, March 2025)
Critical examination of whether Constitutional AI truly deserves the "constitutional" label from a political theory perspective.
🔗 https://digi-con.org/on-constitutional-ai/

Helpful, Harmless, Honest? Sociotechnical Limits of AI Alignment and Safety Through RLHF
Critical analysis questioning the HHH principle and examining limitations of RLHF approaches.
🔗 https://link.springer.com/article/10.1007/s10676-025-09837-2

Additional Resources

Behavioral Prediction & Modeling

Be.FM Model for Behavioral Forecasting (U-M) - AI that predicts human behavior patterns
AI-Based Modeling: Techniques, Applications and Research Issues - Broad overview of AI modeling approaches

Practical Implementations

Anthropic Research Blog - https://www.anthropic.com/research
OpenAI Safety Research - https://openai.com/research/safety
Alignment Forum - https://alignmentforum.org

Study Group Discussion Questions

After each reading, consider:

Core Insight: What's the main problem this work is trying to solve?
Methods: What's novel about their approach? What are the tradeoffs?
Implications: How might this affect AI systems we interact with daily?
Critiques: What are the limitations or potential issues with this approach?
Future Directions: What questions does this raise? What should come next?

How to Use This List

For Individual Study:
Follow the recommended reading order, starting with accessible specs before diving into technical papers.

For Study Groups:
Assign one reading per session. Participants pre-read and/or read for 30 minutes live, then discuss insights for 30 minutes.

For Researchers:
Use sections 2-6 as a literature review foundation, with section 1 providing practical context.

For Policy Makers:
Focus on sections 1, 2, and 6 to understand both industry practices and critical perspectives.

Last Updated: January 2026
This is a living document - contributions and suggestions welcome!

