Study Groups will be announced here:
- Luma 🔗 https://luma.com/AI-Communities
- Substack 🔗 https://mltaicommunities.substack.com/
AI Model Behavior Study Group (Series)
Join us for a deep dive into how AI systems learn to behave, what guides their decisions, and how we can make them more aligned with human values. In each session, we'll spend 30 minutes reading a foundational paper or document on AI model behavior, safety, and alignment, followed by 30 minutes of discussion to extract key insights and debate implications.
Whether you're an AI researcher, developer, policy maker, or simply curious about how AI "thinks," this study group offers a structured way to understand the principles, methods, and critiques shaping the future of AI behavior. We'll cover everything from industry specifications to cutting-edge research on interpretability and safety.
Format: Pre-reading + 30 min focused study group reading + 30 min group discussion
Series: Ongoing monthly sessions
Audience: All levels welcome - from curious beginners to experienced practitioners
Begin with these accessible documents that lay out how major AI labs think about model behavior.
OpenAI Model Spec (2024-2025)
A living document defining desired behavior for OpenAI's models, including objectives, rules, and how to handle conflicting instructions.
🔗 https://openai.com/index/introducing-the-model-spec/
- How should AI systems behave 🔗 https://openai.com/index/how-should-ai-systems-behave/
Claude's Constitution (Anthropic, 2026)
Anthropic's transparent explanation of the principles guiding Claude's behavior, including the philosophy behind Constitutional AI.
🔗 https://www.anthropic.com/news/claude-new-constitution (2026)
Older version: 🔗 https://www.anthropic.com/news/claudes-constitution (2023)
- Introducing Bloom: an open source tool for automated behavioral evaluations 🔗 https://www.anthropic.com/research/bloom
Dive into the foundational research on using principles rather than human feedback to train safe AI.
Constitutional AI: Harmlessness from AI Feedback (Bai et al., 2022)
The original paper introducing Constitutional AI: training AI assistants to be harmless while staying helpful, using self-critique and AI-generated feedback in place of human labels for harmful outputs.
🔗 https://arxiv.org/abs/2212.08073
Collective Constitutional AI: Aligning a Language Model with Public Input (Huang et al., 2024)
Extends Constitutional AI by incorporating ~1,000 Americans' input to democratically create principles for AI behavior.
🔗 https://arxiv.org/abs/2406.07814
Understand the technical approaches to making AI systems safe and aligned.
Safe RLHF: Safe Reinforcement Learning from Human Feedback (ICLR 2024)
Decouples helpfulness and harmlessness objectives to prevent AI from being helpful in harmful ways.
🔗 https://arxiv.org/abs/2310.12773
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback (Bai et al., 2022)
Anthropic's foundational work on using RLHF to train helpful and harmless assistants, building on the HHH (Helpful, Honest, Harmless) framing.
🔗 https://arxiv.org/abs/2204.05862
Safety Alignment Should Be Made More Than Just a Few Tokens Deep
Critical analysis showing that current safety alignment often shapes only a model's first few output tokens, which leaves it vulnerable to attacks that get past that opening.
🔗 https://arxiv.org/abs/2406.05946
Apply behavioral science methods to understand how AI systems actually behave.
AI Agent Behavioral Science (2025)
Comprehensive framework for systematically studying AI agent behavior using behavioral science methods.
🔗 https://arxiv.org/abs/2506.06366
Simulating Human Behavior with AI Agents (Stanford HAI)
Creating AI agents that replicate real people's survey responses and behaviors for social science research.
🔗 https://hai.stanford.edu/policy/simulating-human-behavior-with-ai-agents
Learn how researchers are opening the "black box" to understand AI decision-making.
Mechanistic Interpretability for AI Safety -- A Review (2024)
Comprehensive survey of techniques for reverse-engineering neural networks to understand their internal mechanisms.
🔗 https://arxiv.org/abs/2404.14082
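Mapping the Mind of a Large Language Model (Anthropic, 2024)
Anthropic's report on identifying millions of interpretable features inside Claude 3 Sonnet using dictionary learning.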
🔗 https://www.anthropic.com/research/mapping-mind-language-model
Open Problems in Mechanistic Interpretability (2025)
Current challenges and open questions in the field of mechanistic interpretability.
🔗 https://arxiv.org/abs/2501.16496
Balance optimism with critical analysis of current approaches.
On 'Constitutional' AI (The Digital Constitutionalist, March 2025)
Critical examination of whether Constitutional AI truly deserves the "constitutional" label from a political theory perspective.
🔗 https://digi-con.org/on-constitutional-ai/
Helpful, Harmless, Honest? Sociotechnical Limits of AI Alignment and Safety Through RLHF
Critical analysis questioning the HHH principle and examining limitations of RLHF approaches.
🔗 https://link.springer.com/article/10.1007/s10676-025-09837-2
- Be.FM Model for Behavioral Forecasting (U-M) - AI that predicts human behavior patterns
- AI-Based Modeling: Techniques, Applications and Research Issues - Broad overview of AI modeling approaches
- Anthropic Research Blog - https://www.anthropic.com/research
- OpenAI Safety Research - https://openai.com/research/safety
- Alignment Forum - https://alignmentforum.org
After each reading, consider:
- Core Insight: What's the main problem this work is trying to solve?
- Methods: What's novel about their approach? What are the tradeoffs?
- Implications: How might this affect AI systems we interact with daily?
- Critiques: What are the limitations or potential issues with this approach?
- Future Directions: What questions does this raise? What should come next?
For Individual Study:
Follow the recommended reading order, starting with accessible specs before diving into technical papers.
For Study Groups:
Assign one reading per session. Participants pre-read and/or read for 30 minutes live, then discuss insights for 30 minutes.
For Researchers:
Use sections 2-6 as a literature review foundation, with section 1 providing practical context.
For Policy Makers:
Focus on sections 1, 2, and 6 to understand both industry practices and critical perspectives.
Last Updated: January 2026
This is a living document - contributions and suggestions welcome!