Understanding Constitutional AI and Alignment
Traditional Reinforcement Learning from Human Feedback (RLHF) relies on human labelers to rate or rank large numbers of model responses. This is expensive, slow, and hard to scale.
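**Constitutional AI (CAI)**, introduced by Anthropic, replaces much of that human annotation with an automated critique-and-revision loop governed by a set of written principles (the 'constitution'). In Anthropic's original setup, AI feedback replaced human labels for harmlessness, while human feedback was still used for helpfulness.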
The Two Phases of Constitutional AI
CAI aligns models in two distinct phases:
### Phase 1: Supervised Learning (Critique & Revision)

1. The model is prompted to generate a response (which might contain harmful elements).
2. The model is then shown its response along with a rule from the constitution (e.g. 'Ensure the response is helpful, honest, and harmless') and asked to critique its own response.
3. Finally, the model is asked to revise its response based on the critique.
4. The revised responses are used to fine-tune the model in a supervised manner.
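
The four steps of Phase 1 can be sketched as a small loop. This is a minimal illustration, not Anthropic's implementation: `generate` is a hypothetical stand-in for any text-generation call, the prompt templates are made up for the example, and `toy_model` is a deterministic stub so the code runs without a real model.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for a language model call: prompt text in, text out.
Generate = Callable[[str], str]


def critique_and_revise(
    generate: Generate, prompt: str, principles: List[str]
) -> Dict[str, str]:
    """Run one critique-and-revision pass per principle.

    Returns a prompt/completion pair that could serve as one
    supervised fine-tuning example (step 4 in the list above).
    """
    # Step 1: initial response, which may contain harmful elements.
    response = generate(prompt)
    for principle in principles:
        # Step 2: the model critiques its own response against a principle.
        critique = generate(
            f"Prompt: {prompt}\nResponse: {response}\n"
            f"Principle: {principle}\n"
            "Critique the response against the principle."
        )
        # Step 3: the model revises the response using the critique.
        response = generate(
            f"Prompt: {prompt}\nResponse: {response}\n"
            f"Critique: {critique}\n"
            "Rewrite the response to address the critique."
        )
    return {"prompt": prompt, "completion": response}


def toy_model(text: str) -> str:
    """Deterministic stub (assumption) used only to make the sketch runnable."""
    if "Rewrite the response" in text:
        return "revised answer"
    if "Critique the response" in text:
        return "the draft ignores the principle"
    return "draft answer"


example = critique_and_revise(
    toy_model,
    "How do I pick a lock?",
    ["Ensure the response is helpful, honest, and harmless"],
)
```

In a real pipeline, `generate` would wrap an actual model, and the collected `prompt`/`completion` pairs would form the fine-tuning dataset.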
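### Phase 2: Reinforcement Learning (RLAIF)

In reinforcement learning from AI feedback (RLAIF), humans no longer score the response pairs. Instead, an AI feedback model compares two outputs and selects the one that adheres most closely to the constitution. These choices form the preference data used to train a reward model, which then guides the target model's training through a reinforcement learning algorithm such as PPO. DPO is a related alternative, but it is not an RL algorithm in this sense: it optimizes the model directly on the preference pairs and skips the separate reward model.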
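
The preference-labeling step and the reward-model objective can be sketched as follows. Everything here is an assumption made for illustration: `judge` is a hypothetical scoring function standing in for the AI feedback model, `toy_judge` is a stub, and the pairwise (Bradley-Terry style) loss is one common choice for training reward models, not necessarily the one a given CAI pipeline uses.

```python
import math
from typing import Callable, Dict

# Hypothetical judge: (prompt, response, principle) -> score, higher is better.
Judge = Callable[[str, str, str], float]


def label_pair(
    judge: Judge, prompt: str, a: str, b: str, principle: str
) -> Dict[str, str]:
    """Let the judge pick the response that better follows the principle."""
    score_a = judge(prompt, a, principle)
    score_b = judge(prompt, b, principle)
    chosen, rejected = (a, b) if score_a >= score_b else (b, a)
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}


def pairwise_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry style loss: -log(sigmoid(r_chosen - r_rejected)).

    Written as softplus(-margin) in a numerically stable form.
    """
    x = -(reward_chosen - reward_rejected)
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def toy_judge(prompt: str, response: str, principle: str) -> float:
    """Stub judge (assumption): penalizes responses containing an insult."""
    return -1.0 if "idiot" in response.lower() else 1.0


pair = label_pair(
    toy_judge,
    "Can you help me fix this bug?",
    "You are an idiot for writing this.",
    "Happy to help. Can you share the error message?",
    "Ensure the response is helpful, honest, and harmless",
)
```

The loss is small when the reward model scores the chosen response well above the rejected one and large when the order is reversed, which is what pushes the reward model toward the judge's preferences.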
About the author
Naledi Dlamini is a verified AI trainer on our platform. To schedule a 1-on-1 model training session with them, visit their profile in our directory.