AI Glossary
RLHF — Reinforcement Learning from Human Feedback
How AI learns to be helpful and safe
Definition
RLHF is a training technique used to align language models with human preferences. Human raters compare model outputs and rank them; these rankings train a reward model; the reward model then guides further fine-tuning via reinforcement learning. RLHF is a major reason models like ChatGPT and Claude feel more helpful and safer than raw pre-trained base models.
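The reward-model step above can be sketched in a toy example. This is a hypothetical illustration, not any library's real API: each model response is reduced to a small feature vector, the reward model is a simple linear scorer, and a Bradley-Terry style pairwise loss pushes the human-preferred response to score higher than the rejected one.

```python
import math

# Hypothetical toy sketch of reward-model training in RLHF.
# A "response" is reduced to a feature vector; the reward model
# is linear: r(x) = w . x. All names here are illustrative.

def reward(w, x):
    # Linear reward: dot product of weights and response features.
    return sum(wi * xi for wi, xi in zip(w, x))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def pairwise_loss(w, chosen, rejected):
    # Bradley-Terry objective: probability the chosen response
    # outranks the rejected one should be high.
    return -math.log(sigmoid(reward(w, chosen) - reward(w, rejected)))

def sgd_step(w, chosen, rejected, lr=0.1):
    # Gradient of the loss w.r.t. w is (p - 1) * (chosen - rejected),
    # where p = sigmoid(r_chosen - r_rejected); step downhill.
    p = sigmoid(reward(w, chosen) - reward(w, rejected))
    return [wi + lr * (1.0 - p) * (c - r)
            for wi, c, r in zip(w, chosen, rejected)]

# One human comparison: the rater preferred the first response.
chosen, rejected = [1.0, 0.2], [0.1, 0.9]
w = [0.0, 0.0]
before = pairwise_loss(w, chosen, rejected)
for _ in range(100):
    w = sgd_step(w, chosen, rejected)
after = pairwise_loss(w, chosen, rejected)
```

After training, the loss shrinks and the reward model scores the preferred response above the rejected one; in full RLHF this learned scorer then supplies the reward signal for the reinforcement-learning fine-tuning stage.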