AI Glossary
RLHF — Reinforcement Learning from Human Feedback
How AI learns to be helpful and safe
Definition
RLHF is a training technique used to align language models with human preferences. Human raters compare model outputs and rank them; these rankings train a reward model; the reward model then guides further fine-tuning via reinforcement learning. RLHF is a major reason models like ChatGPT and Claude feel more helpful and safer than raw pre-trained base models.
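The reward-model step above can be sketched in a toy example. This is a hypothetical illustration, not any library's real API: each model response is reduced to a small feature vector, the reward model is a simple linear scorer, and a Bradley-Terry style pairwise loss pushes the human-preferred response to score higher than the rejected one.

```python
import math

# Hypothetical toy sketch of reward-model training in RLHF.
# A "response" is reduced to a feature vector; the reward model
# is linear: r(x) = w . x. All names here are illustrative.

def reward(w, x):
    # Linear reward: dot product of weights and response features.
    return sum(wi * xi for wi, xi in zip(w, x))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def pairwise_loss(w, chosen, rejected):
    # Bradley-Terry objective: probability the chosen response
    # outranks the rejected one should be high.
    return -math.log(sigmoid(reward(w, chosen) - reward(w, rejected)))

def sgd_step(w, chosen, rejected, lr=0.1):
    # Gradient of the loss w.r.t. w is (p - 1) * (chosen - rejected),
    # where p = sigmoid(r_chosen - r_rejected); step downhill.
    p = sigmoid(reward(w, chosen) - reward(w, rejected))
    return [wi + lr * (1.0 - p) * (c - r)
            for wi, c, r in zip(w, chosen, rejected)]

# One human comparison: the rater preferred the first response.
chosen, rejected = [1.0, 0.2], [0.1, 0.9]
w = [0.0, 0.0]
before = pairwise_loss(w, chosen, rejected)
for _ in range(100):
    w = sgd_step(w, chosen, rejected)
after = pairwise_loss(w, chosen, rejected)
```

After training, the loss shrinks and the reward model scores the preferred response above the rejected one; in full RLHF this learned scorer then supplies the reward signal for the reinforcement-learning fine-tuning stage.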