Glossary
RLHF (Reinforcement Learning from Human Feedback)
RLHF is a training technique that refines a model's behavior using human preference judgments: annotators compare model outputs, a reward model is fit to those comparisons, and the policy is then optimized against the learned reward. It is commonly used to make models more helpful, honest, and harmless, but the quality of the resulting alignment depends on the diversity and accuracy of the human feedback.
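To make the mechanism concrete, below is a minimal sketch of the reward-modeling step only: a small model is fit to pairwise human preferences with a Bradley-Terry loss, and the learned reward would then guide a separate policy-optimization stage (e.g., PPO), which is not shown. The linear reward head and the random "response embeddings" are illustrative assumptions, not any particular implementation.

```python
# Minimal sketch: fitting a reward model to pairwise human preferences
# (the first stage of RLHF). The linear head and toy embeddings below are
# illustrative stand-ins, not a real language-model backbone.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        # Stand-in for an LM backbone with a scalar reward head.
        self.head = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(x).squeeze(-1)

def preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry objective: maximize P(chosen > rejected)
    # = sigmoid(r_chosen - r_rejected).
    return -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()

# Toy embeddings of preferred and dispreferred responses (assumed data).
dim = 16
chosen = torch.randn(8, dim)
rejected = torch.randn(8, dim)

model = RewardModel(dim)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(100):
    loss = preference_loss(model(chosen), model(rejected))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

In a full pipeline, the trained reward model scores new completions, and the language model is updated to increase that score while staying close to its pretrained behavior.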