Glossary

What is RLHF?

RLHF — reinforcement learning from human feedback — is a training technique that uses human preferences to steer a model toward responses people actually find helpful and appropriate.


RLHF aligns a language model's behaviour with human preferences. It turns a model that merely predicts plausible text into one that follows instructions, is helpful, and avoids responses people would object to. It is one of the main reasons modern chat models feel cooperative and on-topic rather than like an autocomplete that has read the whole internet.

The process has three stages. First, a pretrained model is fine-tuned on examples of good responses (supervised fine-tuning). Second, humans compare pairs of model outputs and indicate which they prefer; those judgements train a separate reward model that learns to score responses the way people would. Third, reinforcement learning optimises the language model to produce outputs the reward model rates highly, typically with a penalty that keeps it from drifting too far from the fine-tuned starting point. This nudges the model toward the behaviour humans preferred without anyone having to write an explicit rule for every situation.
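The second stage, turning pairwise human judgements into a scoring function, can be sketched in a few lines. The example below is a toy under stated assumptions: the reward model is a plain weighted sum over two hand-made features (`on_topic`, `rude`), and all names (`reward`, `pairwise_preference_loss`, `train_reward_model`, `PAIRS`) are hypothetical. Production reward models are neural networks that score text directly, but a commonly used pairwise loss has the same shape: it penalises the model whenever the preferred response does not outscore the rejected one.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def reward(weights, features):
    """Toy reward model (an assumption): a weighted sum of response features."""
    return sum(w * f for w, f in zip(weights, features))


def pairwise_preference_loss(score_chosen, score_rejected):
    """Bradley-Terry style loss: -log(sigmoid(score_chosen - score_rejected))."""
    margin = score_chosen - score_rejected
    # softplus(-margin), written so large margins do not overflow
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def train_reward_model(pairs, n_features, lr=0.1, epochs=200):
    """Fit the toy reward model to (chosen, rejected) pairs by gradient descent."""
    weights = [0.0] * n_features
    for _ in range(epochs):
        for chosen, rejected in pairs:
            margin = reward(weights, chosen) - reward(weights, rejected)
            # Gradient of the loss w.r.t. the weights is
            # -(1 - sigmoid(margin)) * (chosen - rejected), so step the other way.
            scale = 1.0 - sigmoid(margin)
            for i in range(n_features):
                weights[i] += lr * scale * (chosen[i] - rejected[i])
    return weights


# Hypothetical features per response: [on_topic, rude].
# Each pair is (response the human preferred, response the human rejected).
PAIRS = [
    ([1.0, 0.0], [0.0, 0.0]),  # on-topic beats off-topic
    ([1.0, 0.0], [1.0, 1.0]),  # polite beats rude
    ([0.0, 0.0], [0.0, 1.0]),  # polite beats rude, even when off-topic
]

weights = train_reward_model(PAIRS, n_features=2)
```

After training, the weight on `on_topic` is positive and the weight on `rude` is negative, so the model has picked up both preferences from comparisons alone, without either rule being written down. The third stage would then use scores like these as the reward signal.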

RLHF is primarily the concern of the labs that build foundation models rather than of teams applying them, but it shapes everything downstream. The alignment, refusal behaviour, tone, and instruction-following of the model you call are products of how it was trained with human feedback. Related and increasingly common variants, such as direct preference optimisation (DPO) and feedback generated by AI rather than humans (RLAIF), pursue the same goal of aligning outputs to preferences more cheaply. A few organisations also apply preference-based tuning to specialise a model's behaviour, but it is a heavyweight tool.
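To make the "more cheaply" point concrete, here is a minimal sketch of the per-pair loss used in direct preference optimisation. It skips the separate reward model and the reinforcement learning loop, and works directly from the log-probabilities that the model being trained (the policy) and a frozen reference copy assign to the preferred and rejected responses. The function name, argument names, and the `beta=0.1` default are illustrative assumptions, not any library's API.

```python
import math


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-pair DPO loss from sequence log-probabilities.

    beta (assumed value) controls how strongly the policy is pushed
    away from the reference model.
    """
    # How much more (or less) likely each response became relative to the reference
    chosen_shift = policy_logp_chosen - ref_logp_chosen
    rejected_shift = policy_logp_rejected - ref_logp_rejected
    margin = beta * (chosen_shift - rejected_shift)
    # -log(sigmoid(margin)), written so large margins do not overflow
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Policy identical to the reference: no preference has been learned yet.
baseline = dpo_loss(-5.0, -7.0, -5.0, -7.0)
# Policy has raised the preferred response and lowered the rejected one.
improved = dpo_loss(-4.0, -8.0, -5.0, -7.0)
```

The loss falls as the policy shifts probability toward the preferred response relative to the reference, which is the same pressure the reward model and reinforcement learning stages apply in RLHF, expressed as an ordinary supervised objective.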

RLHF matters because it is the bridge between raw capability and usable behaviour: a model can be brilliant at predicting text and still be useless or harmful without alignment to what humans actually want. Understanding that a model's helpfulness and its guardrails come from this training also explains its limits — RLHF reflects the preferences of the people who provided feedback, can be inconsistent at the edges, and can sometimes make a model overly cautious or sycophantic. Knowing where a model's behaviour comes from is part of using it responsibly.
