DPO (Direct Preference Optimization)

A training technique that aligns language models with human preferences without training a separate reward model. DPO directly optimizes the policy on pairs of preferred and dispreferred responses using a simple classification-style loss. It is simpler and more stable to train than RLHF while achieving comparable alignment quality.
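As a minimal sketch of how the objective works: DPO pushes the policy to widen its likelihood margin between the preferred and dispreferred response relative to a frozen reference model. The PyTorch snippet below assumes you have already computed summed per-token log-probabilities for each response under both the policy and the reference model; the function name and the beta value are illustrative, not part of any specific library.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss over a batch of preference pairs.

    Inputs are summed per-token log-probs of each full response;
    beta controls how far the policy may drift from the reference.
    """
    # Implicit rewards: scaled log-ratios of policy to reference model
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Classification-style loss on the reward margin:
    # -log sigmoid(margin), averaged over the batch
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

Because the reference log-probabilities are fixed, the only trainable quantity is the policy's log-probabilities, which is why DPO reduces to supervised-style training on preference pairs rather than reinforcement learning.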

Related terms

RLHF (Reinforcement Learning from Human Feedback)
Alignment
Preference Optimization