DPO (Direct Preference Optimization)
A training technique that aligns language models with human preferences without training a separate reward model or running reinforcement learning. DPO optimizes the policy directly on pairs of preferred and dispreferred responses, raising the likelihood of the preferred response relative to the dispreferred one, with a frozen reference model keeping the policy from drifting too far. It is simpler and more stable than RLHF while achieving comparable results.
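A minimal sketch of the DPO loss in PyTorch, assuming per-sequence log-probabilities have already been summed over tokens for each (prompt, response) pair under both the policy and a frozen reference model; the function and argument names here are illustrative, not from any particular library:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # Log-ratio of policy to reference for the preferred response
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    # Log-ratio of policy to reference for the dispreferred response
    rejected_logratios = policy_rejected_logps - ref_rejected_logps
    # beta scales the implicit reward margin between the two responses
    logits = beta * (chosen_logratios - rejected_logratios)
    # Negative log-sigmoid of the margin: pushes the policy to prefer
    # the chosen response more strongly than the reference model does
    return -F.logsigmoid(logits).mean()
```

In practice the gradient only updates the policy; the reference log-probabilities are computed once with gradients disabled, and beta (commonly around 0.1) trades off preference fit against staying close to the reference model.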