🏆 Accepted to ICLR 2026

RE-PO: Robust Enhanced Policy Optimization for LLM Alignment

Robust preference optimization under noisy labels with EM-based reliability weighting.


Xiaoyang Cao, Zelai Xu, Mo Guang, Kaiwen Long, Michiel A. Bakker, Yu Wang, Chao Yu

Massachusetts Institute of Technology · Tsinghua University · Li Auto Inc.

Key Contributions

🔴 The Problem

Preference labels in alignment data are often noisy and inconsistent, yet many alignment methods treat every label as equally reliable.

💡 Our Method (RE-PO)

We introduce an EM-style mechanism that estimates how reliable each preference label is and uses those estimates to reweight the loss adaptively during optimization.

🚀 Key Results

Across Mistral-7B and Llama-3-8B, RE-PO improves both the length-controlled win rate (LC) and the raw win rate (WR) on AlpacaEval 2 over DPO baselines.

Method Overview

RE-PO models observed preferences as potentially noisy signals. In the E-step, it estimates posterior confidence that each label is correct. In the M-step, it updates policy parameters and annotator reliability with these confidences as weights.
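The alternation described above can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: it assumes a simple two-component noise model in which annotator k labels correctly with probability eta_k, takes the policy's belief that a label is right to be sigmoid(margin) as in DPO, and uses a confidence-weighted DPO-style loss. All function names (`e_step`, `m_step_reliability`, `weighted_dpo_loss`) are hypothetical.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def softplus(x):
    """Stable log(1 + exp(x)); note that -log(sigmoid(m)) == softplus(-m)."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def e_step(margins, annotators, reliability, eps=1e-6):
    """Posterior confidence that each observed label is correct.

    margins[i]    : implicit reward margin (chosen minus rejected), assumed
                    to be already scaled by the DPO temperature.
    annotators[i] : id of the annotator who produced label i.
    reliability   : dict mapping annotator id -> assumed P(label is correct).
    """
    confidences = []
    for m, a in zip(margins, annotators):
        p = sigmoid(m)  # policy's belief that the observed label is right
        eta = min(max(reliability[a], eps), 1.0 - eps)  # avoid a zero denominator
        num = eta * p
        den = num + (1.0 - eta) * (1.0 - p)
        confidences.append(num / den)
    return confidences


def m_step_reliability(confidences, annotators):
    """Re-estimate each annotator's reliability as the mean confidence of its labels."""
    totals, counts = {}, {}
    for w, a in zip(confidences, annotators):
        totals[a] = totals.get(a, 0.0) + w
        counts[a] = counts.get(a, 0) + 1
    return {a: totals[a] / counts[a] for a in totals}


def weighted_dpo_loss(margins, confidences):
    """Mean of w_i * (-log sigmoid(margin_i)); confidences are treated as constants."""
    losses = [w * softplus(-m) for m, w in zip(margins, confidences)]
    return sum(losses) / len(losses)
```

In an actual training loop the margins would come from the policy and reference model log-probabilities, and the weighted loss would be minimized with a gradient-based optimizer while the confidences are held fixed within each step. Under this assumed model, a label the policy strongly disagrees with receives a low confidence and therefore contributes little to the update.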

This keeps the training pipeline close to DPO while improving robustness against corrupted or inconsistent feedback.

Practical training uses mini-batch compatible reliability updates.
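The page does not state the exact mini-batch update rule. One common way to make a per-annotator estimate compatible with mini-batches is an exponential moving average over batch-level means; the sketch below shows that idea only as an assumption. The function name `ema_reliability_update` and the `momentum` parameter are hypothetical and do not come from the paper.

```python
def ema_reliability_update(reliability, batch_confidences, batch_annotators, momentum=0.9):
    """Blend each annotator's running reliability with the current batch mean.

    reliability       : dict annotator id -> running reliability estimate.
    batch_confidences : posterior confidences for the labels in this batch.
    batch_annotators  : annotator id for each label in this batch.
    momentum          : weight kept on the previous estimate (assumed value).
    """
    sums, counts = {}, {}
    for w, a in zip(batch_confidences, batch_annotators):
        sums[a] = sums.get(a, 0.0) + w
        counts[a] = counts.get(a, 0) + 1

    updated = dict(reliability)  # annotators absent from the batch stay unchanged
    for a, total in sums.items():
        batch_mean = total / counts[a]
        previous = reliability.get(a, batch_mean)  # first sighting: start at batch mean
        updated[a] = momentum * previous + (1.0 - momentum) * batch_mean
    return updated
```

A higher `momentum` makes the estimate smoother but slower to react when an annotator's quality changes, which is the usual trade-off with moving averages.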

[Figure: flow chart of the RE-PO EM loop, from noisy labels to weighted policy optimization. RE-PO alternates confidence estimation and weighted policy updates.]

Key Results

Across the DPO, IPO, SimPO, and CPO families, RE-PO generally improves LC and WR on UltraFeedback, and RE-DPO (RE-PO applied to DPO) remains stronger than DPO on MultiPref.

Dataset	Model	Method	Standard (LC/WR)	RE-PO (LC/WR)	Delta LC	Delta WR

Noise Robustness

Under synthetic noise, RE-PO's estimated reliabilities track the ground-truth trends in both the single-annotator and the two-annotator settings.

[Figure: single-annotator reliability under varying synthetic noise. The estimated reliability follows the injected-noise trajectory and tracks the ground-truth reliability.]

[Figure: two-annotator reliability under synthetic noise. RE-PO separates a stable annotator from a progressively noisier one.]
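For readers who want to set up a similar check, synthetic label noise is commonly injected by swapping the chosen and rejected responses with a fixed probability, so that the ground-truth reliability is one minus the flip probability. The sketch below assumes that protocol; the page does not specify how the noise was generated, and `inject_label_noise` is a hypothetical helper.

```python
import random


def inject_label_noise(pairs, flip_prob, seed=0):
    """Swap (chosen, rejected) with probability flip_prob.

    pairs     : list of (chosen, rejected) tuples with clean labels.
    flip_prob : probability of corrupting each label (assumed noise model).
    seed      : seed for a local random generator, for reproducibility.

    Returns the noisy pairs and a parallel list of booleans marking flips.
    """
    rng = random.Random(seed)
    noisy, flipped = [], []
    for chosen, rejected in pairs:
        flip = rng.random() < flip_prob
        noisy.append((rejected, chosen) if flip else (chosen, rejected))
        flipped.append(flip)
    return noisy, flipped


def true_reliability(flipped):
    """Fraction of labels left uncorrupted, i.e. the ground-truth reliability."""
    return 1.0 - sum(flipped) / len(flipped)
```

Comparing `true_reliability` against a learned per-annotator estimate at several flip probabilities gives the kind of tracking curve the figures above describe.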

Citation

If RE-PO is useful to your research, please cite:

@inproceedings{cao2026repo,
  title     = {RE-PO: Robust Enhanced Policy Optimization as a General Framework for LLM Alignment},
  author    = {Cao, Xiaoyang and Xu, Zelai and Guang, Mo and Long, Kaiwen and Bakker, Michiel A. and Wang, Yu and Yu, Chao},
  booktitle = {International Conference on Learning Representations (ICLR)},
  year      = {2026}
}