Our experiments on Alpaca-7B demonstrate that RePO improves safety alignment and reduces safety interference compared to baseline methods.
To address this issue, we propose Rectified Policy Optimization (RePO), which replaces the average safety constraint with stricter (per prompt) safety constraints.
Balancing helpfulness and safety (harmlessness) is a critical challenge in aligning large language models (LLMs).
Enhancing Safety in Reinforcement Learning with Human Feedback via Rectified Policy Optimization. https://arxiv.org/abs/2410.19933.
At the core of RePO is a policy update mechanism driven by rectified policy gradients, which penalize each prompt's safety violation individually rather than only the average violation across prompts.
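To make the rectified, per-prompt penalty described above concrete, here is a minimal sketch in PyTorch; the function name, the plain REINFORCE-style surrogate, and the symbols reward, cost, budget, and lam are illustrative assumptions, not the paper's exact objective or notation.

    # Hedged sketch of a per-prompt rectified safety penalty of the kind
    # described above; this is not the paper's exact surrogate.
    import torch

    def rectified_surrogate_loss(logprobs: torch.Tensor,
                                 reward: torch.Tensor,
                                 cost: torch.Tensor,
                                 budget: float = 0.0,
                                 lam: float = 1.0) -> torch.Tensor:
        """Policy-gradient surrogate with a rectified (per-prompt) safety penalty.

        logprobs: (batch,) log-probability of each sampled response under the policy
        reward:   (batch,) helpfulness reward for each (prompt, response) pair
        cost:     (batch,) safety cost for each (prompt, response) pair
        """
        # An average-based constraint would penalize cost.mean() - budget once
        # for the whole batch, so very safe prompts can offset unsafe ones.
        # Rectifying each prompt's violation with ReLU removes that offsetting:
        # prompts under the budget contribute no safety gradient, while every
        # violating prompt is always penalized.
        violation = torch.relu(cost - budget)      # per-prompt rectified violation
        signal = reward - lam * violation          # combined per-prompt signal
        # REINFORCE-style surrogate (no baseline or clipping, kept minimal).
        return -(logprobs * signal.detach()).mean()

In a full RLHF pipeline this per-prompt signal would more likely feed a PPO-style clipped surrogate with advantage estimation; the only point of the sketch is the ReLU on each prompt's violation in place of a single batch-average constraint.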
The paper introduces the Rectified Policy Optimization (RePO) algorithm for enhancing safety in reinforcement learning from human feedback. RePO builds on ...
Current approaches often decouple these two objectives ...
Xin Liu, Honghao Wei, and Lei Ying. Enhancing Safety in Reinforcement Learning with Human Feedback via Rectified Policy Optimization. [arXiv] [Code] Xiyue Peng, ...
This work formalizes the safety concern of LLMs as an optimization task of maximizing the reward function while satisfying specified cost constraints and ...
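In symbols (mine, not necessarily the paper's: R a helpfulness reward, C a safety cost, b a cost budget), the average-constrained formulation and the stricter per-prompt variant that the snippets above contrast can be written as:

    \max_{\pi}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot\mid x)}\!\big[R(x,y)\big]
    \quad \text{s.t.} \quad
    \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot\mid x)}\!\big[C(x,y)\big] \le b
    \qquad \text{(average safety constraint)}

    \max_{\pi}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot\mid x)}\!\big[R(x,y)\big]
    \quad \text{s.t.} \quad
    \mathbb{E}_{y \sim \pi(\cdot\mid x)}\!\big[C(x,y)\big] \le b \ \ \text{for every prompt } x
    \qquad \text{(per-prompt safety constraint)}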
In this paper, we study reinforcement learning from human feedback (RLHF) under an episodic Markov decision process with a general trajectory-wise reward model.
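For reference, a trajectory-wise reward in an episodic MDP of horizon H (notation mine, as a reading of "general trajectory-wise reward model") assigns reward to whole trajectories and need not decompose into per-step terms:

    \tau = (s_1, a_1, s_2, a_2, \ldots, s_H, a_H),
    \qquad
    r(\tau) \ \text{is not required to equal} \ \sum_{h=1}^{H} r_h(s_h, a_h).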