Second thoughts are best: Learning to re-align with human values from text edits
Abstract: We present Second Thoughts, a new learning paradigm that enables language
models (LMs) to re-align with human values. By modeling the chain-of-edits between
value-unaligned and value-aligned text, with LM fine-tuning and additional refinement
through reinforcement learning, Second Thoughts not only achieves superior performance on
three value alignment benchmark datasets but also shows strong human-value transfer
learning ability in few-shot scenarios. The generated editing steps also offer better interpretability and …
Second Thoughts are Best: Learning to Re-Align With Human Values from Text Edits - Appendix
In Figure A1, we show the procedure for converting the data samples in the alignment
datasets into training data for AEM (negative samples used in AIL are generated similarly).
In the DP-inferred chain-of-edits (CoEs), we use a few special tokens to mark the editing
operations (with their position and content). Our decipher module then translates these
special tokens into natural language. As the final step, we add a special token [SEP]
between Context + Source and the ground truth Chain-of-Edits (CoEs) and Target, as a …
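
The conversion described in this snippet can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes word-level Levenshtein dynamic programming as the "DP" step, uses plain (operation, position, content) tuples in place of the paper's special tokens, and invents the natural-language templates in `decipher`. The function names `infer_edits`, `decipher`, and `build_training_example`, and the exact layout around [SEP], are hypothetical.

```python
from typing import List, Tuple

# Hypothetical stand-in for the special edit tokens: (operation, source position, content).
Edit = Tuple[str, int, str]


def infer_edits(source: List[str], target: List[str]) -> List[Edit]:
    """Infer a chain of edits with word-level Levenshtein DP plus backtrace (an assumption)."""
    n, m = len(source), len(target)
    # cost[i][j] = minimum number of edits turning source[:i] into target[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        cost[i][0] = i
    for j in range(m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if source[i - 1] == target[j - 1] else 1
            cost[i][j] = min(cost[i - 1][j] + 1,        # delete a source word
                             cost[i][j - 1] + 1,        # insert a target word
                             cost[i - 1][j - 1] + sub)  # keep or replace
    # Walk back from the bottom-right corner to recover the operations.
    edits: List[Edit] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and source[i - 1] == target[j - 1] \
                and cost[i][j] == cost[i - 1][j - 1]:
            i, j = i - 1, j - 1  # words match, no edit
        elif i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + 1:
            edits.append(("replace", i - 1, target[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            edits.append(("delete", i - 1, source[i - 1]))
            i -= 1
        else:
            edits.append(("insert", i, target[j - 1]))
            j -= 1
    edits.reverse()  # report edits in left-to-right source order
    return edits


def decipher(edits: List[Edit], source: List[str]) -> str:
    """Translate edit tuples into natural language (templates are invented here)."""
    phrases = []
    for op, pos, content in edits:
        if op == "replace":
            phrases.append(f'replace "{source[pos]}" at position {pos} with "{content}"')
        elif op == "delete":
            phrases.append(f'delete "{content}" at position {pos}')
        else:
            phrases.append(f'insert "{content}" at position {pos}')
    return "; ".join(phrases) if phrases else "no edits"


def build_training_example(context: str, source: str, target: str) -> str:
    """Join Context + Source, [SEP], then the deciphered CoEs and Target."""
    src_tokens, tgt_tokens = source.split(), target.split()
    coe = decipher(infer_edits(src_tokens, tgt_tokens), src_tokens)
    return f"{context} {source} [SEP] {coe} {target}"


if __name__ == "__main__":
    print(build_training_example("Ctx.", "I will steal the money", "I will return the money"))
```

Keeping the edit inference separate from the deciphering step mirrors the two stages named in the snippet, so either stage could be swapped for the paper's actual token scheme without touching the other.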