Google Scholar

Second thoughts are best: Learning to re-align with human values from text edits

R Liu, C Jia, G Zhang, Z Zhuang… - Advances in Neural …, 2022 - proceedings.neurips.cc

Abstract: We present Second Thoughts, a new learning paradigm that enables language
models (LMs) to re-align with human values. By modeling the chain-of-edits between
value-unaligned and value-aligned text, with LM fine-tuning and additional refinement
through reinforcement learning, Second Thoughts not only achieves superior performance
on three value alignment benchmark datasets but also shows strong human-value transfer
learning ability in few-shot scenarios. The generated editing steps also offer better
interpretability and …

Cited by 30

[PDF] Second Thoughts are Best: Learning to Re-Align With Human Values from Text Edits - Appendix

R Liu, C Jia, G Zhang, Z Zhuang, TX Liu, S Vosoughi - proceedings.neurips.cc

In Figure A1, we show the procedure for converting the data samples in the alignment
datasets into training data of AEM (negative samples used in AIL are generated similarly). In
DP-inferred chain-of-edits (CoEs), we use a few special tokens to mark the editing
operations (with their position and content). Then our decipher module will translate these
special tokens into natural language. As the final step, we add a special token [SEP]
between Context + Source and the ground truth Chain-of-Edits (CoEs) and Target, as a …
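
The conversion described in this snippet can be illustrated with a minimal sketch. It is not the authors' code. It assumes that "DP-inferred" refers to a token-level Levenshtein dynamic program, and the names `infer_edits`, `decipher`, and `build_training_text`, the edit tuple layout, and the natural-language templates are all hypothetical. Only the [SEP] token and the ordering (Context + Source, then CoEs and Target) come from the text above.

```python
from typing import List, Tuple

Edit = Tuple[str, int, str]  # (operation, position in source, token); hypothetical layout


def infer_edits(source: List[str], target: List[str]) -> List[Edit]:
    """Recover a minimal token-level edit script with Levenshtein DP (assumed method)."""
    n, m = len(source), len(target)
    # dist[i][j] = edit distance between source[:i] and target[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # delete source token
                dist[i][j - 1] + 1,         # insert target token
                dist[i - 1][j - 1] + cost,  # keep or replace
            )
    # Walk back from the bottom-right corner to collect the edits.
    edits: List[Edit] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and source[i - 1] == target[j - 1] and dist[i][j] == dist[i - 1][j - 1]:
            i, j = i - 1, j - 1  # tokens match, no edit
        elif i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + 1:
            edits.append(("replace", i - 1, target[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            edits.append(("delete", i - 1, source[i - 1]))
            i -= 1
        else:
            edits.append(("insert", i, target[j - 1]))
            j -= 1
    edits.reverse()
    return edits


def decipher(edits: List[Edit]) -> str:
    """Translate edit tuples into natural language (templates are illustrative)."""
    templates = {
        "replace": "replace token {pos} with '{tok}'",
        "delete": "delete token {pos} ('{tok}')",
        "insert": "insert '{tok}' at position {pos}",
    }
    return "; ".join(templates[op].format(pos=pos, tok=tok) for op, pos, tok in edits)


def build_training_text(context: str, source: List[str], target: List[str]) -> str:
    """Join Context + Source, [SEP], then the deciphered chain-of-edits and Target."""
    coe = decipher(infer_edits(source, target))
    return f"{context} {' '.join(source)} [SEP] {coe} {' '.join(target)}"
```

For example, calling `build_training_text` with a source of "I would steal it" and a target of "I would return it" (both split on whitespace) yields a single replace edit at position 2, placed between [SEP] and the target text.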
