[Survey] Recent Approaches to Superalignment
This is a collection of recent papers and approaches on superalignment and related topics.
Key papers
- Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision (OpenAI Blog, arXiv; see the sketch after this list)
- Improving Weak-to-Strong Generalization with Scalable Oversight and Ensemble Learning (arXiv)
- Co-Supervised Learning: Improving Weak-to-Strong Generalization with Hierarchical Mixture of Experts (arXiv)
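The papers above share the weak-to-strong setup: a small "weak" supervisor labels data, and a larger "strong" student is fine-tuned on those labels. Below is a minimal sketch of one training step, assuming a simple classification task; the auxiliary confidence term loosely follows the idea described in the OpenAI paper, and all names and hyperparameters (`alpha`, the models, the optimizer) are illustrative rather than taken from any released code.

```python
import torch
import torch.nn.functional as F

def weak_to_strong_step(strong_model, weak_model, x, alpha=0.5, optimizer=None):
    """One illustrative weak-to-strong training step.

    Fits the strong model to the weak model's soft labels, plus an auxiliary
    confidence term that pulls the strong model toward its own hardened
    predictions (roughly the idea described in the OpenAI paper)."""
    with torch.no_grad():
        weak_probs = torch.softmax(weak_model(x), dim=-1)  # weak supervisor labels

    logits = strong_model(x)
    log_probs = F.log_softmax(logits, dim=-1)

    # Cross-entropy against the (soft) weak labels.
    ce_weak = -(weak_probs * log_probs).sum(dim=-1).mean()

    # Auxiliary confidence loss: cross-entropy against the strong model's own
    # hardened (argmax) predictions, detached so they act as fixed targets.
    hard_targets = logits.argmax(dim=-1).detach()
    ce_self = F.cross_entropy(logits, hard_targets)

    loss = (1 - alpha) * ce_weak + alpha * ce_self
    if optimizer is not None:
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return loss.item()
```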
RL-based approaches
- PPO: Proximal policy optimization algorithms (arXiv; see the sketches after this list)
- Deep reinforcement learning from human preferences (arXiv)
- Learning to summarize from human feedback (arXiv)
- Curry-DPO: Enhancing Alignment using Curriculum Learning & Ranked Preferences (arXiv)
- Improving Reinforcement Learning from Human Feedback Using Contrastive Rewards (arXiv)
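Most entries in this list build on one of two objectives: the PPO clipped surrogate used in classic RLHF pipelines, and the DPO-style pairwise loss that Curry-DPO extends with curriculum ordering over ranked preferences. The sketch below writes both in a few lines of PyTorch; tensor shapes, hyperparameter values, and function names are illustrative, not taken from any specific implementation.

```python
import torch
import torch.nn.functional as F

def ppo_clipped_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """PPO clipped surrogate: bound how far the updated policy can move
    from the policy that generated the rollouts."""
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO pairwise loss: increase the margin of the preferred response over
    the rejected one, measured relative to a frozen reference policy."""
    margin = (logp_chosen - ref_chosen) - (logp_rejected - ref_rejected)
    return -F.logsigmoid(beta * margin).mean()
```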
Principles
- Understanding the Learning Dynamics of Alignment with Human Feedback (arXiv)
- On the Essence and Prospect: An Investigation of Alignment Approaches for Big Models (arXiv)
Learning algorithms
- Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models (arXiv; see the sketch after this list)
- Easy-to-Hard Generalization: Scalable Alignment Beyond Human Supervision (arXiv)
- The Unreasonable Effectiveness of Easy Training Data for Hard Tasks (arXiv)
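The self-play fine-tuning idea is close in spirit to the pairwise loss sketched above: the human-written response is treated as preferred over the model's own generation from the previous round, with the previous-round model acting as the reference. A hedged sketch of that objective, in our own notation rather than the paper's exact formulation, is:

$$
\mathcal{L}(\theta) = \mathbb{E}_{(x,\,y)\sim \mathcal{D},\; y' \sim p_{\theta_t}(\cdot\mid x)}
\left[\, \log\!\left(1 + \exp\!\left(-\lambda\left[
\log\frac{p_\theta(y\mid x)}{p_{\theta_t}(y\mid x)}
- \log\frac{p_\theta(y'\mid x)}{p_{\theta_t}(y'\mid x)}
\right]\right)\right) \right]
$$

Here $y$ is the human response, $y'$ is sampled from the previous-round model $p_{\theta_t}$, and $\lambda$ scales the margin; minimizing this pushes the current model $p_\theta$ to assign relatively more probability to the human data than to its own earlier outputs.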
Other approaches
- Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models (arXiv)
- Tuna: Instruction Tuning using Feedback from Large Language Models (arXiv)
- Rewards-in-Context: Multi-objective Alignment of Foundation Models with Dynamic Preference Adjustment (arXiv)
- Rethinking Information Structures in RLHF: Reward Generalization from a Graph Theory Perspective (arXiv)
- Weak-to-Strong Jailbreaking on Large Language Models (arXiv)