This is a collection of recent approaches and papers on super-alignment and related topics.

Key papers

  • Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision (OpenAI Blog, arXiv; see the sketch after this list)
  • Improving Weak-to-Strong Generalization with Scalable Oversight and Ensemble Learning (arXiv)
  • Co-Supervised Learning: Improving Weak-to-Strong Generalization with Hierarchical Mixture of Experts (arXiv)
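
The weak-to-strong recipe in the OpenAI paper fine-tunes a strong "student" model on labels produced by a weaker supervisor, optionally mixing in an auxiliary confidence term that lets the student trust its own predictions when they conflict with the weak labels. Below is a minimal PyTorch sketch of that idea; the stand-in linear classifiers and the mixing weight `alpha` are illustrative assumptions, not the authors' code or hyperparameters.

```python
import torch
import torch.nn.functional as F

def weak_to_strong_loss(strong_logits, weak_logits, alpha=0.5):
    """Cross-entropy against the weak supervisor's pseudo-labels, mixed with an
    auxiliary term that sharpens the strong model's own predictions.
    `alpha` is an illustrative mixing weight, not a value from the paper."""
    weak_labels = weak_logits.argmax(dim=-1)              # hard pseudo-labels from the weak model
    imitate_weak = F.cross_entropy(strong_logits, weak_labels)
    self_labels = strong_logits.argmax(dim=-1).detach()   # strong model's own hard predictions
    trust_self = F.cross_entropy(strong_logits, self_labels)
    return (1 - alpha) * imitate_weak + alpha * trust_self

# Toy usage with stand-in linear classifiers (real setups fine-tune pretrained LMs).
weak_model = torch.nn.Linear(16, 4)
strong_model = torch.nn.Linear(16, 4)
x = torch.randn(8, 16)
loss = weak_to_strong_loss(strong_model(x), weak_model(x).detach())
loss.backward()
```

In the listed papers both models are pretrained language models, only the strong one receives gradients, and the weak supervisor's outputs are treated as fixed pseudo-labels.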

RL-based approaches

  • PPO: Proximal policy optimization algorithms (arXiv; see the sketch after this list)
  • Deep reinforcement learning from human preferences (arXiv)
  • Learning to summarize from human feedback (arXiv)
  • Curry-DPO: Enhancing Alignment using Curriculum Learning & Ranked Preferences (arXiv)
  • Improving Reinforcement Learning from Human Feedback Using Contrastive Rewards (arXiv)
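
PPO's clipped surrogate objective is the optimizer behind the RLHF pipeline used in several of the papers above. The snippet below is a minimal sketch of that objective, assuming advantages and per-sample log-probabilities have already been computed during rollout; the name `clip_eps` and the toy tensors are illustrative assumptions.

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped surrogate objective from 'Proximal policy optimization algorithms'.
    logp_new / logp_old: log-probabilities of the taken actions under the current
    and rollout policies; advantages: precomputed advantage estimates."""
    ratio = torch.exp(logp_new - logp_old)                 # importance ratio r_t
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    surrogate = torch.min(ratio * advantages, clipped * advantages)
    return -surrogate.mean()                               # maximize surrogate => minimize its negative

# Toy usage with random tensors standing in for rollout statistics.
logp_old = torch.randn(32)
logp_new = torch.randn(32, requires_grad=True)
advantages = torch.randn(32)
ppo_clip_loss(logp_new, logp_old, advantages).backward()
```

Clipping the importance ratio keeps each update close to the rollout policy, which is what makes PPO stable enough to optimize a learned reward model in the RLHF papers listed here.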

Principles

  • Understanding the Learning Dynamics of Alignment with Human Feedback (arXiv)
  • On the Essence and Prospect: An Investigation of Alignment Approaches for Big Models (arXiv)

Learning algorithms

  • Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models (arXiv; see the sketch after this list)
  • Easy-to-Hard Generalization: Scalable Alignment Beyond Human Supervision (arXiv)
  • The Unreasonable Effectiveness of Easy Training Data for Hard Tasks (arXiv)
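
Self-Play Fine-Tuning (SPIN) trains the current model to prefer ground-truth responses over responses sampled from its own previous iteration, using a DPO-style pairwise logistic loss. The sketch below shows only that pairwise objective; the iterated data-generation loop is omitted, and the tensor names and `beta` scale are illustrative assumptions rather than values from the paper.

```python
import torch
import torch.nn.functional as F

def self_play_pairwise_loss(logp_real, logp_synth, ref_logp_real, ref_logp_synth, beta=0.1):
    """DPO-style logistic loss: push the current policy to assign a higher
    reference-adjusted log-likelihood to human responses (logp_real) than to
    responses sampled from the previous-iteration model (logp_synth).
    `beta` is an illustrative scale, not a value from the paper."""
    margin = beta * ((logp_real - ref_logp_real) - (logp_synth - ref_logp_synth))
    return -F.logsigmoid(margin).mean()

# Toy usage: sequence-level log-likelihoods under the current and frozen reference models.
logp_real = torch.randn(16, requires_grad=True)
logp_synth = torch.randn(16, requires_grad=True)
self_play_pairwise_loss(logp_real, logp_synth, torch.randn(16), torch.randn(16)).backward()
```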

Other Approaches

  • Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models (arXiv)
  • Tuna: Instruction Tuning using Feedback from Large Language Models (arXiv)
  • Rewards-in-Context: Multi-objective Alignment of Foundation Models with Dynamic Preference Adjustment (arXiv)
  • Rethinking Information Structures in RLHF: Reward Generalization from a Graph Theory Perspective (arXiv)
  • Weak-to-Strong Jailbreaking on Large Language Models (arXiv)
