This is a brief review of “Emergent Hierarchical Reasoning in LLMs through Reinforcement Learning (HICRA)”.
You can see the paper at this link.

Overview

This work argues that RL improves LLM reasoning through an emergent two‑phase hierarchy: early training fixes low‑level procedural tokens, then later gains come from high‑level strategic planning. Based on this, the authors introduce HICRA, which increases credit assignment on planning tokens (rather than all tokens as in GRPO), leading to stronger performance.

Key Ideas

  • Reinterprets ‘aha moments’ and response-length scaling as surface signs of an emergent planning hierarchy.
  • Hierarchy‑Aware Credit Assignment (HICRA) amplifies gradients on planning tokens.
  • Outperforms GRPO-style baselines by concentrating credit on the true bottleneck: strategy.
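The contrast with GRPO can be sketched in a few lines. This is a hypothetical illustration, not the authors' implementation: the amplification factor `alpha` and the boolean `planning_mask` (which tokens count as "planning") are assumptions for illustration only.

```python
# Hypothetical sketch of HICRA-style credit assignment (not the paper's code).
# In GRPO, every token of a sampled response shares one group-normalized
# advantage; the HICRA idea is to additionally amplify the advantage on
# tokens identified as planning tokens, by an assumed factor alpha.

def grpo_advantages(rewards):
    """Group-relative advantage: reward minus group mean, scaled by group std."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5 or 1.0  # fall back to 1.0 when all rewards are equal
    return [(r - mean) / std for r in rewards]

def hicra_token_advantages(token_advs, planning_mask, alpha=0.5):
    """Boost credit on planning tokens; procedural tokens keep the base advantage."""
    return [a * (1 + alpha) if is_plan else a
            for a, is_plan in zip(token_advs, planning_mask)]
```

Under this sketch, a response's planning tokens receive a proportionally larger gradient signal while procedural tokens are left at the GRPO baseline, which is the targeted-credit idea the paper describes.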

Why it matters

The paper clarifies why RL improves reasoning and offers a targeted algorithm (HICRA) that can be plugged into many RL setups for further gains.
