
๐Ÿง  Paper Review: TRPO & PPO

์ž์„ธํ•œ ๋ฆฌ๋ทฐ ๋‚ด์šฉ์€ slide link์—์„œ ํ™•์ธ๊ฐ€๋Šฅํ•ฉ๋‹ˆ๋‹ค.


๐Ÿ” Recap: Policy Gradient

  • Reinforcement Learning์—์„œ Policy Gradient๋Š” Monte Carlo Approximation์„ ์ด์šฉํ•ด ๊ธฐ๋Œ€ ๋ณด์ƒ์„ ์ถ”์ •.
  • Reward-to-go ํ˜•ํƒœ๋กœ ํ‘œํ˜„ํ•˜์—ฌ ๊ฐ ํ–‰๋™์˜ ๋ฏธ๋ž˜ ๋ณด์ƒ๋งŒ ๊ณ ๋ ค.
  • ํ•˜์ง€๋งŒ ์ถ”์ • ๊ณผ์ •์— ๋…ธ์ด์ฆˆ๊ฐ€ ์กด์žฌํ•˜์—ฌ ํ•™์Šต์ด ๋ถˆ์•ˆ์ •ํ•  ์ˆ˜ ์žˆ์Œ.
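
A minimal sketch of the reward-to-go computation mentioned above, assuming per-trajectory rewards are stored in a NumPy array; the function name and discount value are illustrative, not taken from any specific codebase.

```python
import numpy as np

def reward_to_go(rewards, gamma=0.99):
    """Discounted reward-to-go: R_t = sum_{k >= t} gamma^(k - t) * r_k."""
    rtg = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg

# The Monte Carlo policy-gradient estimate then weights each
# grad_theta log pi(a_t | s_t) by rtg[t] and averages over trajectories.
```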

๐ŸŽฏ Variance Reduction

  • Baseline term์„ ๋„์ž…ํ•˜๋ฉด ๋ถ„์‚ฐ์„ ์ค„์ด๋ฉด์„œ๋„ unbiased estimator๋ฅผ ์œ ์ง€ํ•  ์ˆ˜ ์žˆ์Œ.
  • Baseline์€ ํŒŒ๋ผ๋ฏธํ„ฐ ฮธ์™€ ๋…๋ฆฝ์ .

โš™๏ธ Motivation

Policy Gradient์˜ ํ•™์Šต ์•ˆ์ •์„ฑ์„ ๋†’์ด๊ธฐ ์œ„ํ•ด ๋‘ ๊ฐ€์ง€ ์ ‘๊ทผ์ด ์ œ์•ˆ๋จ:

  1. Parameter Space Regularization
    • ํŒŒ๋ผ๋ฏธํ„ฐ์˜ ๋ณ€ํ™”๋Ÿ‰์„ ์ง์ ‘ ๊ทœ์ œ (linearization ๊ธฐ๋ฐ˜).
  2. Policy Space Regularization
    • ์ •์ฑ… ๊ฐ„์˜ ์ฐจ์ด๋ฅผ ์ง์ ‘ ๊ทœ์ œ (์ฆ‰, ํ–‰๋™ ๋ถ„ํฌ์˜ ๋ณ€ํ™” ์ œํ•œ).

โš ๏ธ ๋‹จ, ํŒŒ๋ผ๋ฏธํ„ฐ ๊ธฐ๋ฐ˜ ์ •๊ทœํ™”๋Š” ๋„คํŠธ์›Œํฌ์˜ parameterization์— ๋”ฐ๋ผ ๋‹ฌ๋ผ์งˆ ์ˆ˜ ์žˆ์Œ โ†’ ์ •์ฑ… ๊ณต๊ฐ„์—์„œ์˜ ์ •๊ทœํ™”๊ฐ€ ๋” ์ผ๋ฐ˜์ .


๐Ÿš€ TRPO (Trust Region Policy Optimization)

๐Ÿ“˜ Theoretical Foundations

TRPO builds on the result of Kakade & Langford (2002):

โ€œApproximately optimal approximate reinforcement learning.โ€

  • ์ƒˆ๋กœ์šด ์ •์ฑ…์˜ ๊ธฐ๋Œ€ ๋ณด์ƒ์€ ๊ธฐ์กด ์ •์ฑ…์˜ advantage ํ•จ์ˆ˜๋ฅผ ํ†ตํ•ด ํ‘œํ˜„ ๊ฐ€๋Šฅ.
  • ๋‹จ, ๋‘ ์ •์ฑ…์ด ์ถฉ๋ถ„ํžˆ โ€œ๊ฐ€๊นŒ์šดโ€ ๊ฒฝ์šฐ์—๋งŒ ๊ทผ์‚ฌ๊ฐ€ ์œ ํšจ.
  • TRPO๋Š” ์ •์ฑ… ๊ฐ„์˜ ์ฐจ์ด๋ฅผ KL Divergence๋กœ ์ œํ•œํ•˜๋Š” constrained optimization์œผ๋กœ ์ ‘๊ทผํ•จ.

โš–๏ธ Optimization Formulation

์ตœ์ข… ๋ชฉ์  ํ•จ์ˆ˜: [ \max_\theta \; \hat{E}t \left[ \frac{\pi\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)} A_t \right] ] subject to: [ \hat{E}t [ KL(\pi{\theta_{old}}(\cdot|s_t) | \pi_\theta(\cdot|s_t)) ] \le \delta ]

  • KL ์ œ์•ฝ ์กฐ๊ฑด์€ โ€œtrust regionโ€์„ ํ˜•์„ฑํ•˜์—ฌ ํ•™์Šต์˜ ์•ˆ์ •์„ฑ์„ ๋ณด์žฅ.
  • TRPO๋Š” on-policy ํ•™์Šต์ด์ง€๋งŒ, old policy์˜ ๋ฐ์ดํ„ฐ๋กœ ๊ทผ์‚ฌํ•˜๋ฏ€๋กœ semi-off-policy ์„ฑ๊ฒฉ๋„ ๊ฐ€์ง.
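
As a rough illustration only: the sketch below optimizes a KL-penalized version of the surrogate objective. Full TRPO instead solves the constrained problem with a conjugate-gradient step and a backtracking line search, so this is a simplification, and the names are placeholders.

```python
import torch

def penalized_surrogate(log_p_new, log_p_old, advantages, kl, beta=1.0):
    """Importance-weighted advantage minus a KL penalty (to be maximized).

    ratio = pi_theta / pi_theta_old, computed from stored log-probabilities;
    `kl` is the mean KL(pi_old || pi_new) over the sampled states.
    """
    ratio = torch.exp(log_p_new - log_p_old)
    surrogate = (ratio * advantages).mean()
    return surrogate - beta * kl
```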

๐Ÿงฉ PPO (Proximal Policy Optimization)

๐ŸŽฏ Motivation

  • TRPO์˜ constrained optimization์€ ๊ณ„์‚ฐ์ด ๋ณต์žกํ•˜๊ณ , ฮฒ(๋ผ๊ทธ๋ž‘์ฃผ ๊ณ„์ˆ˜)์˜ ์„ค์ •์ด ๋ฌธ์ œ์ž„.
  • PPO๋Š” unconstrained optimization์œผ๋กœ ๋‹จ์ˆœํ™”ํ•˜๋ฉด์„œ TRPO์˜ ์•ˆ์ •์„ฑ์„ ์œ ์ง€ํ•˜๋ ค ํ•จ.

๐Ÿ” Approach

  • Probability ratio: \( r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)} \)
  • Clipped objective:

\[
L^{CLIP}(\theta) = \hat{E}_t \left[ \min\!\left( r_t(\theta) A_t, \; \mathrm{clip}\!\left( r_t(\theta),\, 1 - \epsilon,\, 1 + \epsilon \right) A_t \right) \right]
\]

  • ์ด๋•Œ clip์€ ํ™•๋ฅ  ๋น„์œจ์ด (1-\epsilon)๊ณผ (1+\epsilon) ๋ฒ”์œ„๋ฅผ ๋ฒ—์–ด๋‚˜์ง€ ์•Š๋„๋ก ์ œํ•œ.
  • ๊ฒฐ๊ณผ์ ์œผ๋กœ, exploit-prone update๋ฅผ ๋ฐฉ์ง€ํ•˜๊ณ  ์•ˆ์ •์ ์ธ ํ•™์Šต์„ ์œ ๋„.

๐Ÿงช Experiments

  • ๋‹ค์–‘ํ•œ ํ™˜๊ฒฝ์—์„œ PPO๋Š” TRPO๋ณด๋‹ค ๋‹จ์ˆœํ•˜๋ฉด์„œ๋„ ๋น„์Šทํ•˜๊ฑฐ๋‚˜ ๋” ๋‚˜์€ ์„ฑ๋Šฅ์„ ๋‹ฌ์„ฑ.
  • clipping factor ( \epsilon )์€ ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ๋กœ, ๋ฌธ์ œ์— ๋”ฐ๋ผ ์กฐ์ • ํ•„์š”.

๐Ÿง  Summary

| Item | TRPO | PPO |
| --- | --- | --- |
| Objective | Policy update within a trust region | Approximation via a clipped objective |
| Constraint | KL-divergence constraint | Unconstrained (clipping) |
| Computational cost | High | Low |
| Stability | High | High |
| Practicality | Moderate | Very high |

๐Ÿ“š Reference

  • Schulman et al., โ€œTrust Region Policy Optimizationโ€, ICML 2015
  • Schulman et al., โ€œProximal Policy Optimization Algorithmsโ€, arXiv 2017
  • Kakade & Langford, โ€œApproximately Optimal Approximate Reinforcement Learningโ€, ICML 2002
