less than 1 minute read

Key Concepts

  • Clip ratio and value loss balance.
  • Advantage normalization and reward scaling.
  • Early stopping based on KL divergence.

Common Pitfalls

  • Too large learning rate causes policy collapse.
  • Value function overfits when minibatches are too small.

Practical Checklist

  • Normalize advantages per batch.
  • Track KL and stop epochs if it spikes.
  • Use entropy bonus to avoid premature convergence.

Quick Reference

# Pseudocode for PPO update loop
for epoch in range(k_epochs):
    logp_new = policy(obs, act)
    ratio = exp(logp_new - logp_old)
    loss_clip = -min(ratio * adv, clip(ratio, 1-eps, 1+eps) * adv)
    loss = loss_clip + vf_coef * value_loss - ent_coef * entropy

Updated: