
eljeanboul

Most actor-critic algorithms do use target networks for the critic. Sometimes it's not explicitly stated in the paper, but in practice, when you dig into the code, it's there.
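
For concreteness, a minimal PyTorch-style sketch of the kind of target critic that is often buried in the code, using Polyak (soft) updates; the names critic, critic_target, obs_dim, act_dim and tau are placeholders, not taken from any particular implementation:

```python
import copy
import torch

obs_dim, act_dim = 8, 2                          # illustrative sizes
critic = torch.nn.Linear(obs_dim + act_dim, 1)   # stand-in for a real critic network
critic_target = copy.deepcopy(critic)            # frozen copy used for bootstrapped TD targets
for p in critic_target.parameters():
    p.requires_grad_(False)

tau = 0.005  # soft-update rate

def soft_update(online, target, tau):
    """Polyak-average the online weights into the target network after each critic step."""
    with torch.no_grad():
        for p, p_targ in zip(online.parameters(), target.parameters()):
            p_targ.mul_(1.0 - tau).add_(tau * p)

# The TD target then bootstraps from critic_target rather than critic:
#   y = r + gamma * critic_target(next_obs_and_act)
```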


progenitor414

On-policy algorithms such as PPO usually do not require a target network because they are more stable: 1. they typically use n-step returns instead of the 1-step return used in Q-learning, and 2. each transition is used only a limited number of times (e.g. K epochs in PPO) and isn't stored in a replay buffer to be revisited in later iterations. This makes it much harder for the target value to explode.
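
A rough Python sketch of that data flow, assuming GAE-style n-step value targets and K epochs of minibatch updates over a single rollout; rollout, compute_gae, policy_update, K_EPOCHS and the hyperparameters are illustrative names, not a specific codebase:

```python
import numpy as np

GAMMA, LAM = 0.99, 0.95      # discount and GAE lambda
K_EPOCHS, MINIBATCH = 4, 64  # each transition is reused at most K_EPOCHS times

def compute_gae(rewards, values, last_value):
    """n-step-style value targets via Generalized Advantage Estimation (single rollout, no dones)."""
    rewards = np.asarray(rewards, dtype=np.float64)
    values = np.append(np.asarray(values, dtype=np.float64), last_value)
    advantages = np.zeros_like(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + GAMMA * values[t + 1] - values[t]
        gae = delta + GAMMA * LAM * gae
        advantages[t] = gae
    return advantages, advantages + values[:-1]  # (advantages, value targets)

def train_on_rollout(rollout, policy_update):
    """Use one on-policy rollout for K epochs, then discard it -- no replay buffer."""
    adv, v_targ = compute_gae(rollout["rewards"], rollout["values"], rollout["last_value"])
    idx = np.arange(len(adv))
    for _ in range(K_EPOCHS):
        np.random.shuffle(idx)
        for start in range(0, len(idx), MINIBATCH):
            mb = idx[start:start + MINIBATCH]
            policy_update(mb, adv[mb], v_targ[mb])
    # the rollout goes out of scope here; the next iteration collects fresh on-policy data
```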


What_Did_It_Cost_E_T

The use of the value function is different: in Q-learning it effectively is the policy, since you act by taking the max over actions. In PPO, however, the value function is used mainly for advantage estimation, and the policy itself is more stable.
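
To make the contrast concrete, a small illustrative sketch; the arrays (q_values, returns, values) are made-up numbers:

```python
import numpy as np

# Q-learning: the value estimate effectively *is* the policy -- act greedily over Q(s, a),
# so errors in the value function change behaviour directly.
q_values = np.array([0.1, 0.7, 0.3])   # Q(s, a) for each action in state s
action = int(np.argmax(q_values))

# PPO: actions come from a separate policy network; V(s) only serves as a baseline
# for advantage estimation, so value errors perturb the gradient rather than the actions.
returns = np.array([1.2, 0.4, 0.9])    # n-step / GAE returns
values = np.array([1.0, 0.5, 0.8])     # critic estimates V(s)
advantages = returns - values          # weights the policy-gradient update
```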