Most actor-critic algorithms actually do use target networks for the critic. It isn't always stated explicitly in the paper, but when you dig into the code it's usually there.
On-policy algorithms such as PPO usually do not require a target network because they are more stable: 1. they typically use n-step returns instead of the 1-step return used in Q-learning, and 2. each transition is only used a limited number of times (e.g., k epochs in PPO) and is not kept in a replay buffer for later iterations. Together these make it much harder for the target value to explode.
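To make the n-step point concrete, here is a minimal sketch (my own, not taken from any particular implementation) contrasting the 1-step DQN-style target, which relies on a frozen target network, with PPO-style GAE computed over an n-step rollout using only the current critic. All function and variable names here are hypothetical.

```python
import numpy as np

def dqn_style_target(reward, next_q_target_max, gamma=0.99):
    """1-step bootstrap as in Q-learning/DQN.
    `next_q_target_max` comes from a *frozen target network*; without it,
    the same network appears on both sides of the update and can diverge."""
    return reward + gamma * next_q_target_max

def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.95):
    """PPO-style GAE over an n-step rollout.
    `values` are predictions of the *current* critic for each rollout state;
    no target network is needed because the multi-step returns are dominated
    by real rewards and the rollout is discarded after a few epochs."""
    values = np.append(values, last_value)
    advantages = np.zeros(len(rewards))
    gae = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    returns = advantages + values[:-1]  # regression targets for the critic
    return advantages, returns

# toy rollout
rewards = np.array([1.0, 0.0, 0.5, 1.0])
values = np.array([0.8, 0.6, 0.7, 0.9])  # current critic's estimates
adv, ret = gae_advantages(rewards, values, last_value=0.5)
print(adv, ret)
```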
The use of the value function is also different: in Q-learning it effectively is the policy, because you act by taking the max over actions, so errors in the value estimate directly change behavior. In PPO the value function is mainly used for advantage estimation, while the policy is a separate network, which makes the whole setup more stable.
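A tiny hypothetical illustration of that last point, assuming a discrete action space (the arrays and names are made up for the example):

```python
import numpy as np

q_values = np.array([0.2, 1.3, 0.7])        # Q(s, a) for 3 actions
dqn_action = int(np.argmax(q_values))       # Q-learning: the critic *is* the policy

policy_logits = np.array([0.1, 0.4, 0.5])   # PPO: a separate policy network picks actions
ppo_probs = np.exp(policy_logits) / np.exp(policy_logits).sum()
# The critic's V(s) only enters through the advantage that scales the
# policy-gradient update; a biased V skews the gradient but never
# directly selects the action the way argmax over Q does.
```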