Why does experience replay require off-policy learning and how is it different from on-policy learning?
When you use an experience replay buffer, you save the most recent $k$ experiences of the agent, and sample data from that buffer for training. Typically, the agent does a step of training to update its policy for every step in the environment. At any moment in time, the vast majority of experiences in the buffer are generated with a different – earlier – policy than the current policy. And if the policy used to collect data is different than the policy being evaluated or improved, then you need an off-policy method.
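As a rough sketch (the class and method names here are my own choices, not tied to any particular library), a replay buffer is typically just a bounded FIFO container that you push transitions into and sample minibatches from:

```python
import random
from collections import deque

class ReplayBuffer:
    """Bounded FIFO buffer of transitions; the oldest experience is evicted first."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # keeps only the most recent `capacity` transitions

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling: most sampled transitions were generated by
        # older versions of the policy, hence the need for off-policy learning.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```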
Off-policy vs on-policy
There is often confusion about the meaning of off-policy and on-policy. Many people use “on-policy” to refer to any method that evaluates an explicit policy. In this definition, any actor-critic method would be an “on-policy” method because in actor-critic methods, the actor is an explicit parameterized policy that the critic (the value function) evaluates. While many actor-critic methods are on-policy, the term classically means something different, and you can in fact have off-policy actor-critic methods.
The classic meaning of on-policy vs off-policy concerns whether your training method requires the training data to be collected from the policy being evaluated and improved, or whether the method can also be used when the data is collected from a different policy.
If you hold your policy constant and then collect a bunch of data with it, then this data distribution is on-policy and you can use an on-policy method to evaluate/improve it.
If your data is generated by some other policy, be it an exploration policy, older versions of your policy, or maybe even some other expert, then you will need an off-policy method. Since the experience replay buffer is dominated by data generated by earlier versions of the agent’s policy, you will need an off-policy method to do policy evaluation/improvement from it.
Evaluation vs improvement and the strange case of PPO
You may have noticed I keep naming two cases where on/off policy is relevant: for policy evaluation and policy improvement. For most algorithms, both the evaluation and improvement will be on-policy or off-policy. However, evaluation and improvement are two distinct steps. You could have one part be on-policy while the other is off-policy.
PPO is an example where the policy evaluation is on-policy while the improvement is off-policy. Although PPO does not use an experience replay buffer, its policy improvement requires an off-policy method for the same reason you need an off-policy method when using data from an experience replay buffer: it’s improving the actor policy using data from an earlier policy.
Evaluation vs improvement
First, let’s give some definitions to policy evaluation/improvement. These terms come from the steps of policy iteration, the foundation for many RL methods. In policy iteration, you repeat two steps until a policy stops improving.
- Evaluate $Q^\pi(s, a)$ for your current policy $\pi$ for all state-action pairs.
- Improve your policy $\pi$ by updating it to maximize $Q^\pi(s, \pi(s))$ for each state, e.g., $\pi(s) \leftarrow \arg\max_a Q^\pi(s, a)$.
In the first evaluation step, we evaluate the Q-function for the given policy. It is worth noting that we don’t have to explicitly model $Q^\pi$. There are other approaches we could take, such as explicitly modeling the state value function $V^\pi(s)$ and then implicitly deriving $Q^\pi$ from observed transitions. Alternatively, we could explicitly model $V^\pi$ and the environment transition dynamics $T(s' | s, a)$, from which we could derive $Q^\pi$.
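For the latter case, for example, the one-step Bellman relationship gives $Q^\pi$ from $V^\pi$ and the dynamics (writing $R(s, a, s')$ for the reward and $\gamma$ for the discount factor):

$$ Q^\pi(s, a) = \sum_{s'} T(s' | s, a)\left[ R(s, a, s') + \gamma V^\pi(s') \right]. $$

With sampled transitions instead of a dynamics model, the same relationship is estimated from observed $(s, a, r, s')$ tuples.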
Regardless of whether we explicitly or implicitly model $Q^\pi$, “evaluation” refers to estimating a value function for a policy. If you are having difficulty understanding the exact definition of value functions and the difference between $Q$ and $V$, you may want to look at my answer to this question.
The term “improvement” regards the second step: how you update your policy so that it better maximizes the value function. As you might expect, there are many different ways to improve your policy given a value function estimate.
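To make the two steps concrete, here is a minimal tabular policy-iteration sketch. The array shapes for `T` and `R`, the discount `gamma`, and the fixed number of evaluation sweeps are assumptions of mine for illustration:

```python
import numpy as np

def policy_iteration(T, R, gamma=0.99, eval_iters=200):
    """Tabular policy iteration.

    T[s, a, s2]: probability of moving to state s2 from state s under action a.
    R[s, a]:     expected immediate reward for taking action a in state s.
    """
    n_states, n_actions, _ = T.shape
    pi = np.zeros(n_states, dtype=int)  # arbitrary initial deterministic policy

    while True:
        # Evaluation: estimate V^pi by repeatedly applying the Bellman expectation backup.
        V = np.zeros(n_states)
        for _ in range(eval_iters):
            V = np.array([R[s, pi[s]] + gamma * T[s, pi[s]] @ V for s in range(n_states)])

        # Q^pi derived from V^pi and the dynamics (the "implicit" modeling mentioned above).
        Q = R + gamma * T @ V  # shape: (n_states, n_actions)

        # Improvement: act greedily with respect to Q^pi.
        new_pi = Q.argmax(axis=1)
        if np.array_equal(new_pi, pi):
            return pi, V  # policy stopped improving
        pi = new_pi
```

The inner loop is the evaluation step, and the greedy `argmax` over the derived $Q^\pi$ is the improvement step.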
Because evaluation and improvement are separate steps, you can use different methods and data to perform them. Let’s briefly review the core idea behind PPO to help explain how you might perform these steps differently.
PPO
PPO is roughly the following algorithm.
    Initialize state value function V
    Initialize actor policy pi
    Do forever:
        Collect k n-length trajectories
        For each trajectory i in 1..k:
            For each step j in 1..n:
                // compute returns
                R_ij = discounted sum of rewards from step j to n
                // compute advantages
                A_ij = R_ij - V(s_ij)
        For M SGD steps:
            Update V(s_ij) toward R_ij
            Update policy pi using PPO CLIP
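For reference, the return and advantage computed in the inner loops above are (with discount factor $\gamma$, which the pseudocode leaves implicit):

$$ R_{ij} = \sum_{t=j}^{n} \gamma^{\,t-j} r_{it}, \qquad A_{ij} = R_{ij} - V(s_{ij}). $$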
Here, the PPO CLIP objective is defined as
$$ L(s, a, \theta_\text{old}, \theta) = \min\left(\frac{\pi_\theta(a | s)}{\pi_{\theta_\text{old}}(a | s)}A(s, a), \text{clip}\left(\frac{\pi_\theta(a | s)}{\pi_{\theta_\text{old}}(a | s)}, 1-\epsilon, 1+\epsilon \right)A(s, a) \right), $$
where $\pi_{\theta_\text{old}}$ is the behavior policy: the policy we used to collect the last $k$ trajectories before doing any updates, and $\pi_\theta$ is the current version of the actor.
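To make the objective concrete, here is a minimal sketch of how the clipped term is usually computed for a batch of samples (plain NumPy, with names of my own choosing; real implementations work with log-probabilities produced by the actor network):

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, advantages, epsilon=0.2):
    """Clipped surrogate objective, averaged over a batch of (s, a) samples.

    logp_new:   log pi_theta(a | s) under the current actor.
    logp_old:   log pi_theta_old(a | s) under the behavior policy (held fixed).
    advantages: A(s, a) estimates computed from the behavior-policy data.
    """
    ratio = np.exp(logp_new - logp_old)                        # importance sampling ratio
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - epsilon, 1 + epsilon) * advantages
    # Take the pessimistic (minimum) of the two terms, as in the definition above.
    return np.mean(np.minimum(unclipped, clipped))
```

In practice you take gradient ascent steps on this quantity (or descent on its negation) while holding `logp_old` and the advantage estimates fixed.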
PPO is an interesting case because its evaluation method is on-policy, while its policy improvement is off-policy. If you look at the above algorithm, we are updating V (over multiple steps of SGD) toward value targets of the behavior policy we used to collect the data. Although we are simultaneously improving the actor, the value function is not evaluating the actor; it is evaluating the behavior policy. The method it uses to evaluate the behavior policy is an on-policy Monte-Carlo method[^1] that requires the data to be on-policy.
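Concretely, “updating V toward the returns” is typically a squared-error regression on those Monte-Carlo targets (writing $\phi$ for the value function’s parameters; the squared-error choice is the usual one, though the pseudocode above doesn’t pin it down):

$$ L_V(\phi) = \frac{1}{kn}\sum_{i=1}^{k}\sum_{j=1}^{n} \left( V_\phi(s_{ij}) - R_{ij} \right)^2, $$

and those targets $R_{ij}$ come from trajectories generated by the behavior policy, which is exactly what makes this step on-policy evaluation of that policy.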
Simultaneously, the policy improvement of the actor is performed over multiple steps of SGD. On the first SGD step, the behavior policy and actor match, resulting in an on-policy update. However, after that first step, the actor and behavior policy are different. Because we are still using the behavior policy data to improve the actor, we need an off-policy policy improvement method.
PPO’s policy improvement accounts for the off-policy data in two ways. First, it uses an importance sampling ratio to correct for the difference in distributions. That’s the $\frac{\pi_\theta(a | s)}{\pi_{\theta_\text{old}}(a | s)}$ term. Second, it clips the updates once the actor drifts too far from the behavior policy, ensuring the data it has is close enough to provide good estimates of the true policy objective. If you don’t understand how the importance sampling ratio corrects for off-policy data, see my discussion about it here.
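If the importance sampling ratio feels mysterious, here is a tiny self-contained check (toy distributions I made up purely for illustration) that reweighting behavior-policy samples by $\frac{\pi_\theta(a | s)}{\pi_{\theta_\text{old}}(a | s)}$ recovers an expectation under the current actor:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two policies over 3 discrete actions (toy numbers, just for illustration).
pi_old = np.array([0.5, 0.3, 0.2])   # behavior policy that generated the data
pi_new = np.array([0.2, 0.3, 0.5])   # current actor we want estimates for
f = np.array([1.0, 2.0, 3.0])        # some per-action quantity (e.g., an advantage)

# Sample actions from the OLD policy only.
actions = rng.choice(3, size=100_000, p=pi_old)

naive = f[actions].mean()                                           # estimates E_old[f]
weighted = (pi_new[actions] / pi_old[actions] * f[actions]).mean()  # estimates E_new[f]

print(naive, weighted, (pi_new * f).sum())
```

The unweighted average comes out near the behavior policy’s expectation (about 1.7 with these numbers), while the reweighted average matches the current actor’s expectation (about 2.3).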
So, although many RL methods are either off-policy or on-policy for both evaluation and improvement, this need not be the case. PPO is an example where the evaluation is on-policy (it evaluates the behavior policy), while the improvement step is off-policy (it improves the actor which diverges from the behavior policy).
Is on-policy or off-policy better?
There is no clear answer to whether on-policy or off-policy is better. All else being equal, off-policy is the preferable setting. It allows the agent to learn from a wider range of data sources, while on-policy methods are more wasteful and require fresh data every time the policy is updated. However, at the moment, on-policy methods tend to be more stable than off-policy methods. So if gathering data from your policy is cheap, you might prefer to use an on-policy method. If it’s expensive, you might prefer an off-policy method.
Summary
Off-policy and on-policy are properties inherent to methods for policy evaluation and policy improvement. On-policy methods require the data being used to be generated from the policy being evaluated or improved. Off-policy methods can use data from distributions other than the policy being evaluated or improved. While most methods are either on-policy or off-policy for both their evaluation and improvement steps, this does not have to be the case. PPO is an example that uses on-policy evaluation of the behavior policy and off-policy improvement of the actor model.
Whenever you use data generated by an older policy than the one you want to evaluate or improve, you need an off-policy method, because the two policies are different. This is the case in PPO’s policy improvement, which uses data from the older behavior policy. Likewise, you will need an off-policy method when you use an experience replay buffer that contains data from older versions of the policy.
[^1]: It’s more typically a TD($\lambda$) flavor of algorithm that interpolates between Monte-Carlo returns and TD.