This site gathers my responses to reinforcement learning questions from various social media platforms and interactions.
Each answer begins with a concise explanation, accessible to those with the relevant background knowledge. Of course, if you’re here, you may not have that background. This site is written for those of you who do not. Following the concise explanation, I derive the answer from first principles.
Below you will find a list of the most recently published Q&As and a preview of the answers. You can find the complete list in the posts section. You can learn more about me and why I created this site in the about section.
Recent posts
Why does experience replay require off-policy learning and how is it different from on-policy learning?
When you use an experience replay buffer, you save the agent’s most recent $k$ experiences and sample data from that buffer for training. Typically, the agent does one step of training to update its policy for every step it takes in the environment. At any moment in time, the vast majority of experiences in the buffer were generated with a different – earlier – policy than the current policy. And if the policy used to collect data is different from the policy being evaluated or improved, then you need an off-policy method.
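As a rough sketch of those mechanics (a minimal, illustrative buffer; the class name, uniform sampling, and transition tuple layout are assumptions, not taken from any particular library):

```python
import random
from collections import deque

# A minimal replay buffer: keep the most recent k transitions and sample
# uniformly from them for training. By the time a transition is sampled,
# the policy that generated it has usually changed, which is why the
# learning update must be off-policy.
class ReplayBuffer:
    def __init__(self, k):
        self.buffer = deque(maxlen=k)  # oldest experiences are evicted automatically

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)
```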
What is the "horizon" in reinforcement learning?
In reinforcement learning, an agent receives a reward on each time step. The goal, loosely speaking, is to maximize the future reward received. But that doesn’t fully define the goal, because each decision can affect what reward the agent can receive in the future. Consequently, we’re left with the question “how does potential future reward affect our decision right now?” The “horizon” refers to how far into the future the agent will optimize its reward. You can have finite-horizon objectives, or even infinite-horizon objectives.
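In standard notation (a sketch; the horizon $H$ and discount factor $\gamma$ are not part of the excerpt above), the two cases of the return from time $t$ look like

$$ G_t^{(H)} = \sum_{k=0}^{H-1} R_{t+k+1}, \qquad G_t^{(\infty)} = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}, \quad 0 \le \gamma < 1, $$

where $H$ is a finite horizon and the discount $\gamma$ keeps the infinite-horizon sum well defined.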
Why doesn't Q-learning work with continuous actions?
Q-learning requires finding the action with the maximum Q-value in two places: (1) in the learning update itself; and (2) when extracting the policy from the learned Q-values. When there are only a small number of discrete actions, you can simply enumerate the Q-values for each and pick the action with the highest value. However, this approach does not work with continuous actions, because there are infinitely many actions to evaluate!
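A tabular sketch of both maximizations (the table shapes, learning rate, and discount are illustrative assumptions):

```python
import numpy as np

n_states, n_actions = 10, 4
Q = np.zeros((n_states, n_actions))  # tabular Q-values
alpha, gamma = 0.1, 0.99

def q_learning_update(s, a, r, s_next):
    # (1) the max over actions inside the learning target
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])

def greedy_action(s):
    # (2) the argmax over actions when extracting the policy
    return int(np.argmax(Q[s]))
```

With continuous actions there is no finite row of Q-values to take `np.max` or `np.argmax` over; each maximization becomes an optimization problem over the action space.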
Why is the DDPG gradient the product of the Q-function gradient and policy gradient?
The DDPG paper, and the DPG paper before it, express the gradient of the objective $J(\pi)$ as the product of the policy and Q-function gradients:
$$ \nabla_\theta J(\pi) = E_{s \sim \rho^\pi} \left[\nabla_\theta \pi_\theta(s) \nabla_a Q(s, a) \rvert_{a \triangleq \pi_\theta(s)} \right]. $$
This expression looks a little scary, but it’s conveying a straightforward concept: the gradient is the average of the Q-function’s gradient with respect to the policy parameters, evaluated at the policy’s selected action. That may not be obvious, because the product of “gradients” (spoiler: there is some notation abuse) is the result of applying the multivariable chain rule of differentiation. If we were to reverse this step, the term inside the expectation would simplify to the more explicit expression $\nabla_\theta Q(s, \pi_\theta(s))$.
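To make the chain-rule step explicit (a sketch, treating the Q-function’s own parameters as fixed while differentiating with respect to $\theta$):

$$ \nabla_\theta Q(s, \pi_\theta(s)) = \nabla_\theta \pi_\theta(s) \, \nabla_a Q(s, a) \rvert_{a \triangleq \pi_\theta(s)}, $$

so the expression above is just $\nabla_\theta J(\pi) = E_{s \sim \rho^\pi}\left[\nabla_\theta Q(s, \pi_\theta(s))\right]$ written with the chain rule applied.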
If Q-learning is off-policy, why doesn't it require importance sampling?
In off-policy learning, we evaluate the value function for a policy other than the one we are following in the environment. This difference creates a mismatch in state-action distributions. To account for it, some actor-critic methods use importance sampling. However, Q-learning does not. There is a simple reason for that: in Q-learning, we only use samples to tell us about the effect of actions on the environment, not to estimate how good the policy’s action selection is. Let’s make that more concrete with a simple example and re-derive the Q-learning and importance sampling approaches.
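For reference (standard tabular form, not specific to the example in the post), the Q-learning update

$$ Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a') - Q(S_t, A_t) \right] $$

contains no importance weight: the $\max_{a'}$ evaluates the target policy’s choice directly from the learned Q-values. An importance-sampling correction would instead weight the sampled quantity by the ratio $\pi(A_t \mid S_t) / b(A_t \mid S_t)$ between the target policy $\pi$ and the behavior policy $b$.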
What is the difference between V(s) and Q(s,a)?
The state value function $V(s)$ expresses how well the agent expects to do when it acts normally. The action value function $Q(s, a)$ is a counterfactual function that expresses how well the agent expects to do if it first takes some potentially alternative action before acting normally.
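In standard discounted notation (a sketch; the policy $\pi$ and discount $\gamma$ are implicit in “acts normally” above):

$$ V^\pi(s) = E_\pi\!\left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s \right], \qquad Q^\pi(s, a) = E_\pi\!\left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s, A_t = a \right], $$

and the two are related by $V^\pi(s) = E_{a \sim \pi(\cdot \mid s)}\left[ Q^\pi(s, a) \right]$.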
Why does the policy gradient include a log probability term?
Actually, it doesn’t! What you’re probably thinking of is the REINFORCE estimate of the policy gradient. How we derive the REINFORCE estimate you’re familiar with, and why we use it, is something I found to be poorly explained in the literature. Fortunately, it is not a hard concept to learn!
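For reference, the REINFORCE estimate in question has the familiar log-probability form (standard statement, with $G_t$ the return following time $t$):

$$ \nabla_\theta J(\pi_\theta) = E_{\tau \sim \pi_\theta}\!\left[ \sum_t \nabla_\theta \log \pi_\theta(A_t \mid S_t) \, G_t \right]. $$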
What is the difference between model-based and model-free RL?
In reinforcement learning, the agent is not assumed to know how the environment will be affected by its actions. Model-based and model-free reinforcement learning tackle this problem in different ways. In model-based reinforcement learning, the agent learns a model of how the environment is affected by its actions and uses this model to determine how to act. In model-free reinforcement learning, the agent learns how to act without ever learning to precisely predict how the environment will be affected by its actions.
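A tiny contrast between the two, with hypothetical names (the `model` callable, value table $V$, and Q-table are assumptions for illustration, not a specific algorithm):

```python
import numpy as np

# Model-based: use a learned transition/reward model to look one step ahead.
def model_based_action(s, model, V, actions, gamma=0.99):
    # model(s, a) -> (predicted_next_state, predicted_reward)
    def lookahead(a):
        s_next, r = model(s, a)
        return r + gamma * V[s_next]
    return max(actions, key=lookahead)

# Model-free: act from learned Q-values directly, never predicting next states.
def model_free_action(s, Q):
    return int(np.argmax(Q[s]))
```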