This site gathers my answers to reinforcement learning questions that I have encountered on social media and in other interactions.

Each answer begins with a concise explanation, accessible to those with relevant background knowledge. Of course, if you’re here, you may not have all of the relevant background knowledge. This site is written for those of you who do not. Following the concise explanation, I derive the answer from first principles.

Below you will find a list of the most recently published Q&As and a preview of the answers. You can find the complete list in the posts section. You can learn more about me and why I created this site in the about section.

Recent posts

Why does experience replay require off-policy learning and how is it different from on-policy learning?

When you use an experience replay buffer, you save the most recent $k$ experiences of the agent and sample data from that buffer for training. Typically, the agent does a step of training to update its policy for every step it takes in the environment. At any moment in time, the vast majority of experiences in the buffer were generated by a different, earlier policy than the current one. And if the policy used to collect data is different from the policy being evaluated or improved, then you need an off-policy method.
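
To make the setup concrete, here is a minimal replay buffer sketch (the names and capacity are illustrative, not taken from any particular library):

```python
import random
from collections import deque

class ReplayBuffer:
    """Holds the k most recent transitions; older ones are discarded."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Sampled transitions were collected by many earlier policies,
        # which is exactly why the learning update must be off-policy.
        return random.sample(self.buffer, batch_size)
```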

What is the "horizon" in reinforcement learning?

In reinforcement learning, an agent receives reward on each time step. The goal, loosely speaking, is to maximize the future reward received. But that doesn’t fully define the goal, because each decision can affect what reward the agent can receive the future. Consequently, we’re left with the question “how does potential future reward affect our decision right now?” The “horizon” refers to how far into the future the agent will optimize its reward. You can have finite-horizon objectives, or even infinite-horizon objectives.
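
In standard return notation, the contrast looks like this: a finite-horizon objective sums reward over the next $H$ steps, while an infinite-horizon objective typically discounts future reward by a factor $\gamma \in [0, 1)$ so the sum stays well defined:

$$ G_t = \sum_{k=0}^{H-1} R_{t+k+1} \qquad \text{versus} \qquad G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}. $$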

Why doesn't Q-learning work with continuous actions?

Q-learning requires finding the action with the maximum Q-value in two places: (1) In the learning update itself; and (2) when extracting the policy from the learned Q-values. When there are a small number of discrete actions, you can simply enumerate the Q-values for each and pick the action with the highest value. However, this approach does not work with continuous actions, because there are an infinite number of actions to evaluate!
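
A tiny sketch of the discrete case (the Q-values here are made up for illustration) shows why enumeration works there and has no continuous analogue:

```python
import numpy as np

# Hypothetical Q-values for one state with four discrete actions.
q_values = np.array([1.2, 0.7, 2.5, -0.3])

best_action = int(np.argmax(q_values))  # enumerate all actions and pick the best
max_q = q_values[best_action]           # the max used inside the Q-learning target

# With a continuous action a (say, a torque in [-1, 1]^n), there is no finite
# list to enumerate: computing max_a Q(s, a) is itself an optimization problem.
```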

Why is the DDPG gradient the product of the Q-function gradient and policy gradient?

The DDPG paper, and the DPG paper before it, express the gradient of the objective $J(\pi)$ as the product of the policy and Q-function gradients:

$$ \nabla_\theta J(\pi) = E_{s \sim \rho^\pi} \left[\nabla_\theta \pi_\theta(s) \nabla_a Q(s, a) \rvert_{a \triangleq \pi_\theta(s)} \right]. $$

This expression looks a little scary, but it’s conveying a straightforward concept: the gradient is the average of the Q-function’s gradient with respect to the policy parameters, evaluated at the policy’s selected action. That may not be obvious, because the product of “gradients” (spoiler: there is some notation abuse) is the result of applying the multivariable chain rule of differentiation. If we were to reverse this step, the expression inside the expectation would simplify to the more explicit $\nabla_\theta Q(s, \pi_\theta(s))$.
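
For reference, the chain-rule step in question looks like this, with $\nabla_\theta \pi_\theta(s)$ playing the role of a Jacobian rather than an ordinary gradient (that is the notation abuse):

$$ \nabla_\theta Q(s, \pi_\theta(s)) = \nabla_\theta \pi_\theta(s) \, \nabla_a Q(s, a) \rvert_{a \triangleq \pi_\theta(s)}. $$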

If Q-learning is off-policy, why doesn't it require importance sampling?

In off-policy learning, we evaluate the value function for a policy other than the one we are following in the environment. This difference creates a mismatch in state-action distributions. To account for it, some actor-critic methods use importance sampling. However, Q-learning does not. There is a simple reason for that: in Q-learning, we only use samples to tell us about the effect of actions on the environment, not to estimate how good the policy’s action selection is. Let’s make that more concrete with a simple example and re-derive both the Q-learning and importance-sampling approaches.
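
As a rough sketch of that point (assuming a tabular Q stored as a dict of per-action dicts), the Q-learning target bootstraps from the greedy next action, regardless of which action the behavior policy actually took next:

```python
def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One tabular Q-learning step.

    The target uses the max over next actions (the greedy, target-policy
    choice), not the behavior policy's next action, so no importance-sampling
    correction appears anywhere in the update.
    """
    td_target = r + gamma * max(Q[s_next].values())
    Q[s][a] += alpha * (td_target - Q[s][a])
```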

What is the difference between V(s) and Q(s,a)?

The state value function $V(s)$ expresses how well the agent expects to do when it acts normally. $Q(s, a)$ is a counterfactual function that expresses how well the agent expects to do if it first takes a particular action, possibly one it would not normally choose, and then acts normally afterward.
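
In standard notation, both are expected returns $G_t$ under the policy $\pi$; the only difference is whether the first action is fixed:

$$ V^\pi(s) = E_\pi\left[ G_t \mid S_t = s \right], \qquad Q^\pi(s, a) = E_\pi\left[ G_t \mid S_t = s, A_t = a \right], $$

and the two are tied together by $V^\pi(s) = \sum_a \pi(a \mid s)\, Q^\pi(s, a)$.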

Why does the policy gradient include a log probability term?

Actually, it doesn’t! What you’re probably thinking of is the REINFORCE estimate of the policy gradient. How that familiar estimate is derived, and why we use it, is something I found to be poorly explained in the literature. Fortunately, it is not a hard concept to learn!
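
As a preview of that derivation, the log term shows up through the likelihood-ratio identity $\nabla_\theta \pi_\theta = \pi_\theta \nabla_\theta \log \pi_\theta$, which turns the gradient of an expectation into an expectation of something we can sample (here $R(\tau)$ is the total return of a trajectory $\tau$):

$$ \nabla_\theta E_{\tau \sim \pi_\theta}\left[ R(\tau) \right] = E_{\tau \sim \pi_\theta}\left[ R(\tau) \sum_t \nabla_\theta \log \pi_\theta(A_t \mid S_t) \right]. $$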

What is the difference between model-based and model-free RL?

In reinforcement learning, the agent is not assumed to know how the environment will be affected by its actions. Model-based and model-free reinforcement learning tackle this problem in different ways. In model-based reinforcement learning, the agent learns a model of how the environment is affected by its actions and uses this model to determine how to act. In model-free reinforcement learning, the agent learns how to act without ever learning to precisely predict how the environment will be affected by its actions.
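
As a deliberately tiny illustration (a deterministic toy model and a tabular Q-function, not any particular algorithm), the structural difference is whether the agent learns and queries a model of the environment, or updates its values directly from experience:

```python
def model_based_update(model, s, a, r, s_next):
    """Model-based: remember how the environment responded to (s, a)."""
    model[(s, a)] = (r, s_next)

def plan_greedily(model, s, actions, value, gamma=0.99):
    """Model-based: choose an action by one-step lookahead through the model."""
    def predicted_return(a):
        if (s, a) not in model:
            return float("-inf")          # untried action: the model knows nothing yet
        r, s_next = model[(s, a)]
        return r + gamma * value(s_next)  # plan with the predicted reward and next state
    return max(actions, key=predicted_return)

def model_free_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    """Model-free: improve Q(s, a) directly; the environment is never predicted."""
    target = r + gamma * max(Q.get((s_next, b), 0.0) for b in actions)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))
```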
