Deep Q Networks (DQN) are an effective deep learning method for solving many problems. DQN is not, however, foolproof and can be quite difficult to train effectively or can simply fail to learn. The authors of this paper discuss the three main reasons that DQN can fail—the so-called Deadly Triad—and why DQN can sometimes work anyway.

The components of the “Deadly Triad” are:

- Function Approximation
- Bootstrapping
- Off-policy Learning

In this blog post, we define and discuss each of these and some ways in which they can be mitigated.

Why is this important? Everything we know suggests that DQN should not work, and if it does, we got lucky with our setup. But DQN does work, at least some of the time, and more reliably than one might expect. This suggests that we don’t yet fully understand what is going on in DQN. In the remainder of this blog post, we cover what the components of the deadly triad are, the problems they can cause, and some of the ways we can sidestep their undesirable effects.

## 1. Function Approxmiation

Function approximation in Reinforcement Learning gives us the ability to better handle large state and action spaces by *approximating* the function we are trying to learn (e.g. a Q-function) rather than learning the function exactly. First, we cover the basics of how RL can work without function approximation, then look at how function approximation is applied when the state and/or action spaces become too large to sample effectively using exact techniques, such as value iteration or policy iteration.

The Q-function $Q(s, a)$—that is, the state-action value function—is a mapping of states and actions to a value in the set of all real numbers:

\begin{align}

Q: S \times A \rightarrow \mathcal{R}

\label{q_mapping}

\end{align}

Recall that value iteration evaluates the utility of a state as the value of the maximum Q-value at that state, over all the actions the agent can take in that state, as we see in Equation \ref{value_iter}.

\begin{align}

V(s) = \max_a Q(s, a)

\label{value_iter}

\end{align}

When we look at methods that use the Q-function, such as value iteration, we form a table for $V$ (see left of Fig. 1) and for the Q-values (see right of Fig. 1) and iteratively update those tables as our estimates of the utility of each state and the expected value of each state-action pair improve during training.

Under the tabular (i.e. exact) setting, we can, given a sufficient number of iterations, learn the optimal Q-function, value function, or policy.

Usually, however, our state spaces are far too large for tabular Q-learning to be remotely practical. It simply takes far too long to cover the entire table thoroughly enough to reach convergence to the optimal policy, or anything even close. Even high-performance computers will lack the RAM required to store a large Q-table in memory. Rather than wait years for our learner to visit every state-action pair in the Q-table enough to allow the values to converge, we instead try to form an estimate of what this table would tell us using function approximation.

This function approximation was originally performed by effectively placing multi-dimensional Gaussians in our $Q(s, a)$ table to represent the action choices in the state space with probability distributions. These Gaussians interpolate a desired output value (e.g. a q-value) between inputs (e.g. state-action pairs). The number of Gaussian kernels $\mathcal{k}$ that we need to adequately cover all of the states and actions could be far smaller than the size of the $Q$-table, $|S| \times |A|$.

Modern approaches to function approximation employ neural networks. The neural network learns a surface, such as that shown in Figure 2, that covers what the Q-table did and estimates the utility of taking a specific action in a specific state, just as the Q-table does. Fitting a surface over the state $\times$ action space instead of a table has an additional benefit: when the agent finds itself in a state it has not encountered before, it can still interpolate an estimate of the utility of its current state using the function approximator (i.e. the neural network). An important point to remember, though, is that function approximation, despite this benefit, is almost never going to *perfectly* learn the Q-function. We can apply gradient descent to get closer to the true value function, but our function approximator will

almost never be perfectly accurate.

Thus, using a function approximation is not *true* interpolation between the known-value points, but it is something close to it.

Function approximation is also used in policy gradient methods; neural networks can be used to represent the policy, as illustrated in Figure 2, and the transition function, as illustrated in Figure 3.

In a previous entry, we introduced the Bellman Residual (Equation \ref{bellman-res}). If we are parameterizing $Q$ by some $\theta$ such that $Q_{\theta}: S \times A \longrightarrow \mathcal{R}$, then we can use the Bellman residual as the loss function we use to train $Q_{\theta}$.

\begin{align}

L(\theta) = \mathbb{E} \left[ \left((r_t + \gamma \max_a Q_{\theta}(s_{t+1}, a)) – Q_{\theta}(s_t, a_t)\right)^2 \right]
\label{bellman-res}

\end{align}

Equation \ref{bellman-res} aims to minimize the error between the points that the agent has visited and estimated Q for and our predictions of what those points should be.

What is the problem with this? When we apply gradient descent, we assume that $(r_t + \gamma \max_a Q_{\theta}(s_{t+1}, a))$ from Equation \ref{bellman-res} is a constant with respect to $\theta$. Thus, when we take the gradient of $L(\theta)$, the only thing we apply the derivative to is $Q_{\theta}(s_{t+1}, a)$. We will see why this creates challenges for us in the next section.

## 2. Bootstrapping

With respect to the Deadly Triad, we define bootstrapping as taking information we already have and trying to do more with it than we should be able to do. Let’s look at the instability this can cause and the ways practitioners have come up with to mitigate these instabilities.

When we looked at policy gradients and REINFORCE, we learned that the update for the policy parameters $\phi$ is defined by Equation \ref{policy-grad-eqn}, where $i$ is the designator of the trajectory from which the states and actions are drawn (for the case of doing multiple policy rollouts), and $t$ is the timestep.

\begin{align}

\Delta \phi = A_{it} \nabla_{\phi} \log \pi_{\phi}(s_{it}, a_{it})

\label{policy-grad-eqn}

\end{align}

We compute $A_{it}$, the cumulative reward term, as shown in Equation \ref{a-i-t}.

\begin{align}

A_{it} = \sum_{t’=t}^{T_i} \left(\gamma^{t’-t} r_{t’} \right)

\label{a-i-t}

\end{align}

How is this applied? When computing $\Delta \phi$ and $A_{it}$, we look at each timestep in the trajectory from timestep $t$ until the end, identify what the reward is, and back up the (discounted) reward across prior timesteps. This means that a reward applied many timesteps into the future still affects this cumulative reward term, $A_{it}$. In a scenario that gives sparse rewards, backing up rewards to influence the

evaluations of earlier timesteps in this manner is particularly helpful, as it helps the agent learn the steps it took along the way to getting a reward at some specific timestep. This gives the learner information from later in the trajectory instead of it needing to rely on a Q-function to tell it whether it is going in a (positively) rewarding direction.

Notably, an alternative way to compute the gradient of $\phi$ is Equation \ref{alt-phi-grad}.

\begin{align}

\Delta \phi = Q_{\phi}(s, a) \nabla_{\phi} \log \pi_{\phi}(s_{it}, a_{it})

\label{alt-phi-grad}

\end{align}

When we are training $Q_{\phi}(s, a)$ with the Bellman residual and holding $(r_t + \gamma \max_a Q_{\phi}(s_{t+1}, a))$ constant, we encounter what we call the “chasing one’s tail” or “runaway” effect, which we discuss in the blog post about DQN. Specifically, in Equation \ref{alt-phi-grad}, we are using the Q-function as its own target by regressing $Q(s, a)$ towards the target, $R(s, a) + \gamma \max_{a’} Q(s’, a’)$, which itself contains $Q$!.

That is what bootstrapping means in the context of the Deadly Triad— using the thing that you are trying to achieve as the very thing you are using to achieve it.

What does *not* bootstrapping look like in Q-learning? Instead of computing $L(\theta)$ as we did in Equation \ref{bellman-res}, we compute the loss by looking $n$ steps into the future, as we demonstrate in Equation \ref{n-step-lookahead}.

\begin{align}

\hspace{-3.75em}L(\theta) = \mathbb{E} \left[ \left(\sum_{t’ = t}^{t + n – 1} \left(r_t + \gamma^{t’ – t} r_{t’}\right) +

\gamma^{t + n} \max_a Q(s_{t+n}, a) – Q_{\theta}(s_t, a_t) \right)^2 \right]
\label{n-step-lookahead}

\end{align}

We select $n$ such that the $n$th step will be beyond the final timestep of the episode, which simplifies Equation \ref{n-step-lookahead} to Equation \ref{n-step-lookahead-simp}.

\begin{align}

L(\theta) = \mathbb{E} \left[ \left(\sum_{t’ = t}^{t + n – 1} \left(r_t + \gamma^{t’ – t} r_{t’}\right) – Q_{\theta}(s_t, a_t) \right)^2 \right]
\label{n-step-lookahead-simp}

\end{align}

Q-learning is known for having an overestimation problem; the $\max$ operation that determines which action to select in a given state biases the estimation of Q-values of states to the high side, sometimes to the point of drowning out the reward signal. Sometimes this problem can be fixed by hyperparameter tuning, but not always. So how do we keep the bias caused by bootstrapping at bay?

### How to fix bootstrapping?

A common tactic to alleviate the bias caused by bootstrapping is by applying Target Q-learning; that is, we have two Q-networks, and calculate the Bellman residual as shown in Equation \ref{loss-fcn-targ}.

\begin{align}

L(\theta) = \mathbb{E} \left[ \left((r_t + \gamma \max_{a’} Q_{\theta’}(s_{t+1}, a’)) – Q_{\theta}(s_t, a_t) \right)^2 \right]
\label{loss-fcn-targ}

\end{align}

The difference between Equations \ref{bellman-res} and \ref{loss-fcn-targ} is that Equation \ref{loss-fcn-targ} uses the target Q-network to compute the maximizing action for time $t+1$ rather than the same Q-network as in the latter part of the loss function equation. We update the target Q-network as the main Q-network trains, as in Equation

\ref{target-update}, but it lags behind; $\alpha$ is a small scalar. This lagging target network update can help to stabilize the runaway chasing-one’s-tail effect.

\begin{align}

\phi \leftarrow \phi \left(1 – \alpha\right) + \alpha \theta

\label{target-update}

\end{align}

Another approach to ameliorating the problems of bootstrapping is Inverse Double Q-learning. The $\max_{a’}$ term in the Bellman residual equation (Equation \ref{bellman-res}) is replaced with $Q_{\theta}(s_{t+1}, \underset{a’}{\operatorname{argmax}}~ Q_{\phi}(s_{t+1}, a’))$, where $Q_{\theta}$ is for the “real” network and $Q_{\phi}$ is the target network. Equation \ref{inv-double-q} is the resulting loss function for Inverse Double Q-learning.

\begin{align}

\hspace{-3.25em}L(\theta) = \mathbb{E} \left[ \left((r_t + \gamma Q_{\theta}(s_{t+1}, \underset{a’}{\operatorname{argmax}}~ Q_{\phi}(s_{t+1}, a’))) – Q_{\theta}(s_t, a_t)\right)^2 \right]
\label{inv-double-q}

\end{align}

In this case, we still use the learner network to estimate the Q-value, but rather than picking the action that is maximizing in state $s_{t+1}$ according to the learner network, we use the target network to evaluate the maximizing action.

Yet another approach to ameliorate bootstrapping, Double Deep Q-Networks (DDQN), replaces the $\max$ term in the Bellman residual with $Q_{\phi}(s_{t+1}, \underset{a’}{\operatorname{argmax}}~ Q_{\theta}(s_{t+1}, a’))$, a term that uses the target network (subscript $\phi$) to evaluate the value of a state-action pair and uses the learner network (subscript $\theta$) to evaluate which action is the maximizing action in that state. The resulting loss function for DDQN is given in Equation \ref{ddqn-targ}.

\begin{align}

\hspace{-3.25em}L(\theta) = \mathbb{E} \left[ \left((r_t + \gamma Q_{\phi}(s_{t+1}, \underset{a’}{\operatorname{argmax}}~ Q_{\theta}(s_{t+1}, a’))) – Q_{\theta}(s_t, a_t)\right)^2 \right]
\label{ddqn-targ}

\end{align}

The advantage of DDQN’s approach is that we use a target network, but the action selection is decoupled from between the target and learner networks. In general, DDQN is generally accepted as the current “best” method for ameliorating the effects of bootstrapping. An additional variant on this approach is to use two target networks, evaluate which action to select using both target networks, and choosing the action that gives the lower Q-value.

In general, all of these methods of mitigating the chasing-one’s-tail problem of bootstrapping involve using a target network and decoupling the action selection from the evaluation of the state-action pair’s value in calculating the Bellman residual.

## 3. Off-policy Learning

In the context of the Deadly Triad, we primarily use the term “Off-policy learning” to refer to RL setups that utilize experience replay during training. Specifically, we utilize data generated using an older version of our Q-function under an $\epsilon$-greedy sampling strategy to improve our current estimate of the Q-function when we operate under it with $\epsilon = 0$. For more detail on on- and off-policy learning, please see the blog post on the subject. Looking at that blog post and what it says about Deep Deterministic Policy Gradients (DDPG), we know that off-policy learning says that $Q_{\theta}(s, a)$ is not a critic of $\pi’

(\cdot | s)$, an arbitrary policy network. Instead, it is a critic of $\pi^*(s) = \underset{a}{\operatorname{argmax}}~Q(s,a)$. The process by which we generate the state-action pairs to train our Q-function can be different from what the critic is evaluating; that is, we decouple sampling/exploration from what the critic is judging. This decoupling is generally considered preferable for sample-efficiency reasons, among others. Off-policy learning, however, is difficult to do well in the context of experience replay.

### What is Experience Replay?

In experience replay, we have a database, $D = \{ \langle s_0, a_0, r_0, s_0′ \rangle, \langle s_1, a_1, r_1, s_1′ \rangle, \dots \}$, in which we collect sample experiences from which we train our policy network. These databases, often called replay buffers, are key components of many RL frameworks.

Even though we train from a replay buffer of past experiences, we can modulate the degree to which we train our network off-policy. Q-learning is not technically off-policy, as, when taking the expectation of the Bellman residual, the states from which we compute the expectation must be drawn from somewhere. The distribution from which the states are drawn in computing that expectation *should* be a uniform distribution over all states, as in Value Iteration, but instead, we draw the state-action pairs from the replay buffer, whih is biased towards some state-action pairs over others based upon our sampling history. The replay buffer is skewed towards the policy used to generate the experiences

within it.

In the paper that inspired this blog post, van Hasselt, et al. explore prioritizing the experiences in their agent’s replay buffer by setting the probability of sampling a specific point $p_i$ from the replay buffer to be proportional to the error—the Bellman residual. That is, they rank the samples in the replay buffer according to how wrong the function approximator is when evaluating them and then train the Q-function to prioritize selecting points that are ranked as being more incorrect so that the learner can fix its evaluation of the Q-value estimate of those experiences. While this approach seems reasonable—a human who is very unskilled at a specific task who practices that task is sure to learn to improve their performance of that task, even if only a little—it actually increases the likelihood of the Q-network *diverging* when compared to training the Q-network on uniformly-sampled experiences. The takeaway here is that, while prioritizing specific examples in a replay buffer is an interesting area of research, if we just want our learner to learn, sampling non-uniformly from our replay buffer could make things worse. An additional divergence risk is the size of the neural network itself. In general, for DQN, the risk of the function approximator diverging increases as the network size increases. Conversely, for DDQN, the risk of the network diverging is agnostic of network size.

## Conclusions

DQN and its variants are frequently used because, not only are they widely applicable, but they work in a lot of scenarios. The Deadly Triad—function approximation, bootstrapping, and off-policy learning— can all cause complications in the application of DQN, however; in this blog post, we covered these challenges and common approaches to handling them. The main takeaways from this blog post are:

- Function approximation does not have the same nice guarantees that tabular learning has, but it is far more flexible when the state and/or action spaces are large.
- To ameliorate the instability bootstrapping can cause, consider either using a multi-step update in the Bellman residual or using DDQN.
- When using a replay buffer, sample from the replay buffer uniformly—don’t try to prioritize any experiences over others

Many thanks to Ruisen “Eric” Liu for editing this blog post and providing suggestions for how to improve it.

## References

- Mnih, Volodymyr, et al.
*“Playing Atari With Deep Reinforcement Learning.”*arXiv preprint arXiv:1312.5602 (2013). - van Hasselt, Hado, et al.
*Deep Reinforcement Learning and the Deadly Triad.*arXiv preprint arXiv:1812.02648 (2018). - R. S. Sutton and A. G. Barto.
*Reinforcement Learning: An Introduction*, 2nd ed. The MIT Press, Cambridge MA, 2018. - Lillicrap, Timothy P., et al.
*Continuous Control With Deep Reinforcement Learning*International Conference on Learning Representations, Sep. 2016.