
CORE Robotics Lab



Bootcamp Summer 2020 Week 7 – DQN And The Deadly Triad, Or, Why DQN Shouldn’t Work But Still Does

July 9, 2022 by Laura Strickland and Matthew Gombolay

Deep Q Networks (DQN) are an effective deep learning method for solving many problems. DQN is not, however, foolproof: it can be quite difficult to train effectively or can simply fail to learn. The authors of the paper that inspired this post, van Hasselt et al. [2], discuss the three main reasons that DQN can fail—the so-called Deadly Triad—and why DQN can sometimes work anyway.

The components of the “Deadly Triad” are:

  1. Function Approximation
  2. Bootstrapping
  3. Off-policy Learning

In this blog post, we define and discuss each of these and some ways in which they can be mitigated.

Why is this important? Everything we know suggests that DQN should not work, and if it does, we got lucky with our setup. But DQN does work, at least some of the time, and more reliably than one might expect. This suggests that we don’t yet fully understand what is going on in DQN. In the remainder of this blog post, we cover what the components of the deadly triad are, the problems they can cause, and some of the ways we can sidestep their undesirable effects.

1. Function Approximation

Function approximation in Reinforcement Learning gives us the ability to better handle large state and action spaces by approximating the function we are trying to learn (e.g. a Q-function) rather than learning the function exactly. First, we cover the basics of how RL can work without function approximation, then look at how function approximation is applied when the state and/or action spaces become too large to sample effectively using exact techniques, such as value iteration or policy iteration.

The Q-function $Q(s, a)$—that is, the state-action value function—is a mapping of states and actions to a value in the set of all real numbers:

\begin{align}
Q: S \times A \rightarrow \mathcal{R}
\label{q_mapping}
\end{align}

Recall that value iteration evaluates the utility of a state as the maximum Q-value at that state, taken over all the actions the agent can take in that state, as we see in Equation \ref{value_iter}.

\begin{align}
V(s) = \max_a Q(s, a)
\label{value_iter}
\end{align}

When we look at methods that use the Q-function, such as value iteration, we form a table for $V$ (see left of Fig. 1) and for the Q-values (see right of Fig. 1) and iteratively update those tables as our estimates of the utility of each state and the expected value of each state-action pair improve during training.

Figure 1. Left: a table showing the utility of specific states in a simple discrete (or discretized) state space. Right: a table showing the Q-value of specific state-action pairs for discrete (or discretized) state and action spaces. (Values are for illustrative purposes only.) The V(s) values in the left table correspond to the Q-values of the maximizing actions, per Equation \ref{value_iter}.

Under the tabular (i.e. exact) setting, we can, given a sufficient number of iterations, learn the optimal Q-function, value function, or policy.

Usually, however, our state spaces are far too large for tabular Q-learning to be remotely practical. It simply takes far too long to cover the entire table thoroughly enough to reach convergence to the optimal policy, or anything even close. Even high-performance computers will lack the RAM required to store a large Q-table in memory. Rather than wait years for our learner to visit every state-action pair in the Q-table enough to allow the values to converge, we instead try to form an estimate of what this table would tell us using function approximation.

Function approximation was originally performed by effectively placing multi-dimensional Gaussian kernels over our $Q(s, a)$ table to represent the action choices in the state space. These Gaussians interpolate a desired output value (e.g., a Q-value) between inputs (e.g., state-action pairs). The number of Gaussian kernels $k$ that we need to adequately cover all of the states and actions can be far smaller than the size of the $Q$-table, $|S| \times |A|$.

Modern approaches to function approximation employ neural networks. The neural network learns a surface, such as that shown in Figure 2, that covers the same state-action space the Q-table did and estimates the utility of taking a specific action in a specific state, just as the Q-table does. Fitting a surface over the state $\times$ action space instead of a table has an additional benefit: when the agent finds itself in a state it has not encountered before, it can still interpolate an estimate of the utility of that state using the function approximator (i.e., the neural network). An important point to remember, though, is that function approximation, despite this benefit, will almost never learn the Q-function perfectly. We can apply gradient descent to get closer to the true value function, but our function approximator will almost never be perfectly accurate.
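To make the idea of a learned Q-surface concrete, below is a minimal sketch, assuming PyTorch, of a neural network that maps a state vector to one Q-value per discrete action. The class and argument names (`QNetwork`, `state_dim`, `num_actions`, `hidden`) are illustrative assumptions, not the architecture from the original DQN paper.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a state vector to a vector of Q-values, one per discrete action."""
    def __init__(self, state_dim: int, num_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions),  # one output per action: Q(s, a) for every a
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

# Unlike a Q-table, the network can produce an estimate for a state it has never visited.
q = QNetwork(state_dim=4, num_actions=2)
print(q(torch.randn(1, 4)))  # shape (1, 2): an (untrained) Q-value estimate per action
```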

Figure 2. An example of a surface learned by a neural network trained with DQN.

Thus, function approximation does not perform true interpolation between the known-value points, but it is something close to it.

Function approximation is also used in policy gradient methods; neural networks can be used to represent the policy, as illustrated in Figure 3, and the transition function, as illustrated in Figure 4.

Figure 3. Representation of the policy network in policy-gradient setups: the value $\pi(a | s)$ for a state-action pair is output by the actor network.
Figure 4. Function approximation can also represent the state transition function, i.e., the probability of starting in state $s$, taking action $a$, and arriving in state $s'$, in policy-gradient learning scenarios.

In a previous entry, we introduced the Bellman Residual (Equation \ref{bellman-res}). If we are parameterizing $Q$ by some $\theta$ such that $Q_{\theta}: S \times A \longrightarrow \mathcal{R}$, then we can use the Bellman residual as the loss function we use to train $Q_{\theta}$.

\begin{align}
L(\theta) = \mathbb{E} \left[ \left(\left(r_t + \gamma \max_a Q_{\theta}(s_{t+1}, a)\right) - Q_{\theta}(s_t, a_t)\right)^2 \right] \label{bellman-res}
\end{align}

Equation \ref{bellman-res} aims to minimize the error between the Q-values our network currently predicts for the state-action pairs the agent has visited and the targets that those Q-values should match.

What is the problem with this? When we apply gradient descent, we assume that $(r_t + \gamma \max_a Q_{\theta}(s_{t+1}, a))$ from Equation \ref{bellman-res} is a constant with respect to $\theta$. Thus, when we take the gradient of $L(\theta)$, the only term we apply the derivative to is $Q_{\theta}(s_t, a_t)$. We will see why this creates challenges for us in the next section.
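As a sketch of how this "constant target" assumption shows up in code, the snippet below, assuming PyTorch and a hypothetical batch layout `(s, a, r, s_next, done)`, wraps the target term in `torch.no_grad()` so that the gradient only flows through $Q_{\theta}(s_t, a_t)$.

```python
import torch

def bellman_residual_loss(q_net, batch, gamma=0.99):
    """Squared Bellman residual; the target r + gamma * max_a Q(s', a) is held constant."""
    s, a, r, s_next, done = batch  # tensors with a leading batch dimension
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)       # Q_theta(s_t, a_t)
    with torch.no_grad():                                      # no gradient through the target
        target = r + gamma * (1.0 - done) * q_net(s_next).max(dim=1).values
    return ((target - q_sa) ** 2).mean()
```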

2. Bootstrapping

With respect to the Deadly Triad, we define bootstrapping as taking information we already have and trying to do more with it than we should be able to do. Let’s look at the instability this can cause and the ways practitioners have come up with to mitigate these instabilities.

When we looked at policy gradients and REINFORCE, we learned that the update for the policy parameters $\phi$ is defined by Equation \ref{policy-grad-eqn}, where $i$ is the designator of the trajectory from which the states and actions are drawn (for the case of doing multiple policy rollouts), and $t$ is the timestep.

\begin{align}
\Delta \phi = A_{it} \nabla_{\phi} \log \pi_{\phi}(s_{it}, a_{it})
\label{policy-grad-eqn}
\end{align}

We compute $A_{it}$, the cumulative reward term, as shown in Equation \ref{a-i-t}.

\begin{align}
A_{it} = \sum_{t'=t}^{T_i} \left(\gamma^{t'-t} r_{t'} \right)
\label{a-i-t}
\end{align}

How is this applied? When computing $\Delta \phi$ and $A_{it}$, we look at each timestep in the trajectory from timestep $t$ until the end, identify the reward, and back up the (discounted) reward across prior timesteps. This means that a reward received many timesteps into the future still affects the cumulative reward term, $A_{it}$. In a scenario with sparse rewards, backing up rewards to influence the evaluations of earlier timesteps in this manner is particularly helpful, as it helps the agent learn which steps it took along the way contributed to receiving a reward at some later timestep. This gives the learner information from later in the trajectory instead of it needing to rely on a Q-function to tell it whether it is going in a (positively) rewarding direction.
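A minimal sketch of computing $A_{it}$ for one trajectory is shown below; the function name and the plain-Python list of rewards are assumptions for illustration.

```python
def returns_to_go(rewards, gamma=0.99):
    """A_it = sum over t' >= t of gamma^(t' - t) * r_t', computed for every timestep t."""
    A = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):   # back up rewards from the end of the trajectory
        running = rewards[t] + gamma * running
        A[t] = running
    return A

print(returns_to_go([0.0, 0.0, 1.0], gamma=0.9))  # [0.81, 0.9, 1.0]
```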

Notably, an alternative way to compute the gradient of $\phi$ is Equation \ref{alt-phi-grad}.

\begin{align}
\Delta \phi = Q_{\theta}(s_{it}, a_{it}) \nabla_{\phi} \log \pi_{\phi}(s_{it}, a_{it})
\label{alt-phi-grad}
\end{align}

When we are training $Q_{\theta}(s, a)$ with the Bellman residual and holding $(r_t + \gamma \max_a Q_{\theta}(s_{t+1}, a))$ constant, we encounter what we call the “chasing one’s tail” or “runaway” effect, which we discuss in the blog post about DQN. Specifically, in Equation \ref{alt-phi-grad}, we are using the Q-function as its own target: we regress $Q(s, a)$ towards the target, $R(s, a) + \gamma \max_{a'} Q(s', a')$, which itself contains $Q$!
That is what bootstrapping means in the context of the Deadly Triad: using the thing that you are trying to learn as the very thing you use to learn it.

What does not bootstrapping look like in Q-learning? Instead of computing $L(\theta)$ as we did in Equation \ref{bellman-res}, we compute the loss by looking $n$ steps into the future, as we demonstrate in Equation \ref{n-step-lookahead}.

\begin{align}
L(\theta) = \mathbb{E} \left[ \left(\sum_{t' = t}^{t + n - 1} \gamma^{t' - t} r_{t'} + \gamma^{n} \max_a Q(s_{t+n}, a) - Q_{\theta}(s_t, a_t) \right)^2 \right] \label{n-step-lookahead}
\end{align}

We select $n$ such that the $n$th step will be beyond the final timestep of the episode, which simplifies Equation \ref{n-step-lookahead} to Equation \ref{n-step-lookahead-simp}.

\begin{align}
L(\theta) = \mathbb{E} \left[ \left(\sum_{t' = t}^{t + n - 1} \gamma^{t' - t} r_{t'} - Q_{\theta}(s_t, a_t) \right)^2 \right] \label{n-step-lookahead-simp}
\end{align}
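A sketch of the $n$-step target inside Equation \ref{n-step-lookahead} is below, assuming we have already collected the next $n$ rewards along the trajectory; if the episode ends before step $t+n$, the bootstrapped term is dropped, as in Equation \ref{n-step-lookahead-simp}. The function name and argument layout are illustrative assumptions.

```python
def n_step_target(rewards, q_next, n, gamma=0.99):
    """rewards: [r_t, ..., r_{t+n-1}] (shorter if the episode ended early);
    q_next: max_a Q(s_{t+n}, a), or None if s_{t+n} was never reached."""
    target = sum(gamma ** k * rewards[k] for k in range(min(n, len(rewards))))
    if q_next is not None and len(rewards) >= n:
        target += gamma ** n * q_next   # bootstrap only if the episode did not end first
    return target

print(n_step_target([1.0, 0.0, 2.0], q_next=None, n=3, gamma=0.9))  # 1.0 + 0.81 * 2.0 = 2.62
```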

Q-learning is known for having an overestimation problem; the $\max$ operation that determines which action to select in a given state biases the estimation of Q-values of states to the high side, sometimes to the point of drowning out the reward signal. Sometimes this problem can be fixed by hyperparameter tuning, but not always. So how do we keep the bias caused by bootstrapping at bay?

How to fix bootstrapping?

A common tactic to alleviate the bias caused by bootstrapping is to apply target Q-learning; that is, we maintain two Q-networks, a learner network $Q_{\theta}$ and a target network $Q_{\phi}$, and calculate the Bellman residual as shown in Equation \ref{loss-fcn-targ}.

\begin{align}
L(\theta) = \mathbb{E} \left[ \left(\left(r_t + \gamma \max_{a'} Q_{\phi}(s_{t+1}, a')\right) - Q_{\theta}(s_t, a_t) \right)^2 \right] \label{loss-fcn-targ}
\end{align}

The difference between Equations \ref{bellman-res} and \ref{loss-fcn-targ} is that Equation \ref{loss-fcn-targ} uses the target Q-network, $Q_{\phi}$, to compute the value of the maximizing action at time $t+1$, rather than the same network that appears in the second term of the loss. We update the target Q-network as the main Q-network trains, as in Equation \ref{target-update}, but it lags behind; $\alpha$ is a small scalar. This lagging target-network update helps stabilize the runaway, chasing-one's-tail effect.

\begin{align}
\phi \leftarrow \phi \left(1 - \alpha\right) + \alpha \theta
\label{target-update}
\end{align}
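A minimal sketch of this soft (lagging) update, assuming both networks are PyTorch modules with identical architectures, is:

```python
import torch

def soft_update(target_net, learner_net, alpha=0.005):
    """phi <- (1 - alpha) * phi + alpha * theta, applied parameter-by-parameter."""
    with torch.no_grad():
        for phi, theta in zip(target_net.parameters(), learner_net.parameters()):
            phi.data.mul_(1.0 - alpha).add_(alpha * theta.data)
```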

Another approach to ameliorating the problems of bootstrapping is Inverse Double Q-learning. The $\max_{a'}$ term in the Bellman residual equation (Equation \ref{bellman-res}) is replaced with $Q_{\theta}(s_{t+1}, \underset{a'}{\operatorname{argmax}}~ Q_{\phi}(s_{t+1}, a'))$, where $Q_{\theta}$ is the learner ("real") network and $Q_{\phi}$ is the target network. Equation \ref{inv-double-q} is the resulting loss function for Inverse Double Q-learning.

\begin{align}
L(\theta) = \mathbb{E} \left[ \left(\left(r_t + \gamma Q_{\theta}(s_{t+1}, \underset{a'}{\operatorname{argmax}}~ Q_{\phi}(s_{t+1}, a'))\right) - Q_{\theta}(s_t, a_t)\right)^2 \right] \label{inv-double-q}
\end{align}

In this case, we still use the learner network to estimate the Q-value, but rather than picking the maximizing action in state $s_{t+1}$ according to the learner network, we use the target network to select the maximizing action.

Yet another approach to ameliorate bootstrapping, Double Deep Q-Networks (DDQN), replaces the $\max$ term in the Bellman residual with $Q_{\phi}(s_{t+1}, \underset{a'}{\operatorname{argmax}}~ Q_{\theta}(s_{t+1}, a'))$, a term that uses the target network (subscript $\phi$) to evaluate the value of a state-action pair and uses the learner network (subscript $\theta$) to evaluate which action is the maximizing action in that state. The resulting loss function for DDQN is given in Equation \ref{ddqn-targ}.

\begin{align}
L(\theta) = \mathbb{E} \left[ \left(\left(r_t + \gamma Q_{\phi}(s_{t+1}, \underset{a'}{\operatorname{argmax}}~ Q_{\theta}(s_{t+1}, a'))\right) - Q_{\theta}(s_t, a_t)\right)^2 \right] \label{ddqn-targ}
\end{align}

The advantage of DDQN's approach is that we still use a target network, but the action selection is decoupled between the target and learner networks. DDQN is generally accepted as the current "best" method for ameliorating the effects of bootstrapping. An additional variant on this approach is to use two target networks, evaluate which action to select using both, and choose the action that yields the lower Q-value.
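A sketch of the DDQN target computation, assuming PyTorch Q-networks `q_learner` ($Q_{\theta}$) and `q_target` ($Q_{\phi}$), makes the decoupling explicit: the learner selects the action, and the target network scores it.

```python
import torch

@torch.no_grad()
def ddqn_target(q_learner, q_target, r, s_next, done, gamma=0.99):
    """DDQN target: the learner picks the action, the target network evaluates it."""
    best_a = q_learner(s_next).argmax(dim=1, keepdim=True)   # argmax_a' Q_theta(s', a')
    q_eval = q_target(s_next).gather(1, best_a).squeeze(1)   # Q_phi(s', best_a)
    return r + gamma * (1.0 - done) * q_eval
```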

In general, all of these methods of mitigating the chasing-one’s-tail problem of bootstrapping involve using a target network and decoupling the action selection from the evaluation of the state-action pair’s value in calculating the Bellman residual.

3. Off-policy Learning

In the context of the Deadly Triad, we primarily use the term "off-policy learning" to refer to RL setups that utilize experience replay during training. Specifically, we utilize data generated by an older version of our Q-function under an $\epsilon$-greedy sampling strategy to improve our current estimate of the Q-function, which we ultimately operate under with $\epsilon = 0$. For more detail on on- and off-policy learning, please see the blog post on the subject. Looking at that blog post and what it says about Deep Deterministic Policy Gradients (DDPG), we know that, under off-policy learning, $Q_{\theta}(s, a)$ is not a critic of an arbitrary policy network $\pi'(\cdot | s)$; instead, it is a critic of $\pi^*(s) = \underset{a}{\operatorname{argmax}}~Q(s,a)$. The process by which we generate the state-action pairs used to train our Q-function can be different from what the critic is evaluating; that is, we decouple sampling/exploration from what the critic is judging. This decoupling is generally considered preferable for sample-efficiency reasons, among others. Off-policy learning, however, is difficult to do well in the context of experience replay.

What is Experience Replay?

In experience replay, we have a database, $D = \{ \langle s_0, a_0, r_0, s_0' \rangle, \langle s_1, a_1, r_1, s_1' \rangle, \dots \}$, in which we collect sample experiences from which we train our policy network. These databases, often called replay buffers, are key components of many RL frameworks.

Even though we train from a replay buffer of past experiences, we can modulate the degree to which we train our network off-policy. Q-learning is not, strictly speaking, fully off-policy: when taking the expectation in the Bellman residual, the states over which we compute that expectation must be drawn from some distribution. Ideally, that distribution would be uniform over all states, as in value iteration, but instead, we draw the state-action pairs from the replay buffer, which is biased towards some state-action pairs over others based upon our sampling history. The replay buffer is skewed towards the policy used to generate the experiences within it.
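A minimal replay-buffer sketch with uniform sampling (names and capacity are illustrative) looks like this:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of <s, a, r, s', done> tuples; oldest experiences are evicted first."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # Uniform sampling; prioritized variants reweight this step (discussed below).
        return random.sample(self.buffer, batch_size)
```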

In the paper that inspired this blog post, van Hasselt et al. explore prioritizing the experiences in their agent's replay buffer by setting the probability $p_i$ of sampling a specific experience $i$ from the replay buffer to be proportional to that experience's error—the Bellman residual. That is, they rank the samples in the replay buffer according to how wrong the function approximator is when evaluating them and then preferentially sample the points ranked as more incorrect so that the learner can fix its Q-value estimates for those experiences. While this approach seems reasonable—a human who is very unskilled at a specific task and practices that task is sure to improve, even if only a little—it actually increases the likelihood of the Q-network diverging when compared to training the Q-network on uniformly-sampled experiences. The takeaway here is that, while prioritizing specific examples in a replay buffer is an interesting area of research, if we just want our learner to learn, sampling non-uniformly from our replay buffer could make things worse. An additional divergence risk is the size of the neural network itself: in general, for DQN, the risk of the function approximator diverging increases as the network size increases, whereas for DDQN, the risk of divergence is largely agnostic to network size.

Conclusions

DQN and its variants are frequently used because they are widely applicable and they work in a lot of scenarios. The components of the Deadly Triad—function approximation, bootstrapping, and off-policy learning—can all cause complications in the application of DQN, however; in this blog post, we covered these challenges and common approaches to handling them. The main takeaways from this blog post are:

  1. Function approximation does not have the same nice guarantees that tabular learning has, but it is far more flexible when the state and/or action spaces are large.
  2. To ameliorate the instability bootstrapping can cause, consider either using a multi-step update in the Bellman residual or using DDQN.
  3. When using a replay buffer, sample from the replay buffer uniformly—don't try to prioritize any experiences over others.

Many thanks to Ruisen “Eric” Liu for editing this blog post and providing suggestions for how to improve it.

References

  1. Mnih, Volodymyr, et al. "Playing Atari with Deep Reinforcement Learning." arXiv preprint arXiv:1312.5602 (2013).
  2. van Hasselt, Hado, et al. "Deep Reinforcement Learning and the Deadly Triad." arXiv preprint arXiv:1812.02648 (2018).
  3. Sutton, Richard S., and Andrew G. Barto. Reinforcement Learning: An Introduction, 2nd ed. The MIT Press, Cambridge, MA, 2018.
  4. Lillicrap, Timothy P., et al. "Continuous Control with Deep Reinforcement Learning." International Conference on Learning Representations (2016).


Bootcamp Summer 2020 Week 4 – On-Policy vs Off-Policy Reinforcement Learning

February 28, 2022 by mschrum3 and Matthew Gombolay

 

Reinforcement learning algorithms can be divided into two main types: on-policy and off-policy. Understanding the differences between on-policy and off-policy learning is essential to understanding and implementing various reinforcement learning algorithms. Below we will discuss the key differences in on-policy and off-policy learning, the algorithms that employ these two different approaches, and the implications of these learning approaches.

  1. What is on-policy versus off-policy?
  2. Examples of on-policy and off-policy algorithms
  3. Implications
  4. Advantages and Disadvantages
  5. Takeaways

Break-down of the types of reinforcement learning algorithms.

 

1. What is on-policy versus off-policy?

On-policy and off-policy learning fall under the category of model-free reinforcement learning algorithms, meaning that we do not have access to the transition probability distribution. This is opposed to model-based methods such as Monte-Carlo Tree Search (MCTS).

To begin our discussion of on-policy and off-policy reinforcement learning, we will first introduce some terminology. In a reinforcement learning setting, an agent interacts with an environment, moving from state, $s$, to $s'$ via action, $a$, and transition probability, $T(s,a,s')$. The agent's policy, $\pi(s)$, dictates the action that an agent takes in state, $s$. The objective of reinforcement learning is to find the optimal policy, $\pi^*$, which maximizes future discounted expected reward. The Q-function, $Q(s,a)$, describes the value of a state-action pair. If we know the optimal Q-function, $Q^*$, which describes the maximum expected reward of any given state-action pair, then we can extract the optimal policy according to Eq. 2.

\begin{equation}
Q^*(s,a)=R(s,a)+\gamma \mathop{\mathbb{E}}_{s'}[V^*(s')]
\end{equation}

\begin{equation}
\pi^*(s)=argmax_{a}Q^{*}(s,a) \quad \forall s
\end{equation}

 

To understand the difference between on-policy learning and off-policy learning one must first understand the difference between the behavior policy (i.e., sampling policy) and the update policy.

Behavior Policy: The behavior policy is the policy an agent follows when choosing which action to take in the environment at each time step. In Q-learning, our behavior policy is typically an $\epsilon$-greedy strategy. In an $\epsilon$-greedy behavior policy, an agent chooses the action dictated by the current policy, $\pi(s)$, with probability 1-$\epsilon$ and a random action with probability $\epsilon$. The behavior policy generates actions and determines how the agent interacts with the environment.
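A sketch of an $\epsilon$-greedy behavior policy, assuming the Q-function is stored as a simple dictionary keyed by (state, action), might look like the following; the names are illustrative.

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """Behavior policy: explore with probability epsilon, otherwise act greedily."""
    if random.random() < epsilon:
        return random.choice(actions)                      # random exploratory action
    return max(actions, key=lambda a: Q[(state, a)])       # action dictated by the current policy
```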

Update Policy: The update policy is central to understanding the difference between on-policy and off-policy learning. The update policy is how the agent updates the Q-function. In other words, it dictates how the agent derives the state-action pairs which are used to calculate the difference between the actual Q-value and current predicted Q-value, also known as the TD-error. The TD-error is then used to update the Q-function.

This brings us to the key difference between on-policy and off-policy learning: On-policy algorithms attempt to improve upon the current behavior policy that is used to make decisions and therefore these algorithms learn the value of the policy carried out by the agent, $Q^{\pi}$. Off-policy algorithms learn the value of the optimal policy, $Q^*$, and can improve upon a policy that is different from the behavior policy. Determining whether the update and behavior policy are the same or different can give us insight into whether the algorithm is on-policy or off-policy. If the update policy and the behavior policy are the same, then this suggests, but does not guarantee, that the learning method is on-policy. If they are different, this suggests that the learning method is off-policy.

Let’s take a look at some examples.

2. Examples

Here we discuss several examples of on-policy versus off-policy algorithms and highlight the key differences between them. The important lines in each algorithm are shown in red.

Below is the pseudo-code for SARSA (State, Action, Reward, State, Action), an on-policy algorithm.

SARSA often utilizes an $\epsilon$-greedy behavior policy and follows the same policy when updating the Q-function, meaning that $a’$ is also selected based upon an $\epsilon$-greedy strategy as shown in Line 7. Thus, because SARSA learns the quality of the behavior policy rather than the quality of the optimal policy, SARSA is an on-policy algorithm.

Below is the pseudo-code for Q-learning.

The key difference between Q-learning and SARSA is the $\max_{a'} Q(s',a')$ term found in the Q-learning algorithm. This $\max$ term means that Q-learning is learning the value of the optimal policy. SARSA's Q-function models the Q-values of the behavior policy, whereas Q-learning learns the optimal Q-values. This distinction is what makes SARSA on-policy and Q-learning off-policy.
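A side-by-side sketch of the two tabular updates, assuming `Q` is a dictionary (e.g., a `defaultdict(float)`) keyed by (state, action), makes the distinction concrete: SARSA bootstraps from the action the behavior policy actually chose, while Q-learning bootstraps from the greedy action.

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """On-policy: a_next is the action the epsilon-greedy behavior policy actually took."""
    td_target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    """Off-policy: bootstrap from the greedy (max) action, regardless of what was taken."""
    td_target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])
```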

Let's take a look at another algorithm, Deep Deterministic Policy Gradient (DDPG). DDPG is an actor-critic method. This means that it concurrently learns both a policy, as represented by the actor, and the Q-function, as represented by the critic. DDPG also utilizes experience replay, in which it stores state-action-reward-next-state tuples in a buffer and samples these examples during training. We focus on the part of DDPG that is relevant to our on-policy versus off-policy discussion. Below is a subsection of the pseudocode. $\pi_{\theta}$ describes the policy the agent is learning, parameterized by $\theta$, and the Q-function is parameterized by $\phi$.

We see in Line 6 that DDPG utilizes a stochastic behavior policy defined by the actor $\pi$ plus exploration noise. What about the update policy? The update policy is shown in Line 11. Unlike in Q-learning, there is no $\max$ when selecting the next action in Line 11. Instead, we see that we are using the actor $\pi_{\theta}$ to select $a'$, which sounds like DDPG is optimizing its current policy and therefore is an on-policy algorithm. So does this mean that, because DDPG is updating the Q-function via $Q_{\phi}(s',\pi_{\theta}(s'))$ instead of $\max_{a'} Q_{\phi}(s',a')$, DDPG is on-policy? No! Based on our definition, an on-policy algorithm improves upon the current policy, whereas an off-policy algorithm learns the value of the optimal policy. With DDPG, as time goes to infinity, the actor selects the action which maximizes the Q-function, as shown in Line 12. Because, at convergence, $\pi_{\theta}$ is optimal and therefore $Q_{\phi}(s',\pi_{\theta}(s'))=\max_{a'} Q_{\phi}(s',a')$, DDPG is evaluating the optimal policy, making it an off-policy algorithm.
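A sketch of the critic target in this update, assuming PyTorch `actor` and `critic` modules (and omitting the lagging target networks that the full DDPG algorithm also maintains), is:

```python
import torch

@torch.no_grad()
def ddpg_critic_target(actor, critic, r, s_next, done, gamma=0.99):
    """r + gamma * Q_phi(s', pi_theta(s')): a' comes from the actor, not an explicit max."""
    a_next = actor(s_next)                      # pi_theta(s')
    q_next = critic(s_next, a_next).squeeze(1)  # Q_phi(s', pi_theta(s'))
    return r + gamma * (1.0 - done) * q_next
```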

To the right is a list of other common on-policy and off-policy algorithms. The same analysis of the behavior policy and the update policy can be used to verify the type of learning that each of these algorithms employs.

3. Implications

So what are the implications of these differences? In the end does the type of learning matter? When should we use which algorithm?

This figure depicts the cliff walking domain. The SARSA agent learns the safer path shown in blue and the Q-learning agent learns the optimal path shown in red [1].

Different Policy Outcomes: First of all, these different training methods can result in different behaviors due to the fact that on-policy methods learn $Q^\pi$ and off-policy methods learn $Q^*$. For example, in Q-learning because the update policy follows a greedy policy, it assumes that the agent will act optimally in the future and therefore, Q-learning will learn the optimal policy as long as all states and actions are sufficiently experienced. As shown in the above figure, this means that the cliff walking agent will learn to take the optimal path from the start to goal even though this path may be dangerous if the agent’s actions are stochastic.

SARSA, on the other hand, assumes that the agent will follow the current policy in the future. This means that when updating the Q-function, we assume that the cliff walking agent will at times jump over the cliff if it travels too close to the edge. Therefore, SARSA learns a policy which is safer and ensures that the agent will avoid the cliff. SARSA only learns a near-optimal policy that depends on the behavior strategy. For example, if the behavior strategy is $\epsilon$-greedy, then SARSA will learn an optimal $\epsilon$-greedy strategy. To more closely approximate the optimal policy, one can decay $\epsilon$ to decrease exploration over time. SARSA is guaranteed to converge to the optimal policy if the behavior policy converges to a greedy policy and all states and actions are visited an infinite number of times.

 

This figure shows the rewards received by an agent trained via SARSA and an agent trained via Q-learning in the cliff walking domain. In this domain, SARSA achieves higher rewards than Q-learning despite learning a sub-optimal policy [1].

Although Q-learning learns the optimal strategy, SARSA may have better online performance. As shown in the plot, SARSA obtains higher episodic rewards on average than Q-learning in the cliff domain due to the fact that the agent takes a safer route. However, this may not be the case for all domains.

When to employ which learning strategy: If, for example, one is utilizing reinforcement learning to train a physical robot then one may opt for a more conservative and safer learning scheme such as the on-policy SARSA. SARSA also tends to be more stable while training. However, if training in simulation, Q-learning may be more effective since it is more likely to learn the optimal policy and less likely to become trapped in a local minimum. Additionally, off-policy algorithms can take advantage of experience replay when the behavior policy is changing or even if the behavior policy is generated from a source other than the learning agent (e.g., a human demonstrator). On-policy algorithms can only utilize experience replay when the behavior policy is static. Experience replay allows us to utilize historical data when learning which can help de-correlate training examples and is useful when gathering experience is costly. Additionally, off-policy learning is agnostic to how the data is generated and can therefore utilize data generated from other sources. Consequently, off-policy learning may be preferable in situations in which the data must be generated in a way that does not follow the current policy, for example, via human demonstrations.

4. Advantages and Disadvantages

5. Takeaways

  1. To identify if a reinforcement learning algorithm is off-policy or on-policy, one must first identify the behavior policy and the update policy. If the algorithm attempts to improve the behavior policy, it is on-policy. If however, the algorithm learns the optimal policy, then it is off-policy.
  2. SARSA is an example of on-policy learning and Q-learning is an example of off-policy learning. Algorithms such as DDPG, which learn the quality of the optimal policy, are off-policy strategies.
  3. The strategies that result from on-policy learning versus off-policy learning can differ in a number of ways, including how safe the policy is, its convergence guarantees, and its online performance.

References
[1] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. The MIT Press, second edition, 2018.

 


AAMAS 2022 Tutorial – Graph Neural Networks: Learning Representations of Robot Team Coordination Problems

January 18, 2022 by Matthew Gombolay and Zheyuan Wang

Tutorial at the International Conference on Autonomous Agents and Multi-Agent Systems 2022

Overview

Robot teams are increasingly being deployed in environments, such as manufacturing facilities and warehouses, to save cost and improve productivity. To efficiently coordinate multi-robot teams, fast, high-quality scheduling algorithms are essential to satisfy the temporal and spatial constraints imposed by dynamic task specification and part and robot availability. Traditional solutions include exact methods, which are intractable for large-scale problems, or application-specific heuristics, which require expert domain knowledge. What is critically needed is an automated approach that learns lightweight, application-specific coordination policies without the need for hand-engineered features.

This tutorial is an introduction to graph neural networks and a showcase of their power in solving multi-robot coordination problems. We survey various frameworks of graph neural networks in recent literature, with a focus on their application in modeling multi-agent systems. We will introduce multi-robot coordination (MRC) problems and review the most relevant methods available for solving them. We will discuss several successful applications of graph neural networks in MRC problems, with hands-on tutorials in the form of Python example code. With this tutorial, we aim to provide hands-on experience employing graph neural networks to model multi-robot systems, from algorithm development to code implementation, thus opening up future opportunities for designing graph-based learning algorithms in broader multi-agent research.

Content

In this tutorial, we will discover the power of graph neural networks for learning effective representations of multi-robot team coordination problems. The tutorial will feature two 90-minute online sessions.

Date and Time: May 8, 11:00 – 14:00 (EDT)

  • 3:00 – 6:00, May 9, Auckland Time
  • 11:00 – 14:00, May 8, New York Time
  • 17:00 – 20:00, May 8, Paris Time
  • 20:30 – 23:30, May 8, Kolkata Time
  • 23:00 – 2:00 (+1 day) May 8, Beijing Time

Zoom link: https://gatech.zoom.us/j/93966169541 [Video recordings are posted below]

Session #1 by Matthew Gombolay (11:00 – 12:30)

The first session will address the following: (a) How graph neural networks work – we will present a comprehensive overview of various graph neural networks proposed in prior literature, including both homogeneous and heterogeneous graphs and attention mechanisms; (b) How to model team coordination problems with graph neural networks – we will discuss which applications can be suitably modeled with graph neural networks, with a focus on MRC problems; (c) How to optimize the parameters of graph neural networks for team coordination problems – we will discuss which learning methods can be used for training a graph neural network-based solver. We conclude this session with the most common challenges and open questions.

Tutorial notes: [Slides]

Video:

Session #2 by Zheyuan Wang (12:30 – 14:00)

The second session will provide a hands-on tutorial on how to get up and running with graph neural networks for coordination problems, with coding examples in a Python Jupyter notebook. In particular, we will look into the ScheduleNet architecture [6], a heterogeneous graph neural network-based solver for MRC problems under temporal and spatial constraints. The Jupyter notebook will work through the model implementation, training, and evaluation of ScheduleNet models on a synthetic dataset.

Python Jupyter notebook: [Open in Google Colab] [Github]

Video:

Target Audience

The target audience is intended to be students and researchers who have a strong grasp of machine learning but who may not yet be familiar with graph neural networks. Prior knowledge of Multi-agent Reinforcement Learning (MARL) or Planning & Scheduling would be helpful but is not necessary. While the tutorial showcases the application of graph neural networks in solving multi-robot coordination problems, the methodology introduced can be applied by the audience to broader research problems in learning representations of multi-agent systems.

Presenters

Dr. Matthew Gombolay is an Assistant Professor of Interactive Computing at the Georgia Institute of Technology. He received a B.S. in Mechanical Engineering from the Johns Hopkins University in 2011, an S.M. in Aeronautics and Astronautics from MIT in 2013, and a Ph.D. in Autonomous Systems from MIT in 2017. Gombolay’s research interests span robotics, AI/ML, human–robot interaction, and operations research. Between defending his dissertation and joining the faculty at Georgia Tech, Dr. Gombolay served as a technical staff member at MIT Lincoln Laboratory, transitioning his research to the U.S. Navy and earning him an R&D 100 Award. His publication record includes a best paper award from the American Institute of Aeronautics and Astronautics, a finalist for best student paper at the American Controls Conference, and a finalist for best paper at the Conference on Robot Learning. Dr. Gombolay was selected as a DARPA Riser in 2018, received 1st place for the Early Career Award from the National Fire Control Symposium, and was awarded a NASA Early Career Fellowship for increasing science autonomy in space.

Zheyuan Wang is a Ph.D. candidate in the School of Electrical and Computer Engineering (ECE) at the Georgia Institute of Technology. He received the B.S. degree and the M. E. degree, both in Electrical Engineering, from Shanghai Jiao Tong University. He also received the M.S. degree from ECE, Georgia Tech. He is currently a Graduate Research Assistant in the Cognitive Optimization and Relational (CORE) Robotics lab, led by Prof. Matthew Gombolay. His current research interests focus on graph-based policy learning that utilizes graph neural networks for representation learning and reinforcement learning for decision-making, with applications to human-robot team collaboration, multi-agent reinforcement learning and stochastic resource optimization.

Reading Materials

  1. Ernesto Nunes, Marie Manner, Hakim Mitiche, and Maria Gini. 2017. A taxonomy for task allocation problems with temporal and ordering constraints. Robotics and Autonomous Systems 90 (2017), 55–70.
  2. Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. 2018. Graph attention networks. International Conference on Learning Representations (2018).
  3. Xiao Wang, Houye Ji, Chuan Shi, Bai Wang, Yanfang Ye, Peng Cui, and Philip S Yu. 2019. Heterogeneous graph attention network. The World Wide Web Conference (2019), 2022–2032.
  4. Jie Zhou, Ganqu Cui, Shengding Hu, Zhengyan Zhang, Cheng Yang, Zhiyuan Liu, Lifeng Wang, Changcheng Li, and Maosong Sun. 2020. Graph neural networks: A review of methods and applications. AI Open 1 (2020), 57-81.
  5. Zheyuan Wang and Matthew Gombolay. 2020. Learning scheduling policies for multi-robot coordination with graph attention networks. IEEE Robotics and Automation Letters 5, 3 (2020), 4509–4516.
  6. Zheyuan Wang, Chen Liu, and Matthew Gombolay. 2021. Heterogeneous graph attention networks for scalable multi-robot scheduling with temporospatial constraints. Autonomous Robots (2021), 1–20.

 


Bootcamp Summer 2020 Week 3 – Value Iteration and Q-learning

January 19, 2021 by Matthew Gombolay and Pradyumna Tambwekar

Value iteration and Q-learning make up two fundamental algorithms of Reinforcement Learning (RL). Many of the amazing feats in RL over the past decade, such as Deep Q-Learning for Atari, or AlphaGo, were rooted in these foundations. In this blog, we will cover the underlying model RL uses to describe the world, i.e. a Markov decision process (MDP), as well as two algorithms for performing RL: value iteration and Q-learning. By the end of this blog post, you should be able to understand the connection between value iteration and Q-learning and how to employ either of these algorithms to learn the optimal policy.

This post is divided into three parts:

  1. Markov decision processes
  2. Value functions and Value iteration
  3. Q-learning and Deep Q-learning

Prerequisites:

The only prerequisites you will need to understand this blog are gradient descent and backpropagation. Both of these topics were covered in our first two blogposts (week 1 and week 2).

Markov Decision Processes

An MDP is a 5-tuple describing an environment consisting of the following five components:

  1. States: A state, $s$ is any predefined momentary instance in the world that an agent can exist in. For the rest of this post, we will use the variable $S$ to represent the set of all possible states in the world with $s \in S$ referring to an individual state.
  2. Actions: An action, $a$, is an event facilitated by the agent that can transition you from one state to another, provided that such a transition is possible under the MDP. We will use $A$ to represent the set of all possible actions in the world, with $a \in A$ referring to an individual action. We note that actions may not have deterministic consequences. For example, flipping a coin may not give you the same result each time! The degree to which actions have deterministic effects is described by the transition function.
  3. Transition function: The transition function is a function that defines the probability of moving to a specific next state, given your current state and a valid action. The transition function, $T$, is mathematically defined as follows, $ T: S \times A \times S’ \rightarrow [0,1]$.
  4. Reward: The reward function specifies a real number value that defines the efficacy or a measure of “goodness” for being in a state, taking an action and landing in the next state. Similarly to the transition function, the reward is defined as follows, $R: S \times A \times S’ \rightarrow \rm I\!R $.  Note that the state you end up in may be uncontrollable since state transitions can be dynamic.
  5. Discount Factor: The discount factor can be specified using $\gamma$, where $\gamma \in [0,1)$. Note the non-inclusive upper bound for the discount factor (i.e., $\gamma \neq 1$). Disallowing $\gamma = 1$ allows an MDP to be more mathematically robust. Specifically, the goal for RL algorithms is often to maximize discounted reward across time. Consider the case of an infinite horizon MDP (i.e., the MDP never ends) in which the rewards are always positive. If the discount factor, $\gamma$, is equal to 1, then the sum of future discounted rewards will be infinite, making it difficult for RL algorithms to converge (i.e., to know when they can stop determining which actions should be taken in each state). Thus, the choice of $\gamma$ is critical for the success of RL algorithms, e.g., Q-learning and value iteration.

You may have read about a concept known as Markov chains. While Markov chains will not be covered in this post, it is important to understand the distinction between Markov chains and Markov decision processes, as both concepts share the Markov property. Fundamentally, a Markov chain consists of the components of a Markov decision process except actions, rewards, and a discount factor. An agent in a Markov chain is not in control of its movements; the world is in control of the agent's movements. In a Markov decision process, the agent has an influence on the outcomes. Thus, the transition function of a Markov chain is simply defined as $T: S \times S' \rightarrow [0,1]$.

Markov Property: The Markov property holds when the next state depends only on the current state and current action and is independent of the history of states and actions prior to that.

Nota bene: In some MDPs, you may see the initial state distribution included as a part of the MDP. While that is still a valid representation, in this blog, we are going to remain agnostic of the initial state distribution.

Value functions and Value Iteration

The value function, $V(s)$, is a measure of the expected reward you can receive from any given state $s$ given an MDP and a policy describing which actions an agent takes in each state. Formally, a policy, $\pi: S \rightarrow [0,1]^{|A|}$, is a probability distribution over actions, $a \in A$, conditioned on a state, $s$. For the sake of this blog, we will consider a policy to be deterministic (i.e., $\pi(s,a) = 1$ for a single action, $a$, and $\pi(s,a’)=0$ for all other $a’ \neq a$). Having defined a policy, we can now mathematically define the value function for a policy: $$V^{\pi}(s) = E[\sum_{t=0}^{\infty} \gamma^t r_t | \pi, s_0 = s]$$ In a deterministic world, this expectation could be ignored; however, in general, we include the expectation as both the policy (e.g., an agent that doesn’t always like the same thing for breakfast) and the transition function (e.g., how bad traffic is) could describe nondeterministic attributes of the problem we want to model.

The goal of Value Iteration is to find a policy that maximizes the value function: $$\pi^*(s) = argmax_{\pi \in \Pi} V^{\pi}(s), \forall s \in S $$ In this equation $\Pi$ represents the set of all possible policies within the MDP, and $\pi^*(s)$ is the optimal policy. What the value function equation (i.e., $V^{\pi}(s) = E[\sum_{t=0}^{\infty} \gamma^t r_t | \pi, s_0 = s]$) states is that you are going to follow a policy, $\pi$, starting at state, $s$, for up to an infinite number of time steps, $t$, proceeding forward from $t=0$ to $t = \infty$ along a trajectory (i.e., a sequence of state-action pairs, $(\langle s_1,a_1 \rangle , \langle s_2,a_2 \rangle,\ldots)$), where the accumulated reward, $r_t$, is discounted via multiplication with $\gamma^t$. This process of discounting rewards means that, the further you go into the future, rewards have a diminishing impact on value or “goodness” of being in state, $s$. Adjusting $\gamma$ allows you to define your preferences towards optimizing for short-term vs long-term gain. The expectation matters because, as stated in the previous paragraph, transitions can be non-deterministic, thus the expectation provides a normalized estimate of the discounted reward.

Value iteration is a computational algorithm that provides a means of finding the optimal policy, $\pi^*$. The algorithm works by iteratively determining the value of being in each state, $s,$ assuming that the agent takes the best possible action in that state under the current estimate of the value function. The value iteration algorithm is shown in the figure below.

Value Iteration

This algorithm iteratively updates the value, $V(s)$, for each state, $s$, until it reaches a point where the change in value is negligible, as defined by the threshold $\Delta_\circ$. Take note of Line 7 in this algorithm, in which we consider a state, $s$, an action, $a$, and the resulting state, $s'$, in a sequence, $\langle s, a, s' \rangle$. Line 7 updates the value for the given state; this step is called the value update and is reminiscent of Bellman's optimality condition. This condition says $$ V(s) = \max_{a \in A} \mathbb{E} \left[R(s,a,s') +  \gamma V(s')\right] = \max_{a \in A} \sum_{s'\in S} T(s,a,s')\left[R(s,a,s') +  \gamma V(s')\right]$$

When this condition holds for a given state, $s$, we would find in Line 8 that $temp - V(s) = 0$. If the condition holds for all states, $s$, then the value function has converged to the optimal value function, $V^{*} = V^{\pi^*}$, and the optimal policy, $\pi^*$, can simply be extracted in Line 11. In a sense, value iteration works by iteratively pretending or enforcing (Line 7) that Bellman's optimality condition holds, measuring where the condition does not hold (Line 8), and terminating once the condition holds within an acceptable margin of error (Line 3).

Upon convergence (i.e., reaching an acceptable level of error, $\Delta < \Delta_0$), value iteration returns a value function, $V^{*}$, that provides an estimate of the future expected reward while following the optimal policy, $\pi^*$, as well as the optimal policy itself. To be perfectly clear, we should note that the actual value function and policy returned are only approximately optimal, and it would be more technically accurate to refer to them as $V \approx V^*$ and $\pi \approx \pi^*$.

Nota Bene: In some textbooks, the Bellman update equation may be defined differently, treating the reward function as independent of the next state, $s'$, i.e., $$V(s) \leftarrow  \max_a R(s,a) + \sum_{s'} T(s,a,s')\gamma V(s')$$ However, the update rule shown in the value iteration algorithm above is a more complete representation of the value estimate because you cannot always assume that, for any given MDP, your reward will not depend on the next state.
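Below is a minimal sketch of tabular value iteration, assuming the MDP is given as dictionaries `T[(s, a, s2)]` and `R[(s, a, s2)]` with an entry for every state-action-state triple; the names are illustrative. The optimal policy can then be extracted by taking the argmax over actions of the bracketed expression.

```python
def value_iteration(states, actions, T, R, gamma=0.99, delta0=1e-6):
    """Tabular value iteration with an in-place value update."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            temp = max(
                sum(T[(s, a, s2)] * (R[(s, a, s2)] + gamma * V[s2]) for s2 in states)
                for a in actions
            )
            delta = max(delta, abs(temp - V[s]))   # how far this state is from Bellman's condition
            V[s] = temp
        if delta < delta0:                         # converged to within the threshold
            return V
```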

Let us now define surrogate terms, the Q-value and the Q-function. Similar to the value function, the Q-function is a quality measure. However, unlike the value function, the Q-function measures the expected discounted reward for taking a particular action in a given state, i.e., $Q: S \times A \rightarrow \rm I\!R$. We refer to the Q-value as what the Q-function returns.

Although Q-learning is the go-to algorithm for estimating Q-values and consequently optimal policies from these Q-values, you may also modify the value iteration algorithm to solve for the Q-values.

Value Iteration (To compute Q-values)

There are two primary differences between the two algorithms resulting from the introduction of a Q-function. First, since the Q-function is dependent on both states and actions, we need to loop through all possible actions (Line 6) as well as the set of all states. Second, you'll notice that the update rule (Line 8) has changed upon incorporating the Q-function. Since the estimate for the optimal Q-value, $Q(s',a')$, for the next state, $s'$, depends on the next action, $a'$, we need to replace "$V(s')$" from Line 7 of Algorithm 1 with "$\max_{a'} Q(s',a')$" in Line 8 of Algorithm 2.

We note that Algorithms 1 and 2 are often referred to as flavors of "exact RL" because, unlike in function approximation algorithms commonly used in RL (i.e., deep learning-based RL), here, we are guaranteed to be able to solve for the true, optimal Q-values within a margin of error, $\Delta$, given sufficient computational resources. "Approximate" RL techniques (e.g., deep RL) do not typically have such guarantees.

An important question to answer is, then, “Why do we not do this in practice instead of Q-learning?” Unfortunately, most real-world applications of RL contain too many states to iterate over even once, let alone the multitude of times required for the algorithm to converge to an acceptable error threshold. Hence, function approximation algorithms, such as Deep Q-learning are preferred.

Q-Learning and Deep Q-learning

Prior to discussing Q-learning itself, let us define a term called the Bellman Residual.

$$ \delta(s,a,s') = [R(s,a,s') +  \gamma \max_{a'}Q(s',a')] - Q(s,a) $$

The Bellman residual, denoted by $\delta$, computes a measure of error for Q-learning that describes how wrong the current estimate of the Q-function is given a transition sequence, $\langle s,a,s' \rangle$. When the Q-function is optimal, we should find that the modified form of the Bellman optimality condition holds: $$Q(s,a) = E[R(s,a,s') + \gamma \max_{a'\in A} Q(s',a')] = \sum_{s'} T(s,a,s')[R(s,a,s') + \gamma \max_{a'\in A} Q(s',a')]$$

When this condition holds, we should find that $ \delta(s,a,s') = 0 $ for all possible transition sequences. However, when we first initialize a Q-function, as we do in Line 2 of Algorithm 3, our Q-function will almost certainly be wrong, and $ \delta(s,a,s')$ will not equal zero. So, how do we improve our Q-function to be correct?

The way Q-learning works is by "blending" our current estimate of the Q-function, $Q(s,a)$, with a "point estimate" of what the Q-function should be, i.e., $R(s,a,s') +  \gamma \max_{a'}Q(s',a')$. The degree of blending is controlled by a hyperparameter, $\alpha$, in the following equation: $$ Q(s,a) = (1 - \alpha) Q(s,a) + \alpha (R(s,a,s') +  \gamma \max_{a'}Q(s',a')) $$

If $\alpha = 1$, then we completely ignore our current estimate of the Q-function. If $\alpha=0$, then we completely ignore new information. Note that, with this parameterization, a large value such as $\alpha = 0.99$ puts most of the emphasis on the new point estimate, whereas a small value puts most of the emphasis on the current estimate of the Q-function. We encourage you to experiment with this parameter!
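A sketch of this blended update for a tabular Q-function stored as a dictionary (hypothetical names, with the action set passed in explicitly) is:

```python
def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    """Blend the current estimate of Q(s, a) with the new point estimate of the target."""
    point_estimate = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * point_estimate
```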

The full algorithm for Q-learning is shown in the algorithm pictured below.

Q-Learning Algorithm

It is important to note that, unlike in the version of the Bellman equation described earlier, Q-learning does not include a transition function in its Bellman update equation! In Q-learning, we instead simulate actions within the environment and utilize the states visited during the simulation (i.e., the trajectory) to apply the Bellman equation. The simulate function in the algorithm shows how trajectories are sampled in Q-learning. In each timestep of the trajectory, we use an epsilon-greedy sampling strategy to select an action. If a randomly sampled probability is greater than a predefined $\epsilon$ value, then we greedily sample the action; otherwise, we randomly sample an action from the space of all actions. Upon selection, we simulate that action and observe the next state and the reward received for that action. Finally, we store all of this observed information in the memory buffer, update the current state, and repeat this process until we reach a terminal state. The trajectory computed from each simulation is then used to update the Q-values via the Bellman update equation (Line 6 in Q-learning).

The absence of a transition function makes Q-learning a model-free RL algorithm, as it does not need any prior knowledge of "the world" to learn the optimal policy. This model-free characteristic is significant because, in practice, you rarely have access to the transition probabilities for all state-action pairs in real-world applications.

With enough time and sampled trajectories, Q-learning will be able to estimate the optimal Q-values for each state-action pair. However, as currently constructed, the Q-learning algorithm requires maintaining a lookup table over all possible state-action pairs and filling in those values. The size of the state-action spaces in real-world tasks makes maintaining a lookup table of Q-values infeasible. One possible solution to this scalability problem is the use of function approximation techniques, e.g., Deep Q-learning. The term "deep" comes from parameterizing Q-values using neural networks. Therefore, instead of learning a literal table of Q-values, our goal shifts towards learning the weights of a neural network that can output the Q-value for any given state-action pair. Employing the square of the Bellman residual as our loss function, we can apply the backpropagation algorithm (covered in Week 2) to learn the weights of the Q-network.
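As a rough sketch of one Deep Q-learning gradient step, assuming a PyTorch Q-network, an optimizer, and a mini-batch sampled from the memory buffer (all names are illustrative):

```python
import torch

def dqn_train_step(q_net, optimizer, batch, gamma=0.99):
    """One gradient step on the squared Bellman residual."""
    s, a, r, s_next, done = batch
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():  # treat the point estimate as a fixed regression target
        target = r + gamma * (1.0 - done) * q_net(s_next).max(dim=1).values
    loss = torch.nn.functional.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()        # backpropagation computes the gradients of the loss w.r.t. the weights
    optimizer.step()
    return loss.item()
```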

Deep Q-learning

When compared to Q-learning, there are three primary reasons why Deep Q-learning is better suited for real world applications,

  1. Not enough memory: Q-learning requires storing a lookup table of all states and actions, leading to an infeasible RAM requirement for most modern applications.
  2. Not enough time: Deep Q-learning is often significantly faster to train because of the ability to incorporate techniques like batched gradient descent and adaptive optimization. In addition to this, Deep Q-learning does not require convergence of the Q-value for each state action pair.
  3. Interpolation: Deep Q-learning is a function approximation algorithm. Because the Q-network approximates Q-values for states and actions, we can assume that the network can interpolate Q-values for similar state-action pairs, i.e., $ Q(s,a) \approx Q(s',a')$ if $(s,a) \approx (s',a')$. In practice, Deep Q-learning is more efficient than seeking convergence for each individual state-action pair. However, there are pathological cases where this may not be true, for example, when the slightest change in the state or action may result in drastically different outcomes.

While we have not yet covered enough to fully implement Deep Q-learning and understand the tips and tricks of practically implementing a Deep Q-learning model, please stay tuned for a future blog post on that topic!

Key Takeaways from this blog:

  1. A Markov Decision Process (MDP) is a tuple used to describe an environment and is made up of components used in reinforcement learning algorithms.
  2. The goal of most reinforcement learning algorithms is to maximize the cumulative expected reward, which is also known as the value function.
  3. Value iteration is an iterative algorithm that uses the Bellman equation to compute the optimal MDP policy and its value.
  4. Q-learning, and its deep-learning substitute Deep Q-learning, is a model-free RL algorithm that learns the optimal MDP policy using Q-values, which estimate the "value" of taking an action at a given state.


Bootcamp Summer 2020 Week 2 — Neural Networks and Backpropagation

January 18, 2021 by Rohan Paleja and Matthew Gombolay

Previously, we covered gradient descent, an optimization method that adjusts the values of a set of parameters (e.g., the parameters of a linear regression model) to minimize a loss function (e.g., the mean squared error of a linear regression model in predicting the value of a home given its zip code and other information). In this post, we introduce neural networks, which are essentially parametric function approximators that learn to map a set of inputs to a set of outputs.

While we could optimize a neural network with respect to a given loss function by directly applying gradient descent to adapt the network’s parameters to learn a desired mapping of inputs to outputs, this process would be inefficient. Due to the large number of parameters, performing symbolic differentiation as introduced in our gradient descent lesson would require a lot of redundant computation and slow down the optimization process tremendously. Fortunately, there is a better way: the backpropagation algorithm. Backpropagation is a form of auto-differentiation that allows us to more efficiently compute the derivatives of a neural network’s (or other model’s) outputs with respect to each of its parameters. Backpropagation is often overlooked or misunderstood as a simple application of symbolic differentiation (i.e., the chain rule from calculus), but it is much, much more.

This blog will cover

  1. Neural Networks
  2. Forward Propagation
  3. Backpropagation

We will not take the time to discuss the biological plausibility of neural networks, but feel free to check out this paper.

Neural Networks

We will start by understanding what a neural network is. At its core, a neural network is a function approximator. Functions from algebra, such as $y=x^2$ and $y=2x+1$, can be represented by neural networks. Neural networks can be used to learn a mapping given a set of data. For example, a neural network could be used to model the relationship of the age of a house, the number of bedrooms, and the square footage with the value of a house (displayed in Figure 1).

A sample relationship a neural network can be expected to learn.

While the example in Figure 1 may appear trivial, neural networks can solve much more challenging problems. For example, neural networks have been applied to take as input Magnetic Resonance Imaging (MRI) data and output diagnoses for the patient based upon what is present in the MRI. Neural networks have also had success in being applied to problems of “machine translation” in which one attempts to translate one language to another (e.g., English to Chinese).

Typically, neural networks are given a dataset $\mathcal{D}=\{(x_k,y_k),\ k\in [n]\}$, where $(x_k,y_k)$ is example number $k$, $x_k$ is the input to the model we want to train, and $y_k$ is the “label” we want our model to learn to predict from $x_k$. We generalize the notation to $x$ being the input data and $y$ being the set of ground-truth labels. A neural network can be represented as a function $\hat{y} = f(x)$ (i.e., $f: X \rightarrow Y$). Here, we use “$\hat{y}$” as the output of the neural network instead of “$y$,” as $\hat{y}$ is our estimate of the ground-truth label, $y$. Later, we will use the difference (i.e., error) between $\hat{y}$ and $y$ to optimize our neural network’s parameters.

If we look inside the function $f$, we see nodes and edges. These nodes and edges make up a directed graph, in which an input is operated on from left to right.

A more colorful inside look into a neural network. The output node in this example neural network uses a linear activation, denoted by the diagonal line (“/”).

 

A closer look inside a smaller neural network is shown in Figure 2, which displays how an input, $x$ (with cardinality $|x|=d$), is mapped to a scalar output, $\hat{y}$. Before we move into what is happening here, let’s first explain the notation.

We have neural network weights, denoted by $w$, that weight the incoming value through multiplication. The subscripts denote the (to, from) node indices, respectively, and the superscript denotes the layer number, resulting in the notation $w^{(layer)}_{to,from}$. The nodes with the number “1” inside are termed bias nodes and are represented in the notation $b_{to}$.

As an example, we present in Figure 3 a simple neural network describing $\hat{y} = 2x + 4z + 3$.

An example neural network representing $\hat{y}=2x + 4z +3$
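As a sanity check, the example network above can be written out directly: a single linear node with weights $2$ and $4$ on inputs $x$ and $z$ and a bias of $3$. The snippet below is a minimal sketch of that computation (the function name is our own illustrative choice).

import numpy as np

w = np.array([2.0, 4.0])   # weights on inputs x and z
b = 3.0                    # bias term

def y_hat(x, z):
    # Compute y_hat = 2x + 4z + 3 with a single linear node.
    return w @ np.array([x, z]) + b

print(y_hat(1.0, 1.0))   # prints 9.0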

When an input is passed into a neural network, the input $x$ is multiplied by the respective weights (denoted by $w$) and a bias term $b$ is added, resulting in a new value represented within a hidden node. In our model, we set the number of nodes in each layer to $n$. Each node, along with its incoming edges, represents some inner function. As our neural network can have many nodes and many layers, it can be thought of as a composition of these inner functions that produces the more complex, outer function $f$ described above. The diagonal line in the right-most node denotes a linear activation applied to the weighted sum, resulting in a scalar output. Now we will discuss precisely how an output $\hat{y}$ is produced from an input $x$.

Forward Propagation

The final output $\hat{y}$ is computed from the node values and weights of the previous layer, in our case, layer 2. We multiply the outputs of the hidden nodes in layer 2, denoted by $o^{(2)}_i$, where $i$ represents a node index from $1$ to $n$ (i.e., $i \in \{1,2,\ldots,n\}$), by the weights connecting these nodes to the output node and add the bias term $b^{(3)}_{1}$. We display this process in Equation 1.

\begin{align}
\hat{y} = \sum_{i=1}^n w_{1,i}^{(3)} o^{(2)}_i + b^{(3)}_{1}
\end{align}

But, what is $o_i^{(2)}$ equal to? We display how $o_i^{(2)}$ is computed in Equation 2.

\begin{align}
o_i^{(2)} = g(z_i^{(2)})
\end{align}

Here, $i$ is again the node index and $g$ is an activation function (i.e., $g: Z \rightarrow O$). Neural networks cannot represent nonlinear functions without utilizing nonlinear activation functions. Activation functions help neural networks represent more complex functions by transforming node values by a nonlinear function. Common activation functions used include ReLU, tanh, sigmoid, etc. Here, we will discuss and use the ReLU (Rectified Linear Unit) as our activation function.

For our example, we will use the ReLU activation function, as shown in Equation 3.

\begin{equation}g(z) =\bigg\{\begin{aligned} &z \text{ if }z \geq 0\\ &0 \text{ otherwise}\end{aligned}\bigg\}\end{equation}

It is also helpful to know the gradient of the ReLU function, displayed in Equation 4.

\begin{equation}g'(z) =\bigg\{\begin{aligned} &1 \text{ if }z \geq 0\\ &0 \text{ otherwise}\end{aligned}\bigg\}\end{equation}
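A quick numpy sketch of Equations 3 and 4 (the function names are our own) shows how the ReLU and its gradient act elementwise on a vector of node values:

import numpy as np

def relu(z):
    # g(z) = z if z >= 0, else 0 (Equation 3), applied elementwise.
    return np.maximum(z, 0.0)

def relu_grad(z):
    # g'(z) = 1 if z >= 0, else 0 (Equation 4), applied elementwise.
    return (z >= 0).astype(float)

z = np.array([-2.0, 0.0, 3.5])
print(relu(z))        # [0.  0.  3.5]
print(relu_grad(z))   # [0. 1. 1.]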

Moving back to Equation 2, we solve for $z_i^{(2)}$ in Equation 5.

\begin{align}
z^{(2)}_i = \sum_{j=1}^n w_{i,j}^{(2)} o^{(1)}_j + b^{(2)}_{i}
\end{align}

This procedure can be used throughout the neural network, computing the current layer using the previous layer’s output. For simplicity of algorithmic notation, you can consider the neural network’s input, $x$, as $o^{(0)}$. Here, $o^{(0)}$ represents the hypothetical output of a hypothetical zeroth layer. In Algorithm 1, we display the complete forward propagation algorithm, expressing the layer number as a variable for generalizability. This algorithm computes a predicted output, represented by $\hat{y}$, given input features, $x$.
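Below is a minimal numpy sketch of forward propagation in the spirit of Algorithm 1, assuming per-layer weight matrices and bias vectors, ReLU activations at the hidden layers, and a linear activation at the output; the helper name and shapes are our own illustrative choices.

import numpy as np

def forward(x, Ws, bs):
    # Forward propagation: Ws[l-1] and bs[l-1] hold the weights W^(l) and
    # biases b^(l) of layer l. Returns y_hat and every layer output o^(l),
    # with o[0] = x playing the role of the hypothetical zeroth layer.
    o = [x]
    for l, (W, b) in enumerate(zip(Ws, bs), start=1):
        z = W @ o[-1] + b                     # z^(l) = W^(l) o^(l-1) + b^(l)
        if l < len(Ws):
            o.append(np.maximum(z, 0.0))      # hidden layer: ReLU activation
        else:
            o.append(z)                       # output layer: linear activation
    return o[-1], o

# Example with d = 2 inputs, one hidden layer of n = 3 nodes, and a scalar output.
rng = np.random.default_rng(0)
Ws = [rng.normal(size=(3, 2)), rng.normal(size=(1, 3))]
bs = [np.zeros(3), np.zeros(1)]
y_hat, outputs = forward(np.array([1.0, -0.5]), Ws, bs)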

We can now analyze the computational complexity of this algorithm. For the sake of simplifying the analysis, we assume the number of layers in the neural network, $l$, is equal to the number of nodes in each layer, $n$, which we also set equal to the size of the input, $|x|=d$. Thus, $n=l=d$. In Algorithm 1, Steps 3 and 5 represent a matrix multiplication combined with addition. This process has a complexity of $O(n^2+n)$. As a reminder, for any matrix multiplication of two matrices of size $p \times q$ and $q \times r$, respectively, the computational complexity is $O(pqr)$. Thus, since we are multiplying weight matrices of size $n \times n$ by a vector of size $n \times 1$, the complexity of the matrix multiplication alone is $O(n \times n \times 1) = O(n^2)$. The addition operation adds complexity $O(n)$, resulting in $O(n^2+n)$ for Steps 2-6. Steps 8-10 have a complexity of $O(n)$, bringing the per-layer complexity to $O(n^2 + 2n)$. Since Step 1 iterates $l=n$ times, we arrive at a total complexity of $O(n(n^2 + 2n))=O(n^3)$ for Algorithm 1.

Now that we can perform a forward pass and know its complexity, we can discuss how to update our weights and attempt to learn a model that fits the data.

Backpropagation

To optimize our neural network weights to more accurately represent our data, we first establish a loss function that quantifies our model’s error (i.e., how wrong its outputs are for a desired task). Given this loss function, we could then compute the gradients of our loss function with respect to each weight. We could then perform gradient descent to optimize our model parameters as we did in the gradient descent blog post. However, we will now do something a bit more sophisticated to make the optimization process more computationally efficient.

Let’s consider a common loss function, as shown in Equation 6. This loss function represents a mean-squared error loss between the true label (true y-value in the data) and the predicted value using our neural network ($\hat{y}$).

\begin{align}
L=\frac{1}{2} (y-\hat{y})^2 = \frac{1}{2} (y-f(x))^2
\end{align}

The derivative of this loss function with respect to $w_{1,1}^{(1)}$, obtained via the chain rule, is displayed in Equation 7.

\begin{align}
\frac{\partial L}{\partial w_{1,1}^{(1)}} = -(y-\hat{y}) \, \frac{\partial \hat{y}}{\partial w_{1,1}^{(1)}} \bigg|_x
\end{align}

So, how do we compute $\frac{\partial \hat{y}}{\partial w_{1,1}^{(1)}}$? There are four paths from the output node to $w_{1,1}^{(1)}$, which are highlighted in blue in Figure 4. The upper-most path has the gradient shown in Equation 8.

\begin{align}
\frac{\partial \hat{y}}{\partial w_{1,1}^{(1)}} =  \frac{\partial \hat{y}}{\partial o^{(3)}} \frac{\partial o^{(3)}}{\partial z^{(3)}} \frac{\partial z^{(3)}}{\partial o^{(2)}_1} \frac{\partial o^{(2)}_1}{\partial z^{(2)}_1} \frac{\partial z^{(2)}_1}{\partial o^{(1)}_1} \frac{\partial o^{(1)}_1}{\partial w_{1,1}^{(1)}}
\end{align}

The rule for computing the “total derivative” encompassing all of these paths can be seen through an example here.

A depiction of neural network dependencies on $w_{1,1}^{(1)}$.

Computing a single partial derivative is computationally expensive, and computing each weight’s partial derivative individually (e.g., $\frac{\partial \hat{y}}{\partial w_{1,1}^{(1)}}$, then $\frac{\partial \hat{y}}{\partial w_{1,2}^{(1)}}$, and so forth, all the way to $\frac{\partial \hat{y}}{\partial w_{1,n}^{(l)}}$) would be terribly inefficient.

If we instead look at $\frac{\partial \hat{y}}{\partial w_{2,1}^{(1)}}$, we can see that it shares many of the same nodes as $\frac{\partial \hat{y}}{\partial w_{1,1}^{(1)}}$. In Figure 5, we display a depiction of the gradient computations required for $w_{1,1}^{(1)}$ and $w_{2,1}^{(1)}$, and the overlap in the computations.

A depiction of neural network dependencies on $w_{1,1}^{(1)}$ (left), neural network dependencies on $w_{2,1}^{(1)}$ (middle), and the overlap between dependencies.

We would like to take advantage of this overlap in computation, which is exactly where the Backpropagation algorithm comes in. Backpropagation is a form of reverse-mode automatic differentiation that allows us to eliminate redundant computation in applying the chain rule to compute the gradient of our neural network.

Applied to neural networks, the backpropagation algorithm starts from the last layer of the neural network, computes an error term (denoted as $\delta$), and multiplies this error term by the gradient. The algorithm stores each node’s respective gradient term in memory and reuses these variables as it continues, backpropagating the information for efficiency. We display the algorithm for Backpropagation in Algorithm 2. The output of the Backpropagation algorithm is a gradient for each set of weights and biases.

Looking at the computational complexity of this algorithm, we see Step 4 has a complexity of $O(1)$. Step 6 has a complexity of $O(n^2 + n)$, or $O(n^2)$. Step 10 has a complexity of $O(n)$. Steps 12 and 14 both have complexities of $O(n^2)$. Since we iterate over the number of layers, equal to $n$ for our analysis, the total complexity is $O(n^3)$. If we did not use the Backpropagation algorithm, we would end up redundantly computing each component’s derivatives multiple times. Using the previous assumptions, setting the size of the input $|x|$, the number of nodes per layer, and the number of layers all to $n$, we would be iterating over $O(n^3 + n^2)$ weight and bias derivatives, resulting in a total complexity of $O(n^6)$.

The last step in optimizing our neural network model to minimize our loss function is to update weights and biases. Equations 9-10 show one step of gradient descent using the derivatives from our backpropagation procedure in Algorithm 2. Note that Equations 9-10 represent only one step of gradient descent, so one would need to iteratively perform the forward pass, then backward pass, then parameter updates (Equations 9-10) until the model converges to a local minimum, as described in our previous post on gradient descent.

\begin{align}
w^{(l)} \to w^{(l)} + \alpha \Delta w^{(l)}
\end{align}
\begin{align}
b^{(l)} \to b^{(l)} + \alpha \Delta b^{(l)}
\end{align}
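Putting the pieces together, here is a minimal numpy sketch of one full training iteration for a single-hidden-layer ReLU network: a forward pass, a backward pass that reuses the error terms $\delta$, and one parameter update. The shapes, initialization, and learning rate are illustrative assumptions, and the update is written as a step along the negative gradient of the loss, matching Equations 9-10 when $\Delta w$ denotes the descent direction.

import numpy as np

rng = np.random.default_rng(0)
d, n, alpha = 3, 4, 0.01                     # input size, hidden width, step size
W1, b1 = rng.normal(size=(n, d)), np.zeros(n)
W2, b2 = rng.normal(size=(1, n)), np.zeros(1)

x, y = rng.normal(size=d), np.array([2.0])   # one training example

# Forward pass
z1 = W1 @ x + b1
o1 = np.maximum(z1, 0.0)                     # ReLU hidden layer
y_pred = W2 @ o1 + b2                        # linear output, i.e., y_hat

# Backward pass: propagate the error term delta from the output backwards
delta2 = y_pred - y                          # dL/dy_hat for L = (1/2)(y - y_hat)^2
grad_W2 = np.outer(delta2, o1)
grad_b2 = delta2
delta1 = (W2.T @ delta2) * (z1 >= 0).astype(float)   # chain rule through the ReLU
grad_W1 = np.outer(delta1, x)
grad_b1 = delta1

# One gradient-descent step (Equations 9-10, stepping along the negative gradient)
W2 -= alpha * grad_W2; b2 -= alpha * grad_b2
W1 -= alpha * grad_W1; b1 -= alpha * grad_b1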

We now have the ability to train neural networks in a computationally efficient manner, which, in turn, allows us to do more training iterations per unit time, which should help us train more accurate neural networks!

Key Points throughout this Blog:

  1. Neural networks can be used to represent almost any function.
  2. Forward propagation is the process of taking input features, $x$, and computing an output in the form of a model’s estimate, $\hat{y}$.
  3. We can use the difference in the output, $\hat{y}$, of the neural network and our ground-truth data, $y$, to compute a loss, and then use this loss value to compute gradients in the backpropagation algorithm.
  4. Backpropagation is a special case of reverse-mode automatic differentiation and allows for efficient gradient computation. The result is that the training of extremely large neural networks is possible!

Filed Under: Bootcamp
