
Bootcamp Summer 2020 Week 7 – DQN And The Deadly Triad, Or, Why DQN Shouldn’t Work But Still Does

July 9, 2022 by Laura Strickland and Matthew Gombolay

Deep Q-Networks (DQN) are an effective deep reinforcement learning method for solving many sequential decision-making problems. DQN is not, however, foolproof: it can be quite difficult to train effectively, or it can simply fail to learn. The authors of the paper that inspired this post (van Hasselt et al., 2018) discuss the three main reasons that DQN can fail—the so-called Deadly Triad—and why DQN can sometimes work anyway.

The components of the “Deadly Triad” are:

  1. Function Approximation
  2. Bootstrapping
  3. Off-policy Learning

In this blog post, we define and discuss each of these and some ways in which they can be mitigated.

Why is this important? Everything we know suggests that DQN should not work, and if it does, we got lucky with our setup. But DQN does work, at least some of the time, and more reliably than one might expect. This suggests that we don’t yet fully understand what is going on in DQN. In the remainder of this blog post, we cover what the components of the deadly triad are, the problems they can cause, and some of the ways we can sidestep their undesirable effects.

1. Function Approximation

Function approximation in Reinforcement Learning gives us the ability to better handle large state and action spaces by approximating the function we are trying to learn (e.g. a Q-function) rather than learning the function exactly. First, we cover the basics of how RL can work without function approximation, then look at how function approximation is applied when the state and/or action spaces become too large to sample effectively using exact techniques, such as value iteration or policy iteration.

The Q-function $Q(s, a)$—that is, the state-action value function—is a mapping of states and actions to a value in the set of all real numbers:

\begin{align}
Q: S \times A \rightarrow \mathcal{R}
\label{q_mapping}
\end{align}

Recall that value iteration evaluates the utility of a state as the maximum Q-value at that state, over all the actions the agent can take in that state, as we see in Equation \ref{value_iter}.

\begin{align}
V(s) = \max_a Q(s, a)
\label{value_iter}
\end{align}

When we look at methods that use the Q-function, such as value iteration, we form a table for $V$ (see left of Fig. 1) and for the Q-values (see right of Fig. 1) and iteratively update those tables as our estimates of the utility of each state and the expected value of each state-action pair improve during training.

Left side: A table showing the utility of specific states in a simple discrete (or discretized) state space. Right side: A table showing the Q-value of specific state-action pairs in a scenario with simple discrete (or discretized) state and action spaces. (Note that these values are for illustrative purposes only.)
The utility of each state and the Q-value of each state-action pair can be represented in tabular forms. Note that the V(S) values in the left table correspond to the Q-values of the maximizing actions, according to Equation \ref{value_iter}.

Under the tabular (i.e. exact) setting, we can, given a sufficient number of iterations, learn the optimal Q-function, value function, or policy.

Usually, however, our state spaces are far too large for tabular Q-learning to be remotely practical. It simply takes far too long to cover the entire table thoroughly enough to reach convergence to the optimal policy, or anything even close. Even high-performance computers will lack the RAM required to store a large Q-table in memory. Rather than wait years for our learner to visit every state-action pair in the Q-table enough to allow the values to converge, we instead try to form an estimate of what this table would tell us using function approximation.

This function approximation was originally performed by effectively placing multi-dimensional Gaussian kernels over our $Q(s, a)$ table to represent the action choices in the state space with probability distributions. These Gaussians interpolate a desired output value (e.g., a Q-value) between inputs (e.g., state-action pairs). The number of Gaussian kernels $k$ that we need to adequately cover all of the states and actions can be far smaller than the size of the $Q$-table, $|S| \times |A|$.

Modern approaches to function approximation employ neural networks. The neural network learns a surface, such as the one shown in Figure 2, that covers what the Q-table did and estimates the utility of taking a specific action in a specific state, just as the Q-table does. Fitting a surface over the state $\times$ action space instead of a table has an additional benefit: when the agent finds itself in a state it has not encountered before, it can still interpolate an estimate of the utility of that state using the function approximator (i.e., the neural network). An important point to remember, though, is that function approximation, despite this benefit, is almost never going to learn the Q-function perfectly. We can apply gradient descent to get closer to the true value function, but our function approximator will almost never be perfectly accurate.

An example of a surface learned by a neural network trained with DQN.

Thus, function approximation does not perform true interpolation between the known-value points, but it comes close.
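To make this concrete, here is a minimal sketch of a neural-network Q-function approximator in PyTorch. The architecture, layer sizes, and names are illustrative assumptions only, not those of the original DQN.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Minimal Q-function approximator: maps a state vector to one Q-value per discrete action."""
    def __init__(self, state_dim: int, num_actions: int, hidden_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_actions),  # one output per action: Q(s, a) for each a
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

# Usage: estimate Q-values for a (batched) state and pick the greedy action.
q_net = QNetwork(state_dim=4, num_actions=2)
state = torch.randn(1, 4)
q_values = q_net(state)              # shape: (1, 2)
greedy_action = q_values.argmax(dim=1)
```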

Function approximation is also used in policy gradient methods; neural networks can be used to represent the policy and the transition function, as illustrated in the two figures below.

Representation of the policy network in policy gradient setups.
Function approximation can represent the state transition function in policy gradient learning scenarios.

In a previous entry, we introduced the Bellman Residual (Equation \ref{bellman-res}). If we are parameterizing $Q$ by some $\theta$ such that $Q_{\theta}: S \times A \longrightarrow \mathcal{R}$, then we can use the Bellman residual as the loss function we use to train $Q_{\theta}$.

\begin{align}
L(\theta) = \mathbb{E} \left[ \left((r_t + \gamma \max_a Q_{\theta}(s_{t+1}, a)) - Q_{\theta}(s_t, a_t)\right)^2 \right] \label{bellman-res}
\end{align}

Minimizing Equation \ref{bellman-res} reduces the error between our current Q-value estimates at the state-action pairs the agent has visited and the bootstrapped targets for those pairs.

What is the problem with this? When we apply gradient descent, we assume that $(r_t + \gamma \max_a Q_{\theta}(s_{t+1}, a))$ from Equation \ref{bellman-res} is a constant with respect to $\theta$. Thus, when we take the gradient of $L(\theta)$, the only term we differentiate is $Q_{\theta}(s_t, a_t)$. We will see why this creates challenges for us in the next section.
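In code, this assumption is typically expressed with a stop-gradient around the target. The sketch below is a minimal, hypothetical PyTorch illustration (the batch layout and network interface are our own assumptions), not the loss from any particular implementation.

```python
import torch

def dqn_loss(q_net, batch, gamma=0.99):
    """Bellman-residual loss with the bootstrapped target held constant with respect to theta."""
    s, a, r, s_next, done = batch  # a: LongTensor of action indices; done: float 0/1 end-of-episode flags
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)        # Q_theta(s_t, a_t)
    with torch.no_grad():                                       # target treated as a constant w.r.t. theta
        target = r + gamma * (1 - done) * q_net(s_next).max(dim=1).values
    return torch.mean((target - q_sa) ** 2)
```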

2. Bootstrapping

With respect to the Deadly Triad, we define bootstrapping as taking information we already have and trying to do more with it than we should be able to do. Let’s look at the instability this can cause and the ways practitioners have come up with to mitigate these instabilities.

When we looked at policy gradients and REINFORCE, we learned that the update for the policy parameters $\phi$ is defined by Equation \ref{policy-grad-eqn}, where $i$ is the designator of the trajectory from which the states and actions are drawn (for the case of doing multiple policy rollouts), and $t$ is the timestep.

\begin{align}
\Delta \phi = A_{it} \nabla_{\phi} \log \pi_{\phi}(s_{it}, a_{it})
\label{policy-grad-eqn}
\end{align}

We compute $A_{it}$, the cumulative reward term, as shown in Equation \ref{a-i-t}.

\begin{align}
A_{it} = \sum_{t'=t}^{T_i} \left(\gamma^{t'-t} r_{t'} \right)
\label{a-i-t}
\end{align}

How is this applied? When computing $\Delta \phi$ and $A_{it}$, we look at each timestep in the trajectory from timestep $t$ until the end, identify what the reward is, and back up the (discounted) reward across prior timesteps. This means that a reward received many timesteps into the future still affects this cumulative reward term, $A_{it}$. In a scenario with sparse rewards, backing up rewards to influence the evaluations of earlier timesteps in this manner is particularly helpful, as it helps the agent learn the steps it took along the way to getting a reward at some specific timestep. This gives the learner information from later in the trajectory instead of it needing to rely on a Q-function to tell it whether it is going in a (positively) rewarding direction.
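A small sketch of this backup, written as a helper (our own) that turns one trajectory's reward list into the discounted cumulative reward at every timestep:

```python
def discounted_returns(rewards, gamma=0.99):
    """Compute A_it = sum_{t'=t}^{T} gamma^(t'-t) * r_{t'} for every timestep t of one trajectory."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):   # sweep backward so each step reuses the later sum
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Example: a sparse reward arriving only at the final step still credits earlier timesteps.
print(discounted_returns([0.0, 0.0, 0.0, 1.0], gamma=0.9))  # [0.729, 0.81, 0.9, 1.0]
```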

Notably, an alternative way to compute the gradient of $\phi$ is Equation \ref{alt-phi-grad}.

\begin{align}
\Delta \phi = Q_{\phi}(s, a) \nabla_{\phi} \log \pi_{\phi}(s_{it}, a_{it})
\label{alt-phi-grad}
\end{align}

When we are training $Q_{\phi}(s, a)$ with the Bellman residual and holding $(r_t + \gamma \max_a Q_{\phi}(s_{t+1}, a))$ constant, we encounter what we call the “chasing one’s tail” or “runaway” effect, which we discuss in the blog post about DQN. Specifically, in Equation \ref{alt-phi-grad}, we are using the Q-function as its own target: we regress $Q(s, a)$ towards the target, $R(s, a) + \gamma \max_{a'} Q(s', a')$, which itself contains $Q$!
That is what bootstrapping means in the context of the Deadly Triad: using the very estimate you are trying to learn as the target you are learning it from.

What does Q-learning without bootstrapping look like? Instead of computing $L(\theta)$ as we did in Equation \ref{bellman-res}, we compute the loss by looking $n$ steps into the future, as we demonstrate in Equation \ref{n-step-lookahead}.

\begin{align}
\hspace{-3.75em}L(\theta) = \mathbb{E} \left[ \left(\sum_{t' = t}^{t + n - 1} \gamma^{t' - t} r_{t'} +
\gamma^{n} \max_a Q(s_{t+n}, a) - Q_{\theta}(s_t, a_t) \right)^2 \right] \label{n-step-lookahead}
\end{align}

We select $n$ such that the $n$th step will be beyond the final timestep of the episode, which simplifies Equation \ref{n-step-lookahead} to Equation \ref{n-step-lookahead-simp}.

\begin{align}
L(\theta) = \mathbb{E} \left[ \left(\sum_{t' = t}^{t + n - 1} \gamma^{t' - t} r_{t'} - Q_{\theta}(s_t, a_t) \right)^2 \right] \label{n-step-lookahead-simp}
\end{align}
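A sketch of how such an n-step target might be assembled from one stored episode; the helper and its arguments are our own illustrative assumptions:

```python
def n_step_target(rewards, q_next, t, n, gamma=0.99):
    """n-step lookahead target for timestep t.

    rewards: list of rewards r_t, r_{t+1}, ... for one episode
    q_next:  max_a Q(s_{t+n}, a), or None if s_{t+n} falls past the end of the episode
    """
    horizon = min(t + n, len(rewards))
    target = sum(gamma ** (tp - t) * rewards[tp] for tp in range(t, horizon))
    if q_next is not None and t + n < len(rewards):
        target += gamma ** n * q_next   # bootstrap only if the episode has not yet ended
    return target
```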

Q-learning is known for having an overestimation problem: the $\max$ operation in the target biases the Q-value estimates of states to the high side, sometimes to the point of drowning out the reward signal. Sometimes this problem can be fixed by hyperparameter tuning, but not always. So how do we keep the bias caused by bootstrapping at bay?

How to fix bootstrapping?

A common tactic to alleviate the bias caused by bootstrapping is to apply target Q-learning; that is, we maintain two Q-networks and calculate the Bellman residual as shown in Equation \ref{loss-fcn-targ}.

\begin{align}
L(\theta) = \mathbb{E} \left[ \left((r_t + \gamma \max_{a'} Q_{\phi}(s_{t+1}, a')) - Q_{\theta}(s_t, a_t) \right)^2 \right] \label{loss-fcn-targ}
\end{align}

The difference between Equations \ref{bellman-res} and \ref{loss-fcn-targ} is that Equation \ref{loss-fcn-targ} uses the target Q-network, $Q_{\phi}$, to compute the value of the maximizing action at time $t+1$, rather than the same network, $Q_{\theta}$, that appears in the latter part of the loss. We update the target Q-network as the main Q-network trains, as in Equation \ref{target-update}, but it lags behind; $\alpha$ is a small scalar. This lagging target-network update can help to stabilize the runaway chasing-one’s-tail effect.

\begin{align}
\phi \leftarrow \phi \left(1 - \alpha\right) + \alpha \theta
\label{target-update}
\end{align}
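In code, this lagging update is a simple per-parameter average. A sketch, assuming the networks are PyTorch modules like the Q-network sketched earlier:

```python
def soft_update(target_net, online_net, alpha=0.005):
    """phi <- (1 - alpha) * phi + alpha * theta: the target network slowly tracks the learner."""
    for phi, theta in zip(target_net.parameters(), online_net.parameters()):
        phi.data.copy_((1.0 - alpha) * phi.data + alpha * theta.data)
```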

Another approach to ameliorating the problems of bootstrapping is Inverse Double Q-learning. The $\max_{a'}$ term in the Bellman residual equation (Equation \ref{bellman-res}) is replaced with $Q_{\theta}(s_{t+1}, \underset{a'}{\operatorname{argmax}}~ Q_{\phi}(s_{t+1}, a'))$, where $Q_{\theta}$ is the “real” (learner) network and $Q_{\phi}$ is the target network. Equation \ref{inv-double-q} is the resulting loss function for Inverse Double Q-learning.

\begin{align}
\hspace{-3.25em}L(\theta) = \mathbb{E} \left[ \left((r_t + \gamma Q_{\theta}(s_{t+1}, \underset{a'}{\operatorname{argmax}}~ Q_{\phi}(s_{t+1}, a'))) - Q_{\theta}(s_t, a_t)\right)^2 \right] \label{inv-double-q}
\end{align}

In this case, we still use the learner network to estimate the Q-value, but rather than picking the maximizing action in state $s_{t+1}$ according to the learner network, we use the target network to select the maximizing action.

Yet another approach to ameliorate bootstrapping, Double Deep Q-Networks (DDQN), replaces the $\max$ term in the Bellman residual with $Q_{\phi}(s_{t+1}, \underset{a'}{\operatorname{argmax}}~ Q_{\theta}(s_{t+1}, a'))$, a term that uses the target network (subscript $\phi$) to evaluate the value of a state-action pair and uses the learner network (subscript $\theta$) to determine which action is the maximizing action in that state. The resulting loss function for DDQN is given in Equation \ref{ddqn-targ}.

\begin{align}
\hspace{-3.25em}L(\theta) = \mathbb{E} \left[ \left((r_t + \gamma Q_{\phi}(s_{t+1}, \underset{a'}{\operatorname{argmax}}~ Q_{\theta}(s_{t+1}, a'))) - Q_{\theta}(s_t, a_t)\right)^2 \right] \label{ddqn-targ}
\end{align}

The advantage of DDQN’s approach is that we still use a target network, but action selection is decoupled between the target and learner networks. DDQN is generally accepted as the current “best” method for ameliorating the effects of bootstrapping. An additional variant on this approach is to use two target networks, evaluate which action to select using both target networks, and choose the action that gives the lower Q-value.
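A sketch of the DDQN target described above, assuming q_online and q_target are networks that map a batch of states to per-action Q-values (the function and its signature are our own, not a reference implementation):

```python
import torch

def ddqn_target(q_online, q_target, r, s_next, done, gamma=0.99):
    """DDQN: the online (learner) network picks the action, the target network evaluates it."""
    with torch.no_grad():
        best_action = q_online(s_next).argmax(dim=1, keepdim=True)        # argmax_a' Q_theta(s', a')
        next_value = q_target(s_next).gather(1, best_action).squeeze(1)   # Q_phi(s', argmax ...)
    return r + gamma * (1 - done) * next_value
```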

In general, all of these methods of mitigating the chasing-one’s-tail problem of bootstrapping involve using a target network and decoupling the action selection from the evaluation of the state-action pair’s value in calculating the Bellman residual.

3. Off-policy Learning

In the context of the Deadly Triad, we primarily use the term “off-policy learning” to refer to RL setups that utilize experience replay during training. Specifically, we utilize data generated using an older version of our Q-function under an $\epsilon$-greedy sampling strategy to improve our current estimate of the Q-function when we operate under it with $\epsilon = 0$. For more detail on on- and off-policy learning, please see the blog post on the subject. As that post discusses in the context of Deep Deterministic Policy Gradients (DDPG), in off-policy learning $Q_{\theta}(s, a)$ is not a critic of an arbitrary policy network $\pi'(\cdot | s)$. Instead, it is a critic of $\pi^*(s) = \underset{a}{\operatorname{argmax}}~Q(s,a)$. The process by which we generate the state-action pairs used to train our Q-function can therefore be different from what the critic is evaluating; that is, we decouple sampling/exploration from what the critic is judging. This decoupling is generally considered preferable for sample-efficiency reasons, among others. Off-policy learning, however, is difficult to do well in the context of experience replay.

What is Experience Replay?

In experience replay, we have a database, $D = \{ \langle s_0, a_0, r_0, s_0' \rangle, \langle s_1, a_1, r_1, s_1' \rangle, \dots \}$, in which we collect sample experiences and from which we train our network. These databases, often called replay buffers, are key components of many RL frameworks.

Even though we train from a replay buffer of past experiences, we can modulate the degree to which we train our network off-policy. Q-learning is not entirely off-policy: when taking the expectation in the Bellman residual, the states over which we compute the expectation must be drawn from somewhere. Ideally, that distribution would be uniform over all states, as in value iteration; instead, we draw the state-action pairs from the replay buffer, which is biased towards some state-action pairs over others based upon our sampling history. The replay buffer is skewed towards the policy used to generate the experiences within it.
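For reference, a minimal uniform-sampling replay buffer might look like the following sketch (a simplified representation of our own, not a specific library's API):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer of <s, a, r, s', done> transitions with uniform sampling."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest experiences are evicted first

    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # Uniform sampling; note the buffer contents are still skewed toward the behavior policy.
        return random.sample(self.buffer, batch_size)
```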

In the paper that inspired this blog post, van Hasselt et al. explore prioritizing the experiences in their agent’s replay buffer by setting the probability of sampling a specific transition from the replay buffer to be proportional to its error—the Bellman residual. That is, they rank the samples in the replay buffer according to how wrong the function approximator is when evaluating them and then preferentially sample the points that are ranked as most incorrect, so that the learner can fix its Q-value estimates for those experiences. While this approach seems reasonable—a human who is very unskilled at a specific task and practices that task is sure to improve their performance, even if only a little—it actually increases the likelihood of the Q-network diverging when compared to training the Q-network on uniformly-sampled experiences. The takeaway here is that, while prioritizing specific examples in a replay buffer is an interesting area of research, if we just want our learner to learn, sampling non-uniformly from our replay buffer could make things worse.

An additional divergence risk is the size of the neural network itself. In general, for DQN, the risk of the function approximator diverging increases as the network size increases. Conversely, for DDQN, the risk of the network diverging is agnostic to network size.

Conclusions

DQN and its variants are frequently used because they are widely applicable and often work well in practice. The Deadly Triad—function approximation, bootstrapping, and off-policy learning—can all cause complications in the application of DQN, however; in this blog post, we covered these challenges and common approaches to handling them. The main takeaways from this blog post are:

  1. Function approximation does not have the same nice guarantees that tabular learning has, but it is far more flexible when the state and/or action spaces are large.
  2. To ameliorate the instability bootstrapping can cause, consider either using a multi-step update in the Bellman residual or using DDQN.
  3. When using a replay buffer, sample from the replay buffer uniformly—don’t try to prioritize any experiences over others.

Many thanks to Ruisen “Eric” Liu for editing this blog post and providing suggestions for how to improve it.

References

  1. Mnih, Volodymyr, et al. “Playing Atari with Deep Reinforcement Learning.” arXiv preprint arXiv:1312.5602 (2013).
  2. van Hasselt, Hado, et al. “Deep Reinforcement Learning and the Deadly Triad.” arXiv preprint arXiv:1812.02648 (2018).
  3. Sutton, Richard S., and Andrew G. Barto. Reinforcement Learning: An Introduction, 2nd ed. The MIT Press, Cambridge, MA, 2018.
  4. Lillicrap, Timothy P., et al. “Continuous Control with Deep Reinforcement Learning.” International Conference on Learning Representations (2016).


Bootcamp Summer 2020 Week 4 – On-Policy vs Off-Policy Reinforcement Learning

February 28, 2022 by mschrum3 and Matthew Gombolay

 

Reinforcement learning algorithms can be divided into two main types: on-policy and off-policy. Understanding the differences between on-policy and off-policy learning is essential to understanding and implementing various reinforcement learning algorithms. Below we will discuss the key differences in on-policy and off-policy learning, the algorithms that employ these two different approaches, and the implications of these learning approaches.

  1. What is on-policy versus off-policy?
  2. Examples of on-policy and off-policy algorithms
  3. Implications
  4. Advantages and Disadvantages
  5. Takeaways
Break-down of types of reinforcement learning algorithms.

 

1. What is on-policy versus off-policy?

On-policy and off-policy learning fall under the category of model-free reinforcement learning algorithms, meaning that we do not have access to the transition probability distribution. This is opposed to model-based methods such as Monte-Carlo Tree Search (MCTS).

To begin our discussion of on-policy and off-policy reinforcement learning, we will first introduce some terminology. In a reinforcement learning setting, an agent interacts with an environment, moving from state, $s$, to $s'$ via action, $a$, and transition probability, $T(s,a,s')$. The agent’s policy, $\pi(s)$, dictates the action that an agent takes in state, $s$. The objective of reinforcement learning is to find the optimal policy, $\pi^*$, which maximizes the expected future discounted reward. The Q-function, $Q(s,a)$, describes the value of a state-action pair. If we know the optimal Q-function, $Q^*$, which describes the maximum expected reward of any given state-action pair, then we can extract the optimal policy according to Eq. 2.

\begin{equation}
Q^*(s,a)=R(s,a)+\gamma \mathop{\mathbb{E}}_{s'}[V^*(s')]
\end{equation}

\begin{equation}
\pi^*(s)=argmax_{a}Q^{*}(s,a) \quad \forall s
\end{equation}

 

To understand the difference between on-policy learning and off-policy learning one must first understand the difference between the behavior policy (i.e., sampling policy) and the update policy.

Behavior Policy: The behavior policy is the policy an agent follows when choosing which action to take in the environment at each time step. In Q-learning, our behavior policy is typically an $\epsilon$-greedy strategy. In an $\epsilon$-greedy behavior policy, an agent chooses the action dictated by the current policy, $\pi(s)$, with probability $1-\epsilon$ and a random action with probability $\epsilon$. The behavior policy generates actions and determines how the agent interacts with the environment.

Update Policy: The update policy is central to understanding the difference between on-policy and off-policy learning. The update policy is how the agent updates the Q-function. In other words, it dictates how the agent derives the state-action pairs used to calculate the difference between the target Q-value and the current predicted Q-value, also known as the TD-error. The TD-error is then used to update the Q-function.

This brings us to the key difference between on-policy and off-policy learning: On-policy algorithms attempt to improve upon the current behavior policy that is used to make decisions, and therefore these algorithms learn the value of the policy carried out by the agent, $Q^{\pi}$. Off-policy algorithms learn the value of the optimal policy, $Q^*$, and can improve upon a policy that is different from the behavior policy. Determining whether the update and behavior policies are the same or different can give us insight into whether the algorithm is on-policy or off-policy. If the update policy and the behavior policy are the same, this suggests, but does not guarantee, that the learning method is on-policy. If they are different, this suggests that the learning method is off-policy.

Let’s take a look at some examples.

2. Examples

Here we discuss several examples of on-policy versus off-policy algorithms and highlight the key differences between them. The important lines in each algorithm are shown in red.

Below is the pseudo-code for SARSA (State, Action, Reward, State, Action), an on-policy algorithm.

SARSA often utilizes an $\epsilon$-greedy behavior policy and follows the same policy when updating the Q-function, meaning that $a’$ is also selected based upon an $\epsilon$-greedy strategy as shown in Line 7. Thus, because SARSA learns the quality of the behavior policy rather than the quality of the optimal policy, SARSA is an on-policy algorithm.
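As a rough stand-in for the update at the heart of that pseudo-code, a tabular SARSA step might look like the sketch below; the dictionary representation of Q (e.g., a defaultdict(float) keyed by (state, action)) and the $\epsilon$-greedy helper are our own illustrative choices:

```python
import random

def epsilon_greedy(Q, s, actions, eps=0.1):
    """Pick a random action with probability eps, otherwise the greedy action under Q."""
    return random.choice(actions) if random.random() < eps else max(actions, key=lambda a: Q[(s, a)])

def sarsa_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99, eps=0.1):
    """On-policy: the next action a' is drawn from the same epsilon-greedy behavior policy."""
    a_next = epsilon_greedy(Q, s_next, actions, eps)
    td_error = r + gamma * Q[(s_next, a_next)] - Q[(s, a)]
    Q[(s, a)] += alpha * td_error
    return a_next   # the agent actually executes a' next, keeping behavior and update aligned
```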

Below is the pseudo-code for Q-learning.

The key difference between Q-learning and SARSA is the $\max_{a'} Q(s',a')$ term found in the Q-learning algorithm. This $\max$ term means that Q-learning is learning the value of the optimal policy. SARSA’s Q-function models the Q-values of the behavior policy, whereas Q-learning learns the optimal Q-values. This distinction is what makes SARSA on-policy and Q-learning off-policy.
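For comparison, a sketch of the corresponding tabular Q-learning update (same hypothetical dictionary representation as the SARSA sketch above); the bootstrap uses the max over actions rather than the action the behavior policy will actually execute:

```python
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    """Off-policy: the update bootstraps from the greedy (max) action, regardless of what the agent does next."""
    best_next = max(Q[(s_next, a_next)] for a_next in actions)
    td_error = r + gamma * best_next - Q[(s, a)]
    Q[(s, a)] += alpha * td_error
```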

Let’s take a look at another algorithm, Deep Deterministic Policy Gradient (DDPG). DDPG is an actor-critic method. This means that it concurrently learns both a policy, represented by the actor, and a Q-function, represented by the critic. DDPG also utilizes experience replay, in which it stores state-action-reward-next-state tuples in a buffer and samples these examples during training. We focus on the part of DDPG that is relevant to our on-policy versus off-policy discussion. Below is a subsection of the pseudocode. $\pi_{\theta}$ describes the policy the agent is learning, parameterized by $\theta$, and the Q-function is parameterized by $\phi$.

We see in Line 6 that DDPG utilizes a stochastic behavior policy defined by the actor $\pi$ plus exploration noise. What about the update policy? The update policy is shown in Line 11. Unlike in Q-learning, there is no $\max$ when selecting the next action in Line 11. Instead, we see that we are using the actor $\pi_{\theta}$ to select $a'$, which sounds like DDPG is optimizing its current policy and is therefore an on-policy algorithm. So does this mean that, because DDPG updates the Q-function via $Q_{\phi}(s',\pi_{\theta}(s'))$ instead of $\max_{a'} Q_{\phi}(s',a')$, DDPG is on-policy? No! Based on our definition, an on-policy algorithm improves upon the current policy, whereas an off-policy algorithm learns the value of the optimal policy. With DDPG, as time goes to infinity, the actor selects the action which maximizes the Q-function, as shown in Line 12. Because, at convergence, $\pi_{\theta}$ is optimal and therefore $Q_{\phi}(s',\pi_{\theta}(s'))=\max_{a'} Q_{\phi}(s',a')$, DDPG is evaluating the optimal policy, making it an off-policy algorithm.
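A sketch of the critic target just described, assuming hypothetical actor and critic modules with these signatures; note the absence of a max over actions:

```python
import torch

def ddpg_critic_target(actor, critic_target, r, s_next, done, gamma=0.99):
    """The critic regresses toward r + gamma * Q_phi(s', pi_theta(s')) rather than a max over actions."""
    with torch.no_grad():
        a_next = actor(s_next)                                 # deterministic action from the current actor
        next_q = critic_target(s_next, a_next).squeeze(-1)     # Q_phi(s', pi_theta(s'))
    return r + gamma * (1 - done) * next_q
```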

To the right is a list of other common on-policy and off-policy algorithms. The same analysis of the behavior policy and the update policy can be used to verify the type of learning that each of these algorithms employs.

3. Implications

So what are the implications of these differences? In the end does the type of learning matter? When should we use which algorithm?

This figure depicts the cliff walking domain. The SARSA agent learns the safer path shown in blue and the Q-learning agent learns the optimal path shown in red [1].

Different Policy Outcomes: First of all, these different training methods can result in different behaviors because on-policy methods learn $Q^\pi$ and off-policy methods learn $Q^*$. For example, because the update policy in Q-learning follows a greedy policy, it assumes that the agent will act optimally in the future, and therefore Q-learning will learn the optimal policy as long as all states and actions are sufficiently experienced. As shown in the above figure, this means that the cliff-walking agent will learn to take the optimal path from the start to the goal even though this path may be dangerous if the agent’s actions are stochastic.

SARSA, on the other hand, assumes that the agent will follow the current policy in the future. This means that when updating the Q-function, we assume that the cliff-walking agent will at times jump over the cliff if it travels too close to the edge. Therefore, SARSA learns a policy which is safer and ensures that the agent will avoid the cliff. SARSA only learns a near-optimal policy that depends on the behavior strategy. For example, if the behavior strategy is $\epsilon$-greedy, then SARSA will learn an optimal $\epsilon$-greedy strategy. To more closely approximate the optimal policy, one can decay $\epsilon$ to decrease exploration over time. SARSA is guaranteed to converge to the optimal policy if the behavior policy converges to a greedy policy and all states and actions are visited an infinite number of times.

 

This figure shows the rewards received by an agent trained via SARSA and an agent trained via Q-learning in the cliff walking domain. In this domain, SARSA achieves higher rewards than Q-learning despite learning a sub-optimal policy [1].

Although Q-learning learns the optimal strategy, SARSA may have better online performance. As shown in the plot, SARSA obtains higher episodic rewards on average than Q-learning in the cliff domain due to the fact that the agent takes a safer route. However, this may not be the case for all domains.

When to employ which learning strategy: If, for example, one is utilizing reinforcement learning to train a physical robot then one may opt for a more conservative and safer learning scheme such as the on-policy SARSA. SARSA also tends to be more stable while training. However, if training in simulation, Q-learning may be more effective since it is more likely to learn the optimal policy and less likely to become trapped in a local minimum. Additionally, off-policy algorithms can take advantage of experience replay when the behavior policy is changing or even if the behavior policy is generated from a source other than the learning agent (e.g., a human demonstrator). On-policy algorithms can only utilize experience replay when the behavior policy is static. Experience replay allows us to utilize historical data when learning which can help de-correlate training examples and is useful when gathering experience is costly. Additionally, off-policy learning is agnostic to how the data is generated and can therefore utilize data generated from other sources. Consequently, off-policy learning may be preferable in situations in which the data must be generated in a way that does not follow the current policy, for example, via human demonstrations.

4. Advantages and Disadvantages

5. Takeaways

  1. To identify if a reinforcement learning algorithm is off-policy or on-policy, one must first identify the behavior policy and the update policy. If the algorithm attempts to improve the behavior policy, it is on-policy. If however, the algorithm learns the optimal policy, then it is off-policy.
  2. SARSA is an example of on-policy learning and Q-learning is an example of off-policy learning. Algorithms such as DDPG, which learn the quality of the optimal policy, are off-policy strategies.
  3. The strategies that result from on-policy learning versus off-policy learning can differ in a number of ways, including how safe the policy is, its convergence guarantees, and its online performance.

References
[1] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. The MIT Press, second edition, 2018.

 


AAMAS 2022 Tutorial – Graph Neural Networks: Learning Representations of Robot Team Coordination Problems

January 18, 2022 by Matthew Gombolay and Zheyuan Wang

Tutorial at the International Conference on Autonomous Agents and Multi-Agent Systems 2022

Overview

Robot teams are increasingly being deployed in environments, such as manufacturing facilities and warehouses, to save cost and improve productivity. To efficiently coordinate multi-robot teams, fast, high-quality scheduling algorithms are essential to satisfy the temporal and spatial constraints imposed by dynamic task specification and part and robot availability. Traditional solutions include exact methods, which are intractable for large-scale problems, or application-specific heuristics, which require expert domain knowledge. What is critically needed is a new, automated approach that learns lightweight, application-specific coordination policies without the need for hand-engineered features.

This tutorial is an introduction to graph neural networks and a showcase of their power in solving multi-robot coordination (MRC) problems. We survey various frameworks of graph neural networks in recent literature, with a focus on their application in modeling multi-agent systems. We will introduce MRC problems and review the most relevant methods available for solving them. We will discuss several successful applications of graph neural networks to MRC problems, with hands-on tutorials in the form of Python example code. With this tutorial, we aim to provide experience in employing graph neural networks to model multi-robot systems, from algorithm development to code implementation, thus opening up future opportunities for designing graph-based learning algorithms in broader multi-agent research.

Content

In this tutorial, we will discover the power of graph neural networks for learning effective representations of multi-robot team coordination problems. The tutorial will feature two 90-minute online sessions.

Date and Time: May 8, 11:00 – 14:00 (EDT)

  • 3:00 – 6:00, May 9, Auckland Time
  • 11:00 – 14:00, May 8, New York Time
  • 17:00 – 20:00, May 8, Paris Time
  • 20:30 – 23:30, May 8, Kolkata Time
  • 23:00 – 2:00 (+1 day) May 8, Beijing Time

Zoom link: https://gatech.zoom.us/j/93966169541 [Video recordings are posted below]

Session #1 by Matthew Gombolay (11:00 – 12:30)

The first session will address the following: (a) How graph neural networks work – we will present a comprehensive overview of various graph neural networks proposed in prior literature, including both homogeneous and heterogeneous graphs and attention mechanisms; (b) How to model team coordination problems with graph neural networks – we will discuss which applications can suitably be modeled with graph neural networks, with a focus on MRC problems; (c) How to optimize the parameters of graph neural networks for team coordination problems – we will discuss which learning methods can be used for training a graph neural network-based solver. We conclude this session with the most recurrent challenges and open questions.

Tutorial notes: [Slides]

Video:

Session #2 by Zheyuan Wang (12:30 – 14:00)

The second session will provide a hands-on tutorial on how to get up and running with graph neural networks for coordination problems, with coding examples in a Python Jupyter notebook. In particular, we will look into the ScheduleNet architecture [6], a heterogeneous graph neural network-based solver for MRC problems under temporal and spatial constraints. The Jupyter notebook will work through the model implementation, training, and evaluation of ScheduleNet models on a synthetic dataset.

Python Jupyter notebook: [Open in Google Colab] [Github]

Video:

Target Audience

The target audience is intended to be students and researchers who have a strong grasp of machine learning but who may not yet be familiar with graph neural networks. Prior knowledge of multi-agent reinforcement learning (MARL) or planning and scheduling would be helpful but is not necessary. While the tutorial showcases the application of graph neural networks to solving multi-robot coordination problems, the methodology introduced can be applied by the audience to broader research problems in learning representations of multi-agent systems.

Presenters

Dr. Matthew Gombolay is an Assistant Professor of Interactive Computing at the Georgia Institute of Technology. He received a B.S. in Mechanical Engineering from the Johns Hopkins University in 2011, an S.M. in Aeronautics and Astronautics from MIT in 2013, and a Ph.D. in Autonomous Systems from MIT in 2017. Gombolay’s research interests span robotics, AI/ML, human–robot interaction, and operations research. Between defending his dissertation and joining the faculty at Georgia Tech, Dr. Gombolay served as a technical staff member at MIT Lincoln Laboratory, transitioning his research to the U.S. Navy, which earned him an R&D 100 Award. His publication record includes a best paper award from the American Institute of Aeronautics and Astronautics, a finalist for best student paper at the American Control Conference, and a finalist for best paper at the Conference on Robot Learning. Dr. Gombolay was selected as a DARPA Riser in 2018, received 1st place for the Early Career Award from the National Fire Control Symposium, and was awarded a NASA Early Career Fellowship for increasing science autonomy in space.

Zheyuan Wang is a Ph.D. candidate in the School of Electrical and Computer Engineering (ECE) at the Georgia Institute of Technology. He received B.S. and M.E. degrees, both in Electrical Engineering, from Shanghai Jiao Tong University, and an M.S. degree in ECE from Georgia Tech. He is currently a Graduate Research Assistant in the Cognitive Optimization and Relational (CORE) Robotics lab, led by Prof. Matthew Gombolay. His current research interests focus on graph-based policy learning that utilizes graph neural networks for representation learning and reinforcement learning for decision-making, with applications to human-robot team collaboration, multi-agent reinforcement learning, and stochastic resource optimization.

Reading Materials

  1. Ernesto Nunes, Marie Manner, Hakim Mitiche, and Maria Gini. 2017. A taxonomy for task allocation problems with temporal and ordering constraints. Robotics and Autonomous Systems 90 (2017), 55–70.
  2. Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. 2018. Graph attention networks. International Conference on Learning Representations (2018).
  3. Xiao Wang, Houye Ji, Chuan Shi, Bai Wang, Yanfang Ye, Peng Cui, and Philip S Yu. 2019. Heterogeneous graph attention network. The World Wide Web Conference (2019), 2022–2032.
  4. Jie Zhou, Ganqu Cui, Shengding Hu, Zhengyan Zhang, Cheng Yang, Zhiyuan Liu, Lifeng Wang, Changcheng Li, and Maosong Sun. 2020. Graph neural networks: A review of methods and applications. AI Open 1 (2020), 57-81.
  5. Zheyuan Wang and Matthew Gombolay. 2020. Learning scheduling policies for multi-robot coordination with graph attention networks. IEEE Robotics and Automation Letters 5, 3 (2020), 4509–4516.
  6. Zheyuan Wang, Chen Liu, and Matthew Gombolay. 2021. Heterogeneous graph attention networks for scalable multi-robot scheduling with temporospatial constraints. Autonomous Robots (2021), 1–20.

 


Bootcamp Summer 2020 Week 7 – Variational Inference in Heterogeneous Apprenticeship Learning

November 6, 2021 by Rohan Paleja and rliu302

Imagine we have an autonomous car that would like to perform speedy highway driving. There are several approaches to program such an autonomous vehicle. The first would be to hand-craft a rule-based system consisting of what to do in different scenarios. An example rule could be “if the autonomous car would like to move forward and there is a car in front and to the left, switch to the right lane”. A programmer could write down hundreds of rules consisting of every possible scenario. However, this process is expensive and leaves much to be desired. The second approach would be to utilize a driving simulator and design a reward function that defines the proper autonomous driving behavior. Then, one could use reinforcement learning to learn an optimal driving policy. However, defining such a reward function is difficult. Furthermore, if the driving simulator has differences with the real world, the behavior of the autonomous car may degrade when deployed. Lastly, one can utilize learning from demonstration (LfD), where user demonstrations are used directly to learn autonomous driving behavior. LfD has many benefits, including that it can learn human-like behavior and does not require a simulator. It is also much easier for a human to give an example of driving on the highway than defining all the rules used during driving. However, what happens when two drivers present demonstrations with conflicting behaviors (e.g., one prefers to pass on the left versus the other prefers to pass on the right)?

In this post, we are going to discuss how robots can learn from heterogeneous human demonstrators, i.e., demonstrators who perform the same task in different ways.

This blog post will address the following questions in 3 parts:

  1. What challenges arise from heterogeneity in demonstrations?
  2. How can we automatically infer preferences within demonstrations?
  3. How can we utilize person-specific embeddings in policy learning?

1. Challenges from heterogeneity of demonstrations

We can imagine multiple real-life scenarios, ranging from driving to setting a dinner table, in which there are heterogeneous approaches that can be utilized to reach the same goal. Continuing the autonomous driving example from the introduction, Figure 1 displays the heterogeneity that can occur within driving demonstrations. In the example displayed, some subset of the demonstrators preferred to pass on the left while others preferred to pass on the right. LfD approaches that assume homogeneous demonstrations will either fit to the mean (drive straight into the car in front) or fit a single mode (only pass on one side), both producing suboptimal/limited driving behavior. It is important to model the heterogeneity within the Learning from Demonstration framework. We will accomplish this via variational inference for person-specific embeddings.

Heterogeneity in Highway Driving Demonstrations

The decision-making we employ to select our approach can be explicitly defined as a policy: a function that maps states $S$ to actions $A$ (Equation 1).

\begin{align}
\hat{\pi}: S \rightarrow A
\end{align}

Depending on an implicit demonstrator preference, the demonstrator may find it advantageous to choose one policy over the other. For example, a demonstrator setting the table may be left-handed and prefer to set a drinking glass on the left-hand side. Without explicitly labeling demonstrations by preference, the context within this selection process is lost to a robot attempting to learn the task, making it very difficult to learn a policy representation across all demonstrators. Without access to latent information about demonstrators, current LfD approaches sacrifice optimal behavior for consistency, weaken the robustness of the robot to different scenarios, and reduce the number of examples the robot can use to learn from.

As we consider scenarios where demonstrations are numerous, it becomes advantageous to be able to autonomously infer distinct modalities without requiring someone at hand to label each demonstration. In addition, we would also like to learn a policy representing the user demonstrations within our dataset. How can one accomplish both objectives, inferring latent information about demonstrated trajectories and learning a policy conditioned upon contextual and latent information, simultaneously?

2. Inferring trajectory modalities

We seek to autonomously infer distinct strategies or modes $\omega$ from the real-world demonstrations we are able to acquire. Following our autonomous driving example, Figure 2 displays the end result we seek to achieve: mode 1 demonstrations are assigned a specific latent code and mode 2 demonstrations are assigned a different latent code. Visually, in Figure 2 it is clear that there are only two types of demonstrations. We would like our approach to discover, first, that there are different modes within the demonstration dataset and, second, how many modes are present.

Inferred Demonstration Modalities in Highway Driving Demonstrations

To do so, we would like to maximize the mutual information $I$ between our embedding mode $\omega$ and the actions taken by our policy $\pi_\theta(a|\omega,s)$, where our policy is conditioned on the state and latent code. We display a representation of the mutual information in Equation 2, where H is the entropy function.


\begin{gather}
I(\omega;\pi(a|\omega,s)) = H(\omega) - H(\omega|\pi(a|\omega,s))
\end{gather}

Maximizing mutual information encourages $\omega$ to correlate with semantic features within the data distribution (i.e., mode discovery), which is exactly what we want! However, maximizing the mutual information between the trajectories and latent code is intractable as it requires access to the true posterior, $P(\omega|s,a)$, representing the distribution of modalities given trajectories. Instead, we can utilize a lower bound of the mutual information and maximize the lower-bound via gradient descent, thereby maximizing the mutual information and incentivizing the policy to utilize $\omega$ as much as possible.


Recall that the KL divergence is a measure of similarity between two probability distributions. We can express the KL divergence between $p$ and $q$ with Equation 3.

\begin{align}
D_{KL} \left( p (\omega | s, a) \ || \ q(\omega | s, a) \right) = \mathop{\mathbb{E}}_{\omega \sim p(\omega | s, a)} \left[ \log \frac{p(\omega | s, a)}{ q(\omega|s,a)} \right]
\end{align}

We display the derivation of the mutual information lower bound in Equation 4. Below, we provide a step-by-step walkthrough of the derivation.
\begin{align}
\begin{split}
I(\omega;\pi(a|\omega,s)) &= H(\omega) - H(\omega|\pi(a|\omega,s)) \\
&= H(\omega)-\mathop{\mathbb{E}}_{a \sim \pi(\cdot| \omega, s)}\big[\mathop{\mathbb{E}}_{\omega \sim p(\omega| s,a)} [- \log p(\omega|s,a)]\big] \\
&= H(\omega)+\mathop{\mathbb{E}}_{a \sim \pi(\cdot| \omega, s)}\big[\mathop{\mathbb{E}}_{\omega \sim p(\omega| s,a)} [\log p(\omega|s,a)]\big] \\
&= H(\omega)+D_{KL}(p(\omega|s,a)||q(\omega|s,a)) + \mathop{\mathbb{E}}_{a \sim \pi(\cdot| \omega, s)}\big[\mathop{\mathbb{E}}_{\omega \sim p(\omega| s,a)} [\log q(\omega|s,a)]\big] \\
&\geq H(\omega) + \mathop{\mathbb{E}}_{a \sim \pi(\cdot| \omega, s)}\big[\mathop{\mathbb{E}}_{\omega \sim p(\omega| s,a)} [\log q(\omega|s,a)]\big] \\
\end{split}
\end{align}

In line 1 of the derivation, we have Equation 2. To transition to line 2 of the derivation in Equation 4, we expand the second entropy term as an expectation. Line 3 of the derivation cancels the negative sign inside the inner expectation with the minus sign in front of it. Line 4 utilizes the identity $\mathop{\mathbb{E}}_{x \sim p_{X}(\cdot)}[\log p(x)] = D_{KL}(P||Q) + \mathop{\mathbb{E}}_{x \sim p_{X}(\cdot)}[\log q(x)]$ in transitioning from line 3. Here, $q$ is an approximate posterior for the true distribution of embeddings given trajectories. In transitioning to line 5, since the KL divergence between two distributions is always non-negative, we replace the equals sign with a greater-than-or-equal-to sign and remove the divergence term. We note that during optimization, we assume that the distribution of embeddings is a static, uniform distribution, and thus the entropy across latent codes, $H(\omega)$, is constant and can be removed. Line 5 displays the final result, the lower bound of the mutual information.
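For completeness, the identity used in line 4 follows directly from the definition of the KL divergence:

\begin{align}
D_{KL}\left(p(\omega|s,a) \ || \ q(\omega|s,a)\right) &= \mathop{\mathbb{E}}_{\omega \sim p(\omega | s, a)} \left[\log p(\omega|s,a) - \log q(\omega|s,a)\right] \\
\Rightarrow \mathop{\mathbb{E}}_{\omega \sim p(\omega | s, a)} \left[\log p(\omega|s,a)\right] &= D_{KL}\left(p \ || \ q\right) + \mathop{\mathbb{E}}_{\omega \sim p(\omega | s, a)} \left[\log q(\omega|s,a)\right]
\end{align}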

Notice that we have now described a substitute objective function that is expressed solely in terms of the latent mode variable $\omega$ and our approximate posterior $q$. Maximizing this objective should make different latent embeddings correspond with different policy behaviors. However, one problem remains: computing $q$ requires sampling from the true posterior $p(\omega|s,a)$, which is still unknown. Therefore, we seek to gradually tease out the true distribution through sampling while training. We denote $\omega$, the discovered latent codes, as person-specific embeddings, as each embedding captures latent features of a specific demonstrator.

3. Simultaneously learning a policy while inferring person-specific embeddings

Our goal is to learn demonstrator policies that are conditioned on person-specific embeddings, $\omega$. Thus, even if heterogeneity is present within the demonstrations, the latent encodings will be able to represent the unique behaviors among demonstrators. The resultant policy is able to adapt to a person’s unique characteristics while simultaneously leveraging any homogeneity that exists within the data (e.g., uniform adherence to hard constraints).

Accordingly, we will want to maximize our ability to imitate demonstrators via a behavior cloning loss and the mutual information between person-specific embeddings and trajectories via a variational information maximization (VIM) loss. We display a general overview of the architecture used during learning from heterogeneous demonstration in Figure 3.

Architecture to Learn from Heterogeneous Demonstrations.

In this figure, we have a policy represented by $\pi$ and parameterized by $\theta$, and an approximate posterior represented by $q$ and parameterized by $\phi$. We start with a set of person-specific embeddings initialized uniformly. As training proceeds, we infer the person-specific embeddings ($\omega$ is a learned latent encoding via gradient descent) and learn a policy representing all demonstrators. We display a sample algorithm in Figure 4 for learning $\theta$, $\phi$, and $\omega$.

Algorithm for Learning from Heterogeneous Demonstrations.

In step 1 of the algorithm, we sample from the set of demonstrators and obtain the trajectories associated with the sampled demonstrator. We initialize a new embedding, $\omega^{(i)}$, for the demonstrator. In steps 2 and 3, we conduct a forward pass through the architecture displayed in Figure 3 and update parameters $\theta$ and $\omega$ via both the behavior cloning and VIM losses. In step 4, we utilize the VIM cost function to update $\phi$. We repeat this process until convergence. Once the algorithm has converged, every demonstrator’s person-specific embedding will accurately represent their modality. Utilizing this latent embedding within the learned policy will produce a representation of the demonstrator’s behavior.
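A rough sketch of one iteration of this procedure is given below. The module interfaces, the loss weight, and the use of a squared error as a stand-in for $-\log q_{\phi}(\omega|s,a)$ (i.e., treating $q$ as a fixed-variance Gaussian) are our own simplifying assumptions, not the exact implementation from the paper.

```python
import torch
import torch.nn.functional as F

def training_step(policy, posterior, omega, trajectories, opt_policy_omega, opt_posterior, lam=1.0):
    """One pass of behavior cloning + variational information maximization for a sampled demonstrator.

    policy:    pi_theta(a | s, omega), returns action logits
    posterior: q_phi(omega | s, a), returns a predicted embedding
    omega:     this demonstrator's person-specific embedding, a learnable tensor of shape (1, embed_dim)
    """
    states, actions = trajectories                                    # tensors for one demonstrator
    logits = policy(states, omega.expand(states.shape[0], -1))
    bc_loss = F.cross_entropy(logits, actions)                        # imitate the demonstrator

    one_hot_actions = F.one_hot(actions, logits.shape[1]).float()
    omega_hat = posterior(states, one_hot_actions)
    vim_loss = F.mse_loss(omega_hat, omega.expand_as(omega_hat))      # surrogate for -log q_phi(omega | s, a)

    opt_policy_omega.zero_grad()
    opt_posterior.zero_grad()
    (bc_loss + lam * vim_loss).backward()
    opt_policy_omega.step()                                           # updates theta and omega
    opt_posterior.step()                                              # updates phi
```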

In conclusion, the main takeaways from our blog post are as follows:

  1. It is important to model the heterogeneity when Learning from Heterogeneous Demonstration (LfHD). We present a framework that can adapt to a person’s unique characteristics while simultaneously leveraging any homogeneity within the data.
  2. We infer person-specific latent embeddings with semantic meaning by maximizing the lower bound of the mutual information between an inferred latent variable and a policy that is conditioned upon the latent variable.

Further Details

If you want to explore more on this, feel free to check our recent NeurIPS paper.


Bootcamp Summer 2020 Week 3 – Value Iteration and Q-learning

January 19, 2021 by Matthew Gombolay and Pradyumna Tambwekar

Value iteration and Q-learning make up two fundamental algorithms of Reinforcement Learning (RL). Many of the amazing feats in RL over the past decade, such as Deep Q-Learning for Atari, or AlphaGo, were rooted in these foundations. In this blog, we will cover the underlying model RL uses to describe the world, i.e. a Markov decision process (MDP), as well as two algorithms for performing RL: value iteration and Q-learning. By the end of this blog post, you should be able to understand the connection between value iteration and Q-learning and how to employ either of these algorithms to learn the optimal policy.

This post is divided into three parts:

  1. Markov decision processes
  2. Value functions and Value iteration
  3. Q-learning and Deep Q-learning

Prerequisites:

The only prerequisites you will need to understand this blog are gradient descent and backpropagation. Both of these topics were covered in our first two blog posts (week 1 and week 2).

Markov Decision Processes

An MDP is a 5-tuple describing an environment consisting of the following five components:

  1. States: A state, $s$ is any predefined momentary instance in the world that an agent can exist in. For the rest of this post, we will use the variable $S$ to represent the set of all possible states in the world with $s \in S$ referring to an individual state.
  2. Actions: An action, $a$, is an event facilitated by the agent that can transition you from one state to another, provided that such a transition is possible in the MDP. We will use $A$ to represent the set of all possible actions in the world, with $a \in A$ referring to an individual action. We note that actions may not have deterministic consequences. For example, flipping a coin may not give you the same result each time! The degree to which actions have deterministic effects is described by the transition function.
  3. Transition function: The transition function is a function that defines the probability of moving to a specific next state, given your current state and a valid action. The transition function, $T$, is mathematically defined as follows: $T: S \times A \times S' \rightarrow [0,1]$.
  4. Reward: The reward function specifies a real number value that defines the efficacy or a measure of “goodness” for being in a state, taking an action, and landing in the next state. Similarly to the transition function, the reward is defined as follows: $R: S \times A \times S' \rightarrow \rm I\!R$. Note that the state you end up in may be uncontrollable since state transitions can be dynamic.
  5. Discount Factor: The discount factor can be specified using $\gamma$, where $\gamma \in [0,1)$. Note the non-inclusive upper bound for the discount factor (i.e., $\gamma \neq 1$). Disallowing $\gamma = 1$ allows for an MDP to be more mathematically robust. Specifically, the goal for RL algorithms is often to maximize discounted reward across time. Consider the case of an infinite-horizon MDP (i.e., the MDP never ends) in which the rewards are always positive. If the discount factor, $\gamma$, is equal to 1, then the sum of future discounted rewards will be infinite, making it difficult for RL algorithms to converge (i.e., to know when they can stop determining which actions should be taken in each state). Thus, the choice of gamma is critical for the success of RL algorithms, e.g., Q-learning and value iteration.

You may have read about a concept known as Markov chains. While Markov chains will not be covered in this post, it is important to understand the distinction between Markov chains and Markov decision processes, as both concepts share the Markov property. Fundamentally, a Markov chain consists of the components of a Markov decision process except actions, rewards, and a discount factor. An agent in a Markov chain is not in control of its movements; the world is in control of the agent’s movements. In a Markov decision process, the agent has an influence on the outcomes. Thus the transition function of a Markov chain is simply defined as $T: S \times S' \rightarrow [0,1]$.

Markov Property: The Markov property holds when the next state depends only on the current state and current action and is independent of the history of states and actions prior to that.

Nota bene: In some MDPs, you may see the initial state distribution included as a part of the MDP. While that is still a valid representation, in this blog, we are going to remain agnostic of the initial state distribution.

Value functions and Value Iteration

The value function, $V(s)$, is a measure of the expected reward you can receive from any given state $s$ given an MDP and a policy describing which actions an agent takes in each state. Formally, a policy, $\pi: S \rightarrow [0,1]^{|A|}$, is a probability distribution over actions, $a \in A$, conditioned on a state, $s$. For the sake of this blog, we will consider a policy to be deterministic (i.e., $\pi(s,a) = 1$ for a single action, $a$, and $\pi(s,a’)=0$ for all other $a’ \neq a$). Having defined a policy, we can now mathematically define the value function for a policy: $$V^{\pi}(s) = E[\sum_{t=0}^{\infty} \gamma^t r_t | \pi, s_0 = s]$$ In a deterministic world, this expectation could be ignored; however, in general, we include the expectation as both the policy (e.g., an agent that doesn’t always like the same thing for breakfast) and the transition function (e.g., how bad traffic is) could describe nondeterministic attributes of the problem we want to model.

The goal of Value Iteration is to find a policy that maximizes the value function: $$\pi^*(s) = argmax_{\pi \in \Pi} V^{\pi}(s), \forall s \in S $$ In this equation $\Pi$ represents the set of all possible policies within the MDP, and $\pi^*(s)$ is the optimal policy. What the value function equation (i.e., $V^{\pi}(s) = E[\sum_{t=0}^{\infty} \gamma^t r_t | \pi, s_0 = s]$) states is that you are going to follow a policy, $\pi$, starting at state, $s$, for up to an infinite number of time steps, $t$, proceeding forward from $t=0$ to $t = \infty$ along a trajectory (i.e., a sequence of state-action pairs, $(\langle s_1,a_1 \rangle , \langle s_2,a_2 \rangle,\ldots)$), where the accumulated reward, $r_t$, is discounted via multiplication with $\gamma^t$. This process of discounting rewards means that, the further you go into the future, rewards have a diminishing impact on value or “goodness” of being in state, $s$. Adjusting $\gamma$ allows you to define your preferences towards optimizing for short-term vs long-term gain. The expectation matters because, as stated in the previous paragraph, transitions can be non-deterministic, thus the expectation provides a normalized estimate of the discounted reward.

Value iteration is a computational algorithm that provides a means of finding the optimal policy, $\pi^*$. The algorithm works by iteratively determining the value of being in each state, $s,$ assuming that the agent takes the best possible action in that state under the current estimate of the value function. The value iteration algorithm is shown in the figure below.

Algorithm 1: Value Iteration

This algorithm iteratively updates the value, $V(s)$, of each state, $s$, until it reaches a point where the change in value is negligible, as defined by the threshold $\Delta_0$. Take note of Line 7 of this algorithm, in which we consider a state, $s$, an action, $a$, and the resulting state, $s'$, in a sequence, $\langle s, a, s' \rangle$. Line 7, called the value update, updates the value of the given state and is reminiscent of Bellman's optimality condition. This condition states $$ V(s) = \max_{a \in A} E [R(s,a,s') +  \gamma V(s')] = \max_{a \in A} \sum_{s'\in S} T(s,a,s')[R(s,a,s') +  \gamma V(s')]$$

When this condition holds for a given state, $s$, we find in Line 8 that $temp - V(s) = 0$. If the condition holds for all states, $s$, then the value function has converged to the optimal value function, $V^{*} = V^{\pi^*}$, and the optimal policy, $\pi^*$, can simply be extracted in Line 11. In a sense, value iteration works by iteratively enforcing (Line 7) that Bellman's optimality condition holds, measuring how far the condition is from holding (Line 8), and terminating once the condition holds within an acceptable margin of error (Line 3).

Upon convergence (i.e., reaching an acceptable level of error, $\Delta < \Delta_0$), value iteration returns a value function, $V^{*}$, that provides an estimate of the future expected reward while following the optimal policy, $\pi^*$, as well as the optimal policy itself. To be perfectly clear, we should note that the actual value function and policy returned are only approximately optimal, and it would be more technically accurate to refer to them as $V \approx V^*$ and $\pi \approx \pi^*$.
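To make Algorithm 1 concrete, here is a minimal Python sketch of tabular value iteration under the conventions above. The dictionary-based transition and reward models (`T` and `R`, keyed by $(s, a, s')$), the default threshold, and all variable names are illustrative assumptions rather than the exact contents of the algorithm figure.

```python
# Minimal sketch of tabular value iteration (illustrative, not the exact
# pseudocode from the algorithm figure above).
def value_iteration(states, actions, T, R, gamma=0.9, delta_0=1e-6):
    """
    states, actions : iterables of discrete states / actions
    T[(s, a, s2)]   : probability of landing in s2 after taking a in s
    R[(s, a, s2)]   : reward for that transition
    Returns an approximately optimal value function V and greedy policy pi.
    """
    V = {s: 0.0 for s in states}           # initialize V(s) arbitrarily (here: 0)
    while True:
        delta = 0.0                        # largest change in V during this sweep
        for s in states:
            # Bellman optimality backup: max over actions of the expected return
            temp = max(
                sum(T.get((s, a, s2), 0.0) * (R.get((s, a, s2), 0.0) + gamma * V[s2])
                    for s2 in states)
                for a in actions
            )
            delta = max(delta, abs(temp - V[s]))
            V[s] = temp
        if delta < delta_0:                # stop once the update is negligible
            break
    # Extract an (approximately) optimal greedy policy from V
    pi = {
        s: max(actions, key=lambda a: sum(
            T.get((s, a, s2), 0.0) * (R.get((s, a, s2), 0.0) + gamma * V[s2])
            for s2 in states))
        for s in states
    }
    return V, pi
```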

Nota bene: In some textbooks, the Bellman update equation may be alternatively defined by considering the reward function to be independent of the next state, $s'$, i.e., $$V(s) \leftarrow  \max_a \left[ R(s,a) + \gamma \sum_{s'} T(s,a,s') V(s') \right]$$ However, the update rule shown in the value iteration algorithm above is a more general representation of the value estimate, because you cannot always assume that, for any given MDP, your reward will not depend on the next state.

Let us now define two related terms: the Q-value and the Q-function. Like the value function, the Q-function is a measure of quality. Unlike the value function, however, the Q-function measures the expected discounted reward for taking a particular action in a given state, i.e., $Q: S \times A \rightarrow \rm I\!R$. We refer to the value returned by the Q-function as the Q-value.

Although Q-learning is the go-to algorithm for estimating Q-values and consequently optimal policies from these Q-values, you may also modify the value iteration algorithm to solve for the Q-values.

Algorithm 2: Value Iteration (to compute Q-values)

There are two primary differences between the two algorithms, both resulting from the introduction of the Q-function. First, since the Q-function depends on both states and actions, we need to loop over all possible actions (Line 6) as well as over the set of all states. Second, you'll notice that the update rule (Line 8) has changed to incorporate the Q-function. Since the estimate of the optimal Q-value, $Q(s',a')$, for the next state, $s'$, depends on the next action, $a'$, we need to replace "$V(s')$" from Line 7 of Algorithm 1 with "$\max_{a'} Q(s',a')$" in Line 8 of Algorithm 2.
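To highlight only what changes relative to the value-function sketch above, here is a minimal, illustrative Q-value variant of the backup; the data structures mirror the earlier sketch and are our own assumptions, not the exact contents of Algorithm 2.

```python
# Illustrative Q-value backup: the only substantive change from the V-based
# sketch is the extra loop over actions and bootstrapping with
# max_{a'} Q(s', a') instead of V(s').
def q_value_iteration(states, actions, T, R, gamma=0.9, delta_0=1e-6):
    Q = {(s, a): 0.0 for s in states for a in actions}
    while True:
        delta = 0.0
        for s in states:
            for a in actions:                                # extra loop over actions
                temp = sum(
                    T.get((s, a, s2), 0.0) * (
                        R.get((s, a, s2), 0.0)
                        + gamma * max(Q[(s2, a2)] for a2 in actions)  # max_{a'} Q(s', a')
                    )
                    for s2 in states
                )
                delta = max(delta, abs(temp - Q[(s, a)]))
                Q[(s, a)] = temp
        if delta < delta_0:
            break
    return Q
```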

We note that Algorithms 1 and 2 are often referred to as flavors of "exact RL" because, unlike in the function approximation algorithms commonly used in RL (e.g., deep learning-based RL), here we are guaranteed to be able to solve for the true, optimal Q-values within a margin of error, $\Delta$, given sufficient computational resources. "Approximate" RL techniques (e.g., deep RL) do not typically have such guarantees.

An important question to answer, then, is, "Why do we not use this in practice instead of Q-learning?" Unfortunately, most real-world applications of RL contain too many states to iterate over even once, let alone the multitude of times required for the algorithm to converge to an acceptable error threshold. Hence, function approximation algorithms, such as Deep Q-learning, are preferred.

Q-Learning and Deep Q-learning

Prior to discussing Q-learning itself, let us define a term called the Bellman Residual.

$$ \delta(s,a,s') = [R(s,a,s') +  \gamma \max_{a'}Q(s',a')] - Q(s,a) $$

The Bellman residual, denoted by $\delta$, measures how wrong the current estimate of the Q-function is for a given transition sequence, $\langle s,a,s' \rangle$. When the Q-function is optimal, we should find that the following modified form of the Bellman optimality condition holds: $$Q(s,a) = E[R(s,a,s') + \gamma \max_{a'\in A} Q(s',a')] = \sum_{s'} T(s,a,s')[R(s,a,s') + \gamma \max_{a'\in A} Q(s',a')]$$

When this condition holds, we should find that $ \delta(s,a,s') = 0 $ for all possible transition sequences. However, when we first initialize a Q-function, as we do in Line 2 of Algorithm 3, our Q-function will almost certainly be wrong, and $ \delta(s,a,s')$ will not equal zero. So, how do we improve our Q-function to be correct?

Q-learning works by "blending" our current estimate of the Q-function, $Q(s,a)$, with a "point estimate" of what the Q-function should be, i.e., $R(s,a,s') +  \gamma \max_{a'}Q(s',a')$. The degree of blending is controlled by a hyperparameter, $\alpha$, in the following equation: $$ Q(s,a) \leftarrow (1 - \alpha) Q(s,a) + \alpha (R(s,a,s') +  \gamma \max_{a'}Q(s',a')) $$

If $\alpha = 1$, then we completely ignore our current estimate of the Q-function. If $\alpha=0$, then we completely ignore the new information. A decent starting place for setting $\alpha$ in practice is to choose $\alpha = 0.99$, which puts most of the emphasis on the new point estimate rather than on the current estimate of the Q-function. However, we encourage you to experiment with this parameter!
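As a quick, illustrative aside (the function and variable names here are our own, not taken from the algorithm figure), the Bellman residual and the blended update for a single observed transition can be written as:

```python
# Illustrative tabular Q-learning update for one observed transition (s, a, r, s2).
# Q is a dict mapping (state, action) pairs to floats; `actions` is the action set.
def q_update(Q, s, a, r, s2, actions, alpha, gamma=0.9):
    target = r + gamma * max(Q[(s2, a2)] for a2 in actions)  # "point estimate" of Q(s, a)
    delta = target - Q[(s, a)]                                # Bellman residual
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target      # blend old estimate with new
    return delta                                              # handy for monitoring convergence
```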

The full algorithm for Q-learning is shown in the algorithm pictured below.

Algorithm 3: Q-Learning

It is important to note that, unlike in the version of the Bellman equation described earlier, Q-learning does not include a transition function in its Bellman update equation! In Q-learning, we instead simulate actions within the environment and use the states visited during the simulation (i.e., the trajectory) to apply the Bellman equation. The simulate function in the algorithm shows how trajectories are sampled in Q-learning. At each timestep of the trajectory, we use an $\epsilon$-greedy sampling strategy to select an action: if a randomly sampled probability is greater than a predefined $\epsilon$ value, then we greedily select the action with the highest current Q-value; otherwise, we randomly sample an action from the space of all actions. Upon selection, we simulate that action and observe the next state and the reward received for that action. Finally, we store this observed information in the memory buffer, update the current state, and repeat this process until we reach a terminal state. The trajectory computed from each simulation is then used to update the Q-values via the Bellman update equation (Line 6 of Algorithm 3).
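A compressed sketch of this sampling-and-updating loop is given below. The environment interface (`env.reset()` and `env.step()`), the episode count, and all variable names are illustrative assumptions; for brevity, the sketch applies the Bellman update to each transition as it is observed rather than first storing a full trajectory in a memory buffer.

```python
import random

# Illustrative Q-learning loop with epsilon-greedy action selection (not the
# exact pseudocode of Algorithm 3). `env` is assumed to expose
# reset() -> state and step(action) -> (next_state, reward, done).
def q_learning(env, states, actions, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = {(s, a): 0.0 for s in states for a in actions}       # initialize the Q-table
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:                                      # roll out one trajectory
            if random.random() > epsilon:                    # exploit: greedy action
                a = max(actions, key=lambda act: Q[(s, act)])
            else:                                            # explore: random action
                a = random.choice(actions)
            s2, r, done = env.step(a)                        # simulate and observe
            # Bellman update on the sampled transition; no transition function needed
            target = r if done else r + gamma * max(Q[(s2, a2)] for a2 in actions)
            Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
            s = s2
    return Q
```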

The absence of a transition function makes Q-learning a model-free RL algorithm, as it does not need any prior knowledge of "the world" to learn the optimal policy. This model-free characteristic is significant because, in practice, you rarely have access to the transition probabilities for all state-action pairs in real-world applications.

With enough time and sampled trajectories, Q-learning will be able to estimate the optimal Q-values for each state-action pair. However, as currently constructed, the Q-learning algorithm requires a lookup table over every possible state-action pair and requires filling in all of those values. The size of the state-action spaces in real-world tasks makes maintaining such a lookup table infeasible. One possible solution to this lack of computational resources is the use of function approximation techniques, e.g., Deep Q-learning. The term "deep" comes from parameterizing Q-values using neural networks. Therefore, instead of learning a literal table of Q-values, our goal shifts towards learning the weights of a neural network that can output the Q-value for any given state-action pair. Employing the square of the Bellman residual as our loss function, we can apply the backpropagation algorithm (covered in Week 2) to learn the weights of the Q-network.

Deep Q-learning
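As a rough sketch of this idea (and not of the exact architecture or training loop in the figure above), a Q-network can be trained by minimizing the squared Bellman residual on batches of sampled transitions. The use of PyTorch, the network sizes, and all names below are our own illustrative assumptions.

```python
import torch
import torch.nn as nn

# Illustrative Q-network: maps a state vector to one Q-value per discrete action.
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))  # e.g., 4-D state, 2 actions
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
gamma = 0.99

def dqn_update(states, actions, rewards, next_states, dones):
    """One gradient step on the squared Bellman residual for a batch of transitions."""
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)    # Q(s, a) for the taken actions
    with torch.no_grad():                                              # treat the point estimate as a fixed target
        max_next_q = q_net(next_states).max(dim=1).values
        target = rewards + gamma * max_next_q * (1 - dones)            # no bootstrapping past terminal states
    loss = ((target - q_sa) ** 2).mean()                               # squared Bellman residual
    optimizer.zero_grad()
    loss.backward()                                                    # backpropagation through the Q-network
    optimizer.step()
    return loss.item()
```

In practice, Deep Q-learning implementations typically add further machinery, such as a replay buffer and a separate target network, which we leave for the future blog mentioned below.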

When compared to Q-learning, there are three primary reasons why Deep Q-learning is better suited for real-world applications:

  1. Not enough memory: Q-learning requires storing a lookup table of all states and actions, leading to an infeasible RAM requirement for most modern applications.
  2. Not enough time: Deep Q-learning is often significantly faster to train because of the ability to incorporate techniques like batched gradient descent and adaptive optimization. In addition, Deep Q-learning does not require convergence of the Q-value for each individual state-action pair.
  3. Interpolation: Deep Q-learning is a function approximation algorithm. Because the Q-network approximates Q-values for states and actions, we can assume that the network can interpolate Q-values for similar state-action pairs, i.e., $ Q(s,a) \approx Q(s',a')$ if $(s,a) \approx (s',a')$. In practice, this makes Deep Q-learning more efficient than seeking convergence for each individual state-action pair. However, there are pathological cases where this assumption does not hold, for example, when the slightest change in the state or action results in a drastically different outcome.

While we have not yet covered enough to fully implement Deep Q-learning and understand the tips and tricks involved in practically implementing a Deep Q-learning model, please stay tuned for a future blog on that topic!

Key Takeaways from this blog:

  1. A Markov Decision Process (MDP) is a tuple used to describe an environment and is made up of the components used in reinforcement learning algorithms.
  2. The goal of most reinforcement learning algorithms is to maximize the expected cumulative discounted reward, which is captured by the value function.
  3. Value iteration is an iterative algorithm that uses the Bellman equation to compute the optimal MDP policy and its value.
  4. Q-learning, and its deep learning counterpart Deep Q-learning, are model-free RL algorithms that learn the optimal MDP policy using Q-values, which estimate the "value" of taking an action in a given state.

