
Bootcamp Summer 2020 Week 7 – DQN And The Deadly Triad, Or, Why DQN Shouldn’t Work But Still Does

July 9, 2022 by Laura Strickland and Matthew Gombolay

Deep Q-Networks (DQN) are an effective deep reinforcement learning method for solving many sequential decision-making problems. DQN is not, however, foolproof: it can be quite difficult to train effectively, or it can simply fail to learn. The authors of the paper that inspired this post (van Hasselt et al., 2018) discuss the three main reasons that DQN can fail—the so-called Deadly Triad—and why DQN can sometimes work anyway.

The components of the “Deadly Triad” are:

  1. Function Approximation
  2. Bootstrapping
  3. Off-policy Learning

In this blog post, we define and discuss each of these and some ways in which they can be mitigated.

Why is this important? Everything we know suggests that DQN should not work, and if it does, we got lucky with our setup. But DQN does work, at least some of the time, and more reliably than one might expect. This suggests that we don’t yet fully understand what is going on in DQN. In the remainder of this blog post, we cover what the components of the deadly triad are, the problems they can cause, and some of the ways we can sidestep their undesirable effects.

1. Function Approximation

Function approximation in Reinforcement Learning gives us the ability to better handle large state and action spaces by approximating the function we are trying to learn (e.g. a Q-function) rather than learning the function exactly. First, we cover the basics of how RL can work without function approximation, then look at how function approximation is applied when the state and/or action spaces become too large to sample effectively using exact techniques, such as value iteration or policy iteration.

The Q-function $Q(s, a)$—that is, the state-action value function—is a mapping of states and actions to a value in the set of all real numbers:

\begin{align}
Q: S \times A \rightarrow \mathcal{R}
\label{q_mapping}
\end{align}

Recall that value iteration evaluates the utility of a state as the maximum Q-value at that state, over all the actions the agent can take in that state, as we see in Equation \ref{value_iter}.

\begin{align}
V(s) = \max_a Q(s, a)
\label{value_iter}
\end{align}

When we look at methods that use the Q-function, such as value iteration, we form a table for $V$ (see left of Fig. 1) and for the Q-values (see right of Fig. 1) and iteratively update those tables as our estimates of the utility of each state and the expected value of each state-action pair improve during training.

Left side: A table showing the utility of specific states in a simple discrete (or discretized) state space. Right side: A table showing the Q-value of specific state-action pairs in a scenario with simple discrete (or discretized) state and action spaces. (Note that these values are for illustrative purposes only.)
The utility of each state and the Q-value of each state-action pair can be represented in tabular forms. Note that the V(S) values in the left table correspond to the Q-values of the maximizing actions, according to Equation \ref{value_iter}.

Under the tabular (i.e. exact) setting, we can, given a sufficient number of iterations, learn the optimal Q-function, value function, or policy.

Usually, however, our state spaces are far too large for tabular Q-learning to be remotely practical. It simply takes far too long to cover the entire table thoroughly enough to reach convergence to the optimal policy, or anything even close. Even high-performance computers will lack the RAM required to store a large Q-table in memory. Rather than wait years for our learner to visit every state-action pair in the Q-table enough to allow the values to converge, we instead try to form an estimate of what this table would tell us using function approximation.

This function approximation was originally performed by effectively placing multi-dimensional Gaussian kernels over our $Q(s, a)$ table to represent the action choices in the state space with probability distributions. These Gaussians interpolate a desired output value (e.g., a Q-value) between inputs (e.g., state-action pairs). The number of Gaussian kernels $k$ that we need to adequately cover all of the states and actions can be far smaller than the size of the $Q$-table, $|S| \times |A|$.

Modern approaches to function approximation employ neural networks. The neural network learns a surface, such as the one shown in Figure 2, that covers what the Q-table did and estimates the utility of taking a specific action in a specific state, just as the Q-table does. Fitting a surface over the state $\times$ action space instead of a table has an additional benefit: when the agent finds itself in a state it has not encountered before, it can still interpolate an estimate of the utility of that state using the function approximator (i.e., the neural network). An important point to remember, though, is that function approximation, despite this benefit, is almost never going to learn the Q-function perfectly. We can apply gradient descent to get closer to the true value function, but our function approximator will almost never be perfectly accurate.

An example of a surface learned by a neural network trained with DQN.

Thus, function approximation does not perform true interpolation between the known-value points, but it comes close.
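To make this concrete, here is a minimal sketch of a neural-network Q-function approximator in PyTorch. The architecture, layer sizes, and names are illustrative assumptions only, not those of the original DQN.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Minimal Q-function approximator: maps a state vector to one Q-value per discrete action."""
    def __init__(self, state_dim: int, num_actions: int, hidden_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_actions),  # one output per action: Q(s, a) for each a
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

# Usage: estimate Q-values for a (batched) state and pick the greedy action.
q_net = QNetwork(state_dim=4, num_actions=2)
state = torch.randn(1, 4)
q_values = q_net(state)              # shape: (1, 2)
greedy_action = q_values.argmax(dim=1)
```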

Function approximation is also used in policy gradient methods; neural networks can be used to represent the policy and the transition function, as illustrated in the two figures below.

Representation of the policy network in policy gradient setups.
Function approximation can represent the state transition function in policy gradient learning scenarios.

In a previous entry, we introduced the Bellman Residual (Equation \ref{bellman-res}). If we are parameterizing $Q$ by some $\theta$ such that $Q_{\theta}: S \times A \longrightarrow \mathcal{R}$, then we can use the Bellman residual as the loss function we use to train $Q_{\theta}$.

\begin{align}
L(\theta) = \mathbb{E} \left[ \left((r_t + \gamma \max_a Q_{\theta}(s_{t+1}, a)) - Q_{\theta}(s_t, a_t)\right)^2 \right] \label{bellman-res}
\end{align}

Minimizing Equation \ref{bellman-res} reduces the error between our current Q-value estimates at the state-action pairs the agent has visited and the bootstrapped targets for those pairs.

What is the problem with this? When we apply gradient descent, we assume that $(r_t + \gamma \max_a Q_{\theta}(s_{t+1}, a))$ from Equation \ref{bellman-res} is a constant with respect to $\theta$. Thus, when we take the gradient of $L(\theta)$, the only term we differentiate is $Q_{\theta}(s_t, a_t)$. We will see why this creates challenges for us in the next section.
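In code, this assumption is typically expressed with a stop-gradient around the target. The sketch below is a minimal, hypothetical PyTorch illustration (the batch layout and network interface are our own assumptions), not the loss from any particular implementation.

```python
import torch

def dqn_loss(q_net, batch, gamma=0.99):
    """Bellman-residual loss with the bootstrapped target held constant with respect to theta."""
    s, a, r, s_next, done = batch  # a: LongTensor of action indices; done: float 0/1 end-of-episode flags
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)        # Q_theta(s_t, a_t)
    with torch.no_grad():                                       # target treated as a constant w.r.t. theta
        target = r + gamma * (1 - done) * q_net(s_next).max(dim=1).values
    return torch.mean((target - q_sa) ** 2)
```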

2. Bootstrapping

With respect to the Deadly Triad, we define bootstrapping as taking information we already have and trying to do more with it than we should be able to do. Let’s look at the instability this can cause and the ways practitioners have come up with to mitigate these instabilities.

When we looked at policy gradients and REINFORCE, we learned that the update for the policy parameters $\phi$ is defined by Equation \ref{policy-grad-eqn}, where $i$ is the designator of the trajectory from which the states and actions are drawn (for the case of doing multiple policy rollouts), and $t$ is the timestep.

\begin{align}
\Delta \phi = A_{it} \nabla_{\phi} \log \pi_{\phi}(s_{it}, a_{it})
\label{policy-grad-eqn}
\end{align}

We compute $A_{it}$, the cumulative reward term, as shown in Equation \ref{a-i-t}.

\begin{align}
A_{it} = \sum_{t'=t}^{T_i} \left(\gamma^{t'-t} r_{t'} \right)
\label{a-i-t}
\end{align}

How is this applied? When computing $\Delta \phi$ and $A_{it}$, we look at each timestep in the trajectory from timestep $t$ until the end, identify what the reward is, and back up the (discounted) reward across prior timesteps. This means that a reward received many timesteps into the future still affects this cumulative reward term, $A_{it}$. In a scenario with sparse rewards, backing up rewards to influence the evaluations of earlier timesteps in this manner is particularly helpful, as it helps the agent learn the steps it took along the way to getting a reward at some specific timestep. This gives the learner information from later in the trajectory instead of it needing to rely on a Q-function to tell it whether it is going in a (positively) rewarding direction.
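A small sketch of this backup, written as a helper (our own) that turns one trajectory's reward list into the discounted cumulative reward at every timestep:

```python
def discounted_returns(rewards, gamma=0.99):
    """Compute A_it = sum_{t'=t}^{T} gamma^(t'-t) * r_{t'} for every timestep t of one trajectory."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):   # sweep backward so each step reuses the later sum
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Example: a sparse reward arriving only at the final step still credits earlier timesteps.
print(discounted_returns([0.0, 0.0, 0.0, 1.0], gamma=0.9))  # [0.729, 0.81, 0.9, 1.0]
```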

Notably, an alternative way to compute the gradient of $\phi$ is Equation \ref{alt-phi-grad}.

\begin{align}
\Delta \phi = Q_{\phi}(s, a) \nabla_{\phi} \log \pi_{\phi}(s_{it}, a_{it})
\label{alt-phi-grad}
\end{align}

When we are training $Q_{\phi}(s, a)$ with the Bellman residual and holding $(r_t + \gamma \max_a Q_{\phi}(s_{t+1}, a))$ constant, we encounter what we call the “chasing one’s tail” or “runaway” effect, which we discuss in the blog post about DQN. Specifically, in Equation \ref{alt-phi-grad}, we are using the Q-function as its own target: we regress $Q(s, a)$ towards the target, $R(s, a) + \gamma \max_{a'} Q(s', a')$, which itself contains $Q$!
That is what bootstrapping means in the context of the Deadly Triad: using the very estimate you are trying to learn as the target you are learning it from.

What does Q-learning without bootstrapping look like? Instead of computing $L(\theta)$ as we did in Equation \ref{bellman-res}, we compute the loss by looking $n$ steps into the future, as we demonstrate in Equation \ref{n-step-lookahead}.

\begin{align}
\hspace{-3.75em}L(\theta) = \mathbb{E} \left[ \left(\sum_{t' = t}^{t + n - 1} \gamma^{t' - t} r_{t'} +
\gamma^{n} \max_a Q(s_{t+n}, a) - Q_{\theta}(s_t, a_t) \right)^2 \right] \label{n-step-lookahead}
\end{align}

We select $n$ such that the $n$th step will be beyond the final timestep of the episode, which simplifies Equation \ref{n-step-lookahead} to Equation \ref{n-step-lookahead-simp}.

\begin{align}
L(\theta) = \mathbb{E} \left[ \left(\sum_{t' = t}^{t + n - 1} \gamma^{t' - t} r_{t'} - Q_{\theta}(s_t, a_t) \right)^2 \right] \label{n-step-lookahead-simp}
\end{align}
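A sketch of how such an n-step target might be assembled from one stored episode; the helper and its arguments are our own illustrative assumptions:

```python
def n_step_target(rewards, q_next, t, n, gamma=0.99):
    """n-step lookahead target for timestep t.

    rewards: list of rewards r_t, r_{t+1}, ... for one episode
    q_next:  max_a Q(s_{t+n}, a), or None if s_{t+n} falls past the end of the episode
    """
    horizon = min(t + n, len(rewards))
    target = sum(gamma ** (tp - t) * rewards[tp] for tp in range(t, horizon))
    if q_next is not None and t + n < len(rewards):
        target += gamma ** n * q_next   # bootstrap only if the episode has not yet ended
    return target
```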

Q-learning is known for having an overestimation problem: the $\max$ operation in the target biases the Q-value estimates of states to the high side, sometimes to the point of drowning out the reward signal. Sometimes this problem can be fixed by hyperparameter tuning, but not always. So how do we keep the bias caused by bootstrapping at bay?

How to fix bootstrapping?

A common tactic to alleviate the bias caused by bootstrapping is to apply target Q-learning; that is, we maintain two Q-networks and calculate the Bellman residual as shown in Equation \ref{loss-fcn-targ}.

\begin{align}
L(\theta) = \mathbb{E} \left[ \left((r_t + \gamma \max_{a'} Q_{\phi}(s_{t+1}, a')) - Q_{\theta}(s_t, a_t) \right)^2 \right] \label{loss-fcn-targ}
\end{align}

The difference between Equations \ref{bellman-res} and \ref{loss-fcn-targ} is that Equation \ref{loss-fcn-targ} uses the target Q-network, $Q_{\phi}$, to compute the value of the maximizing action at time $t+1$, rather than the same network, $Q_{\theta}$, that appears in the latter part of the loss. We update the target Q-network as the main Q-network trains, as in Equation \ref{target-update}, but it lags behind; $\alpha$ is a small scalar. This lagging target-network update can help to stabilize the runaway chasing-one’s-tail effect.

\begin{align}
\phi \leftarrow \phi \left(1 - \alpha\right) + \alpha \theta
\label{target-update}
\end{align}
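In code, this lagging update is a simple per-parameter average. A sketch, assuming the networks are PyTorch modules like the Q-network sketched earlier:

```python
def soft_update(target_net, online_net, alpha=0.005):
    """phi <- (1 - alpha) * phi + alpha * theta: the target network slowly tracks the learner."""
    for phi, theta in zip(target_net.parameters(), online_net.parameters()):
        phi.data.copy_((1.0 - alpha) * phi.data + alpha * theta.data)
```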

Another approach to ameliorating the problems of bootstrapping is Inverse Double Q-learning. The $\max_{a'}$ term in the Bellman residual equation (Equation \ref{bellman-res}) is replaced with $Q_{\theta}(s_{t+1}, \underset{a'}{\operatorname{argmax}}~ Q_{\phi}(s_{t+1}, a'))$, where $Q_{\theta}$ is the “real” (learner) network and $Q_{\phi}$ is the target network. Equation \ref{inv-double-q} is the resulting loss function for Inverse Double Q-learning.

\begin{align}
\hspace{-3.25em}L(\theta) = \mathbb{E} \left[ \left((r_t + \gamma Q_{\theta}(s_{t+1}, \underset{a'}{\operatorname{argmax}}~ Q_{\phi}(s_{t+1}, a'))) - Q_{\theta}(s_t, a_t)\right)^2 \right] \label{inv-double-q}
\end{align}

In this case, we still use the learner network to estimate the Q-value, but rather than picking the maximizing action in state $s_{t+1}$ according to the learner network, we use the target network to select the maximizing action.

Yet another approach to ameliorate bootstrapping, Double Deep Q-Networks (DDQN), replaces the $\max$ term in the Bellman residual with $Q_{\phi}(s_{t+1}, \underset{a'}{\operatorname{argmax}}~ Q_{\theta}(s_{t+1}, a'))$, a term that uses the target network (subscript $\phi$) to evaluate the value of a state-action pair and uses the learner network (subscript $\theta$) to determine which action is the maximizing action in that state. The resulting loss function for DDQN is given in Equation \ref{ddqn-targ}.

\begin{align}
\hspace{-3.25em}L(\theta) = \mathbb{E} \left[ \left((r_t + \gamma Q_{\phi}(s_{t+1}, \underset{a'}{\operatorname{argmax}}~ Q_{\theta}(s_{t+1}, a'))) - Q_{\theta}(s_t, a_t)\right)^2 \right] \label{ddqn-targ}
\end{align}

The advantage of DDQN’s approach is that we still use a target network, but action selection is decoupled between the target and learner networks. DDQN is generally accepted as the current “best” method for ameliorating the effects of bootstrapping. An additional variant on this approach is to use two target networks, evaluate which action to select using both target networks, and choose the action that gives the lower Q-value.
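A sketch of the DDQN target described above, assuming q_online and q_target are networks that map a batch of states to per-action Q-values (the function and its signature are our own, not a reference implementation):

```python
import torch

def ddqn_target(q_online, q_target, r, s_next, done, gamma=0.99):
    """DDQN: the online (learner) network picks the action, the target network evaluates it."""
    with torch.no_grad():
        best_action = q_online(s_next).argmax(dim=1, keepdim=True)        # argmax_a' Q_theta(s', a')
        next_value = q_target(s_next).gather(1, best_action).squeeze(1)   # Q_phi(s', argmax ...)
    return r + gamma * (1 - done) * next_value
```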

In general, all of these methods of mitigating the chasing-one’s-tail problem of bootstrapping involve using a target network and decoupling the action selection from the evaluation of the state-action pair’s value in calculating the Bellman residual.

3. Off-policy Learning

In the context of the Deadly Triad, we primarily use the term “off-policy learning” to refer to RL setups that utilize experience replay during training. Specifically, we utilize data generated using an older version of our Q-function under an $\epsilon$-greedy sampling strategy to improve our current estimate of the Q-function when we operate under it with $\epsilon = 0$. For more detail on on- and off-policy learning, please see the blog post on the subject. As that post discusses in the context of Deep Deterministic Policy Gradients (DDPG), in off-policy learning $Q_{\theta}(s, a)$ is not a critic of an arbitrary policy network $\pi'(\cdot | s)$. Instead, it is a critic of $\pi^*(s) = \underset{a}{\operatorname{argmax}}~Q(s,a)$. The process by which we generate the state-action pairs used to train our Q-function can therefore be different from what the critic is evaluating; that is, we decouple sampling/exploration from what the critic is judging. This decoupling is generally considered preferable for sample-efficiency reasons, among others. Off-policy learning, however, is difficult to do well in the context of experience replay.

What is Experience Replay?

In experience replay, we have a database, $D = \{ \langle s_0, a_0, r_0, s_0' \rangle, \langle s_1, a_1, r_1, s_1' \rangle, \dots \}$, in which we collect sample experiences and from which we train our network. These databases, often called replay buffers, are key components of many RL frameworks.

Even though we train from a replay buffer of past experiences, we can modulate the degree to which we train our network off-policy. Q-learning is not entirely off-policy: when taking the expectation in the Bellman residual, the states over which we compute the expectation must be drawn from somewhere. Ideally, that distribution would be uniform over all states, as in value iteration; instead, we draw the state-action pairs from the replay buffer, which is biased towards some state-action pairs over others based upon our sampling history. The replay buffer is skewed towards the policy used to generate the experiences within it.
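For reference, a minimal uniform-sampling replay buffer might look like the following sketch (a simplified representation of our own, not a specific library's API):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer of <s, a, r, s', done> transitions with uniform sampling."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest experiences are evicted first

    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # Uniform sampling; note the buffer contents are still skewed toward the behavior policy.
        return random.sample(self.buffer, batch_size)
```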

In the paper that inspired this blog post, van Hasselt et al. explore prioritizing the experiences in their agent’s replay buffer by setting the probability of sampling a specific transition from the replay buffer to be proportional to its error—the Bellman residual. That is, they rank the samples in the replay buffer according to how wrong the function approximator is when evaluating them and then preferentially sample the points that are ranked as most incorrect, so that the learner can fix its Q-value estimates for those experiences. While this approach seems reasonable—a human who is very unskilled at a specific task and practices that task is sure to improve their performance, even if only a little—it actually increases the likelihood of the Q-network diverging when compared to training the Q-network on uniformly-sampled experiences. The takeaway here is that, while prioritizing specific examples in a replay buffer is an interesting area of research, if we just want our learner to learn, sampling non-uniformly from our replay buffer could make things worse.

An additional divergence risk is the size of the neural network itself. In general, for DQN, the risk of the function approximator diverging increases as the network size increases. Conversely, for DDQN, the risk of the network diverging is agnostic to network size.

Conclusions

DQN and its variants are frequently used because they are widely applicable and often work well in practice. The Deadly Triad—function approximation, bootstrapping, and off-policy learning—can all cause complications in the application of DQN, however; in this blog post, we covered these challenges and common approaches to handling them. The main takeaways from this blog post are:

  1. Function approximation does not have the same nice guarantees that tabular learning has, but it is far more flexible when the state and/or action spaces are large.
  2. To ameliorate the instability bootstrapping can cause, consider either using a multi-step update in the Bellman residual or using DDQN.
  3. When using a replay buffer, sample from the replay buffer uniformly—don’t try to prioritize any experiences over others.

Many thanks to Ruisen “Eric” Liu for editing this blog post and providing suggestions for how to improve it.

References

  1. Mnih, Volodymyr, et al. “Playing Atari with Deep Reinforcement Learning.” arXiv preprint arXiv:1312.5602 (2013).
  2. van Hasselt, Hado, et al. “Deep Reinforcement Learning and the Deadly Triad.” arXiv preprint arXiv:1812.02648 (2018).
  3. Sutton, Richard S., and Andrew G. Barto. Reinforcement Learning: An Introduction, 2nd ed. The MIT Press, Cambridge, MA, 2018.
  4. Lillicrap, Timothy P., et al. “Continuous Control with Deep Reinforcement Learning.” International Conference on Learning Representations (2016).


Bootcamp Summer 2020 Week 4 – On-Policy vs Off-Policy Reinforcement Learning

February 28, 2022 by mschrum3 and Matthew Gombolay

 

Reinforcement learning algorithms can be divided into two main types: on-policy and off-policy. Understanding the differences between on-policy and off-policy learning is essential to understanding and implementing various reinforcement learning algorithms. Below we will discuss the key differences in on-policy and off-policy learning, the algorithms that employ these two different approaches, and the implications of these learning approaches.

  1. What is on-policy versus off-policy?
  2. Examples of on-policy and off-policy algorithms
  3. Implications
  4. Advantages and Disadvantages
  5. Takeaways
Break-down of types of reinforcement learning algorithms.

 

1. What is on-policy versus off-policy?

On-policy and off-policy learning fall under the category of model-free reinforcement learning algorithms, meaning that we do not have access to the transition probability distribution. This is opposed to model-based methods such as Monte-Carlo Tree Search (MCTS).

To begin our discussion of on-policy and off-policy reinforcement learning, we will first introduce some terminology. In a reinforcement learning setting, an agent interacts with an environment, moving from state, $s$, to $s'$ via action, $a$, and transition probability, $T(s,a,s')$. The agent’s policy, $\pi(s)$, dictates the action that an agent takes in state, $s$. The objective of reinforcement learning is to find the optimal policy, $\pi^*$, which maximizes the expected future discounted reward. The Q-function, $Q(s,a)$, describes the value of a state-action pair. If we know the optimal Q-function, $Q^*$, which describes the maximum expected reward of any given state-action pair, then we can extract the optimal policy according to Eq. 2.

\begin{equation}
Q^*(s,a)=R(s,a)+\gamma \mathop{\mathbb{E}}_{s'}[V^*(s')]
\end{equation}

\begin{equation}
\pi^*(s)=argmax_{a}Q^{*}(s,a) \quad \forall s
\end{equation}

 

To understand the difference between on-policy learning and off-policy learning one must first understand the difference between the behavior policy (i.e., sampling policy) and the update policy.

Behavior Policy: The behavior policy is the policy an agent follows when choosing which action to take in the environment at each time step. In Q-learning, our behavior policy is typically an $\epsilon$-greedy strategy. In an $\epsilon$-greedy behavior policy, an agent chooses the action dictated by the current policy, $\pi(s)$, with probability $1-\epsilon$ and a random action with probability $\epsilon$. The behavior policy generates actions and determines how the agent interacts with the environment.

Update Policy: The update policy is central to understanding the difference between on-policy and off-policy learning. The update policy is how the agent updates the Q-function. In other words, it dictates how the agent derives the state-action pairs used to calculate the difference between the target Q-value and the current predicted Q-value, also known as the TD-error. The TD-error is then used to update the Q-function.

This brings us to the key difference between on-policy and off-policy learning: On-policy algorithms attempt to improve upon the current behavior policy that is used to make decisions, and therefore these algorithms learn the value of the policy carried out by the agent, $Q^{\pi}$. Off-policy algorithms learn the value of the optimal policy, $Q^*$, and can improve upon a policy that is different from the behavior policy. Determining whether the update and behavior policies are the same or different can give us insight into whether the algorithm is on-policy or off-policy. If the update policy and the behavior policy are the same, this suggests, but does not guarantee, that the learning method is on-policy. If they are different, this suggests that the learning method is off-policy.

Let’s take a look at some examples.

2. Examples

Here we discuss several examples of on-policy versus off-policy algorithms and highlight the key differences between them. The important lines in each algorithm are shown in red.

Below is the pseudo-code for SARSA (State, Action, Reward, State, Action), an on-policy algorithm.

SARSA often utilizes an $\epsilon$-greedy behavior policy and follows the same policy when updating the Q-function, meaning that $a’$ is also selected based upon an $\epsilon$-greedy strategy as shown in Line 7. Thus, because SARSA learns the quality of the behavior policy rather than the quality of the optimal policy, SARSA is an on-policy algorithm.
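As a rough stand-in for the update at the heart of that pseudo-code, a tabular SARSA step might look like the sketch below; the dictionary representation of Q (e.g., a defaultdict(float) keyed by (state, action)) and the $\epsilon$-greedy helper are our own illustrative choices:

```python
import random

def epsilon_greedy(Q, s, actions, eps=0.1):
    """Pick a random action with probability eps, otherwise the greedy action under Q."""
    return random.choice(actions) if random.random() < eps else max(actions, key=lambda a: Q[(s, a)])

def sarsa_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99, eps=0.1):
    """On-policy: the next action a' is drawn from the same epsilon-greedy behavior policy."""
    a_next = epsilon_greedy(Q, s_next, actions, eps)
    td_error = r + gamma * Q[(s_next, a_next)] - Q[(s, a)]
    Q[(s, a)] += alpha * td_error
    return a_next   # the agent actually executes a' next, keeping behavior and update aligned
```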

Below is the pseudo-code for Q-learning.

The key difference between Q-learning and SARSA is the $\max_{a'} Q(s',a')$ term found in the Q-learning algorithm. This $\max$ term means that Q-learning is learning the value of the optimal policy. SARSA’s Q-function models the Q-values of the behavior policy, whereas Q-learning learns the optimal Q-values. This distinction is what makes SARSA on-policy and Q-learning off-policy.
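For comparison, a sketch of the corresponding tabular Q-learning update (same hypothetical dictionary representation as the SARSA sketch above); the bootstrap uses the max over actions rather than the action the behavior policy will actually execute:

```python
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    """Off-policy: the update bootstraps from the greedy (max) action, regardless of what the agent does next."""
    best_next = max(Q[(s_next, a_next)] for a_next in actions)
    td_error = r + gamma * best_next - Q[(s, a)]
    Q[(s, a)] += alpha * td_error
```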

Let’s take a look at another algorithm, Deep Deterministic Policy Gradient (DDPG). DDPG is an actor-critic method. This means that it concurrently learns both a policy, represented by the actor, and a Q-function, represented by the critic. DDPG also utilizes experience replay, in which it stores state-action-reward-next-state tuples in a buffer and samples these examples during training. We focus on the part of DDPG that is relevant to our on-policy versus off-policy discussion. Below is a subsection of the pseudocode. $\pi_{\theta}$ describes the policy the agent is learning, parameterized by $\theta$, and the Q-function is parameterized by $\phi$.

We see in Line 6 that DDPG utilizes a stochastic behavior policy defined by the actor $\pi$ plus exploration noise. What about the update policy? The update policy is shown in Line 11. Unlike in Q-learning, there is no $\max$ when selecting the next action in Line 11. Instead, we see that we are using the actor $\pi_{\theta}$ to select $a'$, which sounds like DDPG is optimizing its current policy and is therefore an on-policy algorithm. So does this mean that, because DDPG updates the Q-function via $Q_{\phi}(s',\pi_{\theta}(s'))$ instead of $\max_{a'} Q_{\phi}(s',a')$, DDPG is on-policy? No! Based on our definition, an on-policy algorithm improves upon the current policy, whereas an off-policy algorithm learns the value of the optimal policy. With DDPG, as time goes to infinity, the actor selects the action which maximizes the Q-function, as shown in Line 12. Because, at convergence, $\pi_{\theta}$ is optimal and therefore $Q_{\phi}(s',\pi_{\theta}(s'))=\max_{a'} Q_{\phi}(s',a')$, DDPG is evaluating the optimal policy, making it an off-policy algorithm.
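A sketch of the critic target just described, assuming hypothetical actor and critic modules with these signatures; note the absence of a max over actions:

```python
import torch

def ddpg_critic_target(actor, critic_target, r, s_next, done, gamma=0.99):
    """The critic regresses toward r + gamma * Q_phi(s', pi_theta(s')) rather than a max over actions."""
    with torch.no_grad():
        a_next = actor(s_next)                                 # deterministic action from the current actor
        next_q = critic_target(s_next, a_next).squeeze(-1)     # Q_phi(s', pi_theta(s'))
    return r + gamma * (1 - done) * next_q
```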

To the right is a list of other common on-policy and off-policy algorithms. The same analysis of the behavior policy and the update policy can be used to verify the type of learning that each of these algorithms employs.

3. Implications

So what are the implications of these differences? In the end does the type of learning matter? When should we use which algorithm?

This figure depicts the cliff walking domain. The SARSA agent learns the safer path shown in blue and the Q-learning agent learns the optimal path shown in red [1].

Different Policy Outcomes: First of all, these different training methods can result in different behaviors because on-policy methods learn $Q^\pi$ and off-policy methods learn $Q^*$. For example, because the update policy in Q-learning follows a greedy policy, it assumes that the agent will act optimally in the future, and therefore Q-learning will learn the optimal policy as long as all states and actions are sufficiently experienced. As shown in the above figure, this means that the cliff-walking agent will learn to take the optimal path from the start to the goal even though this path may be dangerous if the agent’s actions are stochastic.

SARSA, on the other hand, assumes that the agent will follow the current policy in the future. This means that when updating the Q-function, we assume that the cliff-walking agent will at times jump over the cliff if it travels too close to the edge. Therefore, SARSA learns a policy which is safer and ensures that the agent will avoid the cliff. SARSA only learns a near-optimal policy that depends on the behavior strategy. For example, if the behavior strategy is $\epsilon$-greedy, then SARSA will learn an optimal $\epsilon$-greedy strategy. To more closely approximate the optimal policy, one can decay $\epsilon$ to decrease exploration over time. SARSA is guaranteed to converge to the optimal policy if the behavior policy converges to a greedy policy and all states and actions are visited an infinite number of times.

 

This figure shows the rewards received by an agent trained via SARSA and an agent trained via Q-learning in the cliff walking domain. In this domain, SARSA achieves higher rewards than Q-learning despite learning a sub-optimal policy [1].

Although Q-learning learns the optimal strategy, SARSA may have better online performance. As shown in the plot, SARSA obtains higher episodic rewards on average than Q-learning in the cliff domain due to the fact that the agent takes a safer route. However, this may not be the case for all domains.

When to employ which learning strategy: If, for example, one is utilizing reinforcement learning to train a physical robot then one may opt for a more conservative and safer learning scheme such as the on-policy SARSA. SARSA also tends to be more stable while training. However, if training in simulation, Q-learning may be more effective since it is more likely to learn the optimal policy and less likely to become trapped in a local minimum. Additionally, off-policy algorithms can take advantage of experience replay when the behavior policy is changing or even if the behavior policy is generated from a source other than the learning agent (e.g., a human demonstrator). On-policy algorithms can only utilize experience replay when the behavior policy is static. Experience replay allows us to utilize historical data when learning which can help de-correlate training examples and is useful when gathering experience is costly. Additionally, off-policy learning is agnostic to how the data is generated and can therefore utilize data generated from other sources. Consequently, off-policy learning may be preferable in situations in which the data must be generated in a way that does not follow the current policy, for example, via human demonstrations.

4. Advantages and Disadvantages

5. Takeaways

  1. To identify if a reinforcement learning algorithm is off-policy or on-policy, one must first identify the behavior policy and the update policy. If the algorithm attempts to improve the behavior policy, it is on-policy. If however, the algorithm learns the optimal policy, then it is off-policy.
  2. SARSA is an example of on-policy learning and Q-learning is an example of off-policy learning. Algorithms such as DDPG, which learn the quality of the optimal policy, are off-policy strategies.
  3. The strategies that result from on-policy learning versus off-policy learning can differ in a number of ways, including how safe the policy is, its convergence guarantees, and its online performance.

References
[1] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. The MIT Press, second edition, 2018.

 


AAMAS 2022 Tutorial – Graph Neural Networks: Learning Representations of Robot Team Coordination Problems

January 18, 2022 by Matthew Gombolay and Zheyuan Wang

Tutorial at the International Conference on Autonomous Agents and Multi-Agent Systems 2022

Overview

Robot teams are increasingly being deployed in environments, such as manufacturing facilities and warehouses, to save cost and improve productivity. To efficiently coordinate multi-robot teams, fast, high-quality scheduling algorithms are essential to satisfy the temporal and spatial constraints imposed by dynamic task specification and part and robot availability. Traditional solutions include exact methods, which are intractable for large-scale problems, or application-specific heuristics, which require expert domain knowledge. What is critically needed is a new, automated approach that learns lightweight, application-specific coordination policies without the need for hand-engineered features.

This tutorial is an introduction to graph neural networks and a showcase of their power in solving multi-robot coordination (MRC) problems. We survey various frameworks of graph neural networks in recent literature, with a focus on their application in modeling multi-agent systems. We will introduce MRC problems and review the most relevant methods available for solving them. We will discuss several successful applications of graph neural networks to MRC problems, with hands-on tutorials in the form of Python example code. With this tutorial, we aim to provide experience in employing graph neural networks to model multi-robot systems, from algorithm development to code implementation, thus opening up future opportunities for designing graph-based learning algorithms in broader multi-agent research.

Content

In this tutorial, we will discover the power of graph neural networks for learning effective representations of multi-robot team coordination problems. The tutorial will feature two 90-minute online sessions.

Date and Time: May 8, 11:00 – 14:00 (EDT)

  • 3:00 – 6:00, May 9, Auckland Time
  • 11:00 – 14:00, May 8, New York Time
  • 17:00 – 20:00, May 8, Paris Time
  • 20:30 – 23:30, May 8, Kolkata Time
  • 23:00 – 2:00 (+1 day) May 8, Beijing Time

Zoom link: https://gatech.zoom.us/j/93966169541 [Video recordings are posted below]

Session #1 by Matthew Gombolay (11:00 – 12:30)

The first session will address the following: (a) How graph neural networks work – we will present a comprehensive overview of various graph neural networks proposed in prior literature, including both homogeneous and heterogeneous graphs and attention mechanisms; (b) How to model team coordination problems with graph neural networks – we will discuss which applications can suitably be modeled with graph neural networks, with a focus on MRC problems; (c) How to optimize the parameters of graph neural networks for team coordination problems – we will discuss which learning methods can be used for training a graph neural network-based solver. We conclude this session with the most recurrent challenges and open questions.

Tutorial notes: [Slides]

Video:

Session #2 by Zheyuan Wang (12:30 – 14:00)

The second session will provide a hands-on tutorial on how to get up and running with graph neural networks for coordination problems, with coding examples in a Python Jupyter notebook. In particular, we will look into the ScheduleNet architecture [6], a heterogeneous graph neural network-based solver for MRC problems under temporal and spatial constraints. The Jupyter notebook will work through the model implementation, training, and evaluation of ScheduleNet models on a synthetic dataset.

Python Jupyter notebook: [Open in Google Colab] [Github]

Video:

Target Audience

The target audience is intended to be students and researchers who have a strong grasp of machine learning but who may not yet be familiar with graph neural networks. Prior knowledge of multi-agent reinforcement learning (MARL) or planning and scheduling would be helpful but is not necessary. While the tutorial showcases the application of graph neural networks to solving multi-robot coordination problems, the methodology introduced can be applied by the audience to broader research problems in learning representations of multi-agent systems.

Presenters

Dr. Matthew Gombolay is an Assistant Professor of Interactive Computing at the Georgia Institute of Technology. He received a B.S. in Mechanical Engineering from the Johns Hopkins University in 2011, an S.M. in Aeronautics and Astronautics from MIT in 2013, and a Ph.D. in Autonomous Systems from MIT in 2017. Gombolay’s research interests span robotics, AI/ML, human–robot interaction, and operations research. Between defending his dissertation and joining the faculty at Georgia Tech, Dr. Gombolay served as a technical staff member at MIT Lincoln Laboratory, transitioning his research to the U.S. Navy, which earned him an R&D 100 Award. His publication record includes a best paper award from the American Institute of Aeronautics and Astronautics, a finalist for best student paper at the American Control Conference, and a finalist for best paper at the Conference on Robot Learning. Dr. Gombolay was selected as a DARPA Riser in 2018, received 1st place for the Early Career Award from the National Fire Control Symposium, and was awarded a NASA Early Career Fellowship for increasing science autonomy in space.

Zheyuan Wang is a Ph.D. candidate in the School of Electrical and Computer Engineering (ECE) at the Georgia Institute of Technology. He received B.S. and M.E. degrees, both in Electrical Engineering, from Shanghai Jiao Tong University, and an M.S. degree in ECE from Georgia Tech. He is currently a Graduate Research Assistant in the Cognitive Optimization and Relational (CORE) Robotics lab, led by Prof. Matthew Gombolay. His current research interests focus on graph-based policy learning that utilizes graph neural networks for representation learning and reinforcement learning for decision-making, with applications to human-robot team collaboration, multi-agent reinforcement learning, and stochastic resource optimization.

Reading Materials

  1. Ernesto Nunes, Marie Manner, Hakim Mitiche, and Maria Gini. 2017. A taxonomy for task allocation problems with temporal and ordering constraints. Robotics and Autonomous Systems 90 (2017), 55–70.
  2. Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. 2018. Graph attention networks. International Conference on Learning Representations (2018).
  3. Xiao Wang, Houye Ji, Chuan Shi, Bai Wang, Yanfang Ye, Peng Cui, and Philip S Yu. 2019. Heterogeneous graph attention network. The World Wide Web Conference (2019), 2022–2032.
  4. Jie Zhou, Ganqu Cui, Shengding Hu, Zhengyan Zhang, Cheng Yang, Zhiyuan Liu, Lifeng Wang, Changcheng Li, and Maosong Sun. 2020. Graph neural networks: A review of methods and applications. AI Open 1 (2020), 57-81.
  5. Zheyuan Wang and Matthew Gombolay. 2020. Learning scheduling policies for multi-robot coordination with graph attention networks. IEEE Robotics and Automation Letters 5, 3 (2020), 4509–4516.
  6. Zheyuan Wang, Chen Liu, and Matthew Gombolay. 2021. Heterogeneous graph attention networks for scalable multi-robot scheduling with temporospatial constraints. Autonomous Robots (2021), 1–20.

 


Bootcamp Summer 2020 Week 7 – Variational Inference in Heterogeneous Apprenticeship Learning

November 6, 2021 by Rohan Paleja and rliu302

Imagine we have an autonomous car that would like to perform speedy highway driving. There are several approaches to program such an autonomous vehicle. The first would be to hand-craft a rule-based system consisting of what to do in different scenarios. An example rule could be “if the autonomous car would like to move forward and there is a car in front and to the left, switch to the right lane”. A programmer could write down hundreds of rules consisting of every possible scenario. However, this process is expensive and leaves much to be desired. The second approach would be to utilize a driving simulator and design a reward function that defines the proper autonomous driving behavior. Then, one could use reinforcement learning to learn an optimal driving policy. However, defining such a reward function is difficult. Furthermore, if the driving simulator has differences with the real world, the behavior of the autonomous car may degrade when deployed. Lastly, one can utilize learning from demonstration (LfD), where user demonstrations are used directly to learn autonomous driving behavior. LfD has many benefits, including that it can learn human-like behavior and does not require a simulator. It is also much easier for a human to give an example of driving on the highway than defining all the rules used during driving. However, what happens when two drivers present demonstrations with conflicting behaviors (e.g., one prefers to pass on the left versus the other prefers to pass on the right)?

In this post, we are going to discuss how robots can learn from heterogeneous human demonstrators, i.e., demonstrators who perform the same task in different ways.

This blog post will address the following questions in 3 parts:

  1. What challenges arise from heterogeneity in demonstrations?
  2. How can we automatically infer preferences within demonstrations?
  3. How can we utilize person-specific embeddings in policy learning?

1. Challenges from heterogeneity of demonstrations

We can imagine multiple real-life scenarios, ranging from driving to setting a dinner table, in which there are heterogeneous approaches that can be utilized to reach the same goal. Continuing the autonomous driving example from the introduction, Figure 1 displays the heterogeneity that can occur within driving demonstrations. In the example displayed, some subset of the demonstrators preferred to pass on the left while others preferred to pass on the right. LfD approaches that assume homogeneous demonstrations will either fit to the mean (drive straight into the car in front) or fit a single mode (only pass on one side), both producing suboptimal/limited driving behavior. It is important to model the heterogeneity within the Learning from Demonstration framework. We will accomplish this via variational inference for person-specific embeddings.

Heterogeneity in Highway Driving Demonstrations

The decision-making we employ to select our approach can be explicitly defined as a policy: a function that maps states $S$ to actions $A$ (Equation 1).

\begin{align}
\hat{\pi}: S \rightarrow A
\end{align}

Depending on an implicit demonstrator preference, the demonstrator may find it advantageous to choose one policy over the other. For example, a demonstrator setting the table may be left-handed and prefer to set a drinking glass on the left-hand side. Without explicitly labeling demonstrations by preference, the context within this selection process is lost to a robot attempting to learn the task, making it very difficult to learn a policy representation across all demonstrators. Without access to latent information about demonstrators, current LfD approaches sacrifice optimal behavior for consistency, weaken the robustness of the robot to different scenarios, and reduce the number of examples the robot can use to learn from.

As we consider scenarios where demonstrations are numerous, it becomes advantageous to be able to autonomously infer distinct modalities without requiring someone at hand to label each demonstration. In addition, we would also like to learn a policy representing the user demonstrations within our dataset. How can one accomplish both objectives, inferring latent information about demonstrated trajectories and learning a policy conditioned upon contextual and latent information, simultaneously?

2. Inferring trajectory modalities

We seek to autonomously infer distinct strategies or modes $\omega$ from the real-world demonstrations we are able to acquire. Following our autonomous driving example, Figure 2 displays the end result we seek to achieve: mode 1 demonstrations are assigned a specific latent code and mode 2 demonstrations are assigned a different latent code. Visually, in Figure 2 it is clear that there are only two types of demonstrations. We would like our approach to discover, first, that there are different modes within the demonstration dataset and, second, how many modes are present.

Inferred Demonstration Modalities in Highway Driving Demonstrations

To do so, we would like to maximize the mutual information $I$ between our embedding mode $\omega$ and the actions taken by our policy $\pi_\theta(a|\omega,s)$, where our policy is conditioned on the state and latent code. We display a representation of the mutual information in Equation 2, where H is the entropy function.


\begin{gather}
I(\omega;\pi(a|\omega,s)) = H(\omega) - H(\omega|\pi(a|\omega,s))
\end{gather}

Maximizing mutual information encourages $\omega$ to correlate with semantic features within the data distribution (i.e., mode discovery), which is exactly what we want! However, maximizing the mutual information between the trajectories and latent code is intractable as it requires access to the true posterior, $P(\omega|s,a)$, representing the distribution of modalities given trajectories. Instead, we can utilize a lower bound of the mutual information and maximize the lower-bound via gradient descent, thereby maximizing the mutual information and incentivizing the policy to utilize $\omega$ as much as possible.


Recall that the KL divergence is a measure of similarity between two probability distributions. We can express the KL divergence between $p$ and $q$ with Equation 3.

\begin{align}
D_{KL} \left( p (\omega | s, a) \ || \ q(\omega | s, a) \right) = \mathop{\mathbb{E}}_{\omega \sim p(\omega | s, a)} \left[ \log \frac{p(\omega | s, a)}{ q(\omega|s,a)} \right]
\end{align}

We display the derivation of the mutual information lower bound in Equation 4. Below, we provide a step-by-step walkthrough of the derivation.
\begin{align}
\begin{split}
I(\omega;\pi(a|\omega,s)) &= H(\omega) - H(\omega|\pi(a|\omega,s)) \\
&= H(\omega)-\mathop{\mathbb{E}}_{a \sim \pi(\cdot| \omega, s)}\big[\mathop{\mathbb{E}}_{\omega \sim p(\omega| s,a)} [- \log p(\omega|s,a)]\big] \\
&= H(\omega)+\mathop{\mathbb{E}}_{a \sim \pi(\cdot| \omega, s)}\big[\mathop{\mathbb{E}}_{\omega \sim p(\omega| s,a)} [\log p(\omega|s,a)]\big] \\
&= H(\omega)+D_{KL}(p(\omega|s,a)||q(\omega|s,a)) + \mathop{\mathbb{E}}_{a \sim \pi(\cdot| \omega, s)}\big[\mathop{\mathbb{E}}_{\omega \sim p(\omega| s,a)} [\log q(\omega|s,a)]\big] \\
&\geq H(\omega) + \mathop{\mathbb{E}}_{a \sim \pi(\cdot| \omega, s)}\big[\mathop{\mathbb{E}}_{\omega \sim p(\omega| s,a)} [\log q(\omega|s,a)]\big] \\
\end{split}
\end{align}

In line 1 of the derivation, we have Equation 2. To transition to line 2 of the derivation in Equation 4, we expand the second entropy term as an expectation. Line 3 of the derivation cancels the negative sign inside the inner expectation with the minus sign in front of it. Line 4 utilizes the identity $\mathop{\mathbb{E}}_{x \sim p_{X}(\cdot)}[\log p(x)] = D_{KL}(P||Q) + \mathop{\mathbb{E}}_{x \sim p_{X}(\cdot)}[\log q(x)]$ in transitioning from line 3. Here, $q$ is an approximate posterior for the true distribution of embeddings given trajectories. In transitioning to line 5, since the KL divergence between two distributions is always non-negative, we replace the equals sign with a greater-than-or-equal-to sign and remove the divergence term. We note that during optimization, we assume that the distribution of embeddings is a static, uniform distribution, and thus the entropy across latent codes, $H(\omega)$, is constant and can be removed. Line 5 displays the final result, the lower bound of the mutual information.
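For completeness, the identity used in line 4 follows directly from the definition of the KL divergence:

\begin{align}
D_{KL}\left(p(\omega|s,a) \ || \ q(\omega|s,a)\right) &= \mathop{\mathbb{E}}_{\omega \sim p(\omega | s, a)} \left[\log p(\omega|s,a) - \log q(\omega|s,a)\right] \\
\Rightarrow \mathop{\mathbb{E}}_{\omega \sim p(\omega | s, a)} \left[\log p(\omega|s,a)\right] &= D_{KL}\left(p \ || \ q\right) + \mathop{\mathbb{E}}_{\omega \sim p(\omega | s, a)} \left[\log q(\omega|s,a)\right]
\end{align}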

Notice that we have now described a substitute objective function that is expressed solely in terms of the latent mode variable $\omega$ and our approximate posterior $q$. Maximizing this objective should make different latent embeddings correspond with different policy behaviors. However, one problem remains: computing $q$ requires sampling from the true posterior $p(\omega|s,a)$, which is still unknown. Therefore, we seek to gradually tease out the true distribution through sampling while training. We denote $\omega$, the discovered latent codes, as person-specific embeddings, as each embedding captures latent features of a specific demonstrator.

3. Simultaneously learning a policy while inferring person-specific embeddings

Our goal is to learn demonstrator policies that are conditioned on person-specific embeddings, $\omega$. Thus, even if heterogeneity is present within the demonstrations, the latent encodings will be able to represent the unique behaviors among demonstrators. The resultant policy is able to adapt to a person’s unique characteristics while simultaneously leveraging any homogeneity that exists within the data (e.g., uniform adherence to hard constraints).

Accordingly, we will want to maximize our ability to imitate demonstrators via a behavior cloning loss and the mutual information between person-specific embeddings and trajectories via a variational information maximization (VIM) loss. We display a general overview of the architecture used during learning from heterogeneous demonstration in Figure 3.

Architecture to Learn from Heterogeneous Demonstrations.

In this figure, we have a policy represented by $\pi$ and parameterized by $\theta$, and an approximate posterior represented by $q$ and parameterized by $\phi$. We start with a set of person-specific embeddings initialized uniformly. As training proceeds, we infer the person-specific embeddings ($\omega$ is a learned latent encoding via gradient descent) and learn a policy representing all demonstrators. We display a sample algorithm in Figure 4 for learning $\theta$, $\phi$, and $\omega$.

Algorithm for Learning from Heterogeneous Demonstrations.

In step 1 of the algorithm, we sample from the set of demonstrators and obtain the trajectories associated with the sampled demonstrator. We initialize a new embedding, $\omega^{(i)}$, for the demonstrator. In steps 2 and 3, we conduct a forward pass through the architecture displayed in Figure 3 and update parameters $\theta$ and $\omega$ via both the behavior cloning and VIM losses. In step 4, we utilize the VIM cost function to update $\phi$. We repeat this process until convergence. Once the algorithm has converged, every demonstrator’s person-specific embedding will accurately represent their modality. Utilizing this latent embedding within the learned policy will produce a representation of the demonstrator’s behavior.
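A rough sketch of one iteration of this procedure is given below. The module interfaces, the loss weight, and the use of a squared error as a stand-in for $-\log q_{\phi}(\omega|s,a)$ (i.e., treating $q$ as a fixed-variance Gaussian) are our own simplifying assumptions, not the exact implementation from the paper.

```python
import torch
import torch.nn.functional as F

def training_step(policy, posterior, omega, trajectories, opt_policy_omega, opt_posterior, lam=1.0):
    """One pass of behavior cloning + variational information maximization for a sampled demonstrator.

    policy:    pi_theta(a | s, omega), returns action logits
    posterior: q_phi(omega | s, a), returns a predicted embedding
    omega:     this demonstrator's person-specific embedding, a learnable tensor of shape (1, embed_dim)
    """
    states, actions = trajectories                                    # tensors for one demonstrator
    logits = policy(states, omega.expand(states.shape[0], -1))
    bc_loss = F.cross_entropy(logits, actions)                        # imitate the demonstrator

    one_hot_actions = F.one_hot(actions, logits.shape[1]).float()
    omega_hat = posterior(states, one_hot_actions)
    vim_loss = F.mse_loss(omega_hat, omega.expand_as(omega_hat))      # surrogate for -log q_phi(omega | s, a)

    opt_policy_omega.zero_grad()
    opt_posterior.zero_grad()
    (bc_loss + lam * vim_loss).backward()
    opt_policy_omega.step()                                           # updates theta and omega
    opt_posterior.step()                                              # updates phi
```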

In conclusion, the main takeaways from our blog post are as follows:

  1. It is important to model the heterogeneity when Learning from Heterogeneous Demonstration (LfHD). We present a framework that can adapt to a person’s unique characteristics while simultaneously leveraging any homogeneity within the data.
  2. We infer person-specific latent embeddings with semantic meaning by maximizing the lower bound of the mutual information between an inferred latent variable and a policy that is conditioned upon the latent variable.

Further Details

If you want to explore more on this, feel free to check our recent NeurIPS paper.


Bootcamp Summer 2020 Week 3 – Value Iteration and Q-learning

January 19, 2021 by Matthew Gombolay and Pradyumna Tambwekar

Value iteration and Q-learning make up two fundamental algorithms of Reinforcement Learning (RL). Many of the amazing feats in RL over the past decade, such as Deep Q-Learning for Atari, or AlphaGo, were rooted in these foundations. In this blog, we will cover the underlying model RL uses to describe the world, i.e. a Markov decision process (MDP), as well as two algorithms for performing RL: value iteration and Q-learning. By the end of this blog post, you should be able to understand the connection between value iteration and Q-learning and how to employ either of these algorithms to learn the optimal policy.

This post is divided into three parts:

  1. Markov decision processes
  2. Value functions and Value iteration
  3. Q-learning and Deep Q-learning

Prerequisites:

The only prerequisites you will need to understand this blog are gradient descent and backpropagation. Both of these topics were covered in our first two blog posts (week 1 and week 2).

Markov Decision Processes

An MDP is a 5-tuple describing an environment consisting of the following five components:

  1. States: A state, $s$ is any predefined momentary instance in the world that an agent can exist in. For the rest of this post, we will use the variable $S$ to represent the set of all possible states in the world with $s \in S$ referring to an individual state.
  2. Actions: An action, $a$, is an event facilitated by the agent that can transition you from one state to another, provided that such a transition is possible in the MDP. We will use $A$ to represent the set of all possible actions in the world, with $a \in A$ referring to an individual action. We note that actions may not have deterministic consequences. For example, flipping a coin may not give you the same result each time! The degree to which actions have deterministic effects is described by the transition function.
  3. Transition function: The transition function is a function that defines the probability of moving to a specific next state, given your current state and a valid action. The transition function, $T$, is mathematically defined as follows: $T: S \times A \times S' \rightarrow [0,1]$.
  4. Reward: The reward function specifies a real number value that defines the efficacy or a measure of “goodness” for being in a state, taking an action, and landing in the next state. Similarly to the transition function, the reward is defined as follows: $R: S \times A \times S' \rightarrow \rm I\!R$. Note that the state you end up in may be uncontrollable since state transitions can be dynamic.
  5. Discount Factor: The discount factor can be specified using $\gamma$, where $\gamma \in [0,1)$. Note the non-inclusive upper bound for the discount factor (i.e., $\gamma \neq 1$). Disallowing $\gamma = 1$ allows for an MDP to be more mathematically robust. Specifically, the goal for RL algorithms is often to maximize discounted reward across time. Consider the case of an infinite-horizon MDP (i.e., the MDP never ends) in which the rewards are always positive. If the discount factor, $\gamma$, is equal to 1, then the sum of future discounted rewards will be infinite, making it difficult for RL algorithms to converge (i.e., to know when they can stop determining which actions should be taken in each state). Thus, the choice of gamma is critical for the success of RL algorithms, e.g., Q-learning and value iteration.

You may have read about a concept known as Markov chains. While Markov chains will not be covered in this post, it is important to understand the distinction between Markov chains and Markov decision processes, as both concepts share the Markov property. Fundamentally, a Markov chain consists of the components of a Markov decision process except actions, rewards, and a discount factor. An agent in a Markov chain is not in control of its movements; the world is in control of the agent’s movements. In a Markov decision process, the agent has an influence on the outcomes. Thus the transition function of a Markov chain is simply defined as $T: S \times S' \rightarrow [0,1]$.

Markov Property: The Markov property holds when the next state depends only on the current state and current action and is independent of the history of states and actions prior to that.

Nota bene: In some MDPs, you may see the initial state distribution included as a part of the MDP. While that is still a valid representation, in this blog, we are going to remain agnostic of the initial state distribution.

Value functions and Value Iteration

The value function, $V(s)$, is a measure of the expected reward you can receive from any given state $s$ given an MDP and a policy describing which actions an agent takes in each state. Formally, a policy, $\pi: S \rightarrow [0,1]^{|A|}$, is a probability distribution over actions, $a \in A$, conditioned on a state, $s$. For the sake of this blog, we will consider a policy to be deterministic (i.e., $\pi(s,a) = 1$ for a single action, $a$, and $\pi(s,a’)=0$ for all other $a’ \neq a$). Having defined a policy, we can now mathematically define the value function for a policy: $$V^{\pi}(s) = E[\sum_{t=0}^{\infty} \gamma^t r_t | \pi, s_0 = s]$$ In a deterministic world, this expectation could be ignored; however, in general, we include the expectation as both the policy (e.g., an agent that doesn’t always like the same thing for breakfast) and the transition function (e.g., how bad traffic is) could describe nondeterministic attributes of the problem we want to model.

The goal of Value Iteration is to find a policy that maximizes the value function: $$\pi^*(s) = argmax_{\pi \in \Pi} V^{\pi}(s), \forall s \in S $$ In this equation $\Pi$ represents the set of all possible policies within the MDP, and $\pi^*(s)$ is the optimal policy. What the value function equation (i.e., $V^{\pi}(s) = E[\sum_{t=0}^{\infty} \gamma^t r_t | \pi, s_0 = s]$) states is that you are going to follow a policy, $\pi$, starting at state, $s$, for up to an infinite number of time steps, $t$, proceeding forward from $t=0$ to $t = \infty$ along a trajectory (i.e., a sequence of state-action pairs, $(\langle s_1,a_1 \rangle , \langle s_2,a_2 \rangle,\ldots)$), where the accumulated reward, $r_t$, is discounted via multiplication with $\gamma^t$. This process of discounting rewards means that, the further you go into the future, rewards have a diminishing impact on value or “goodness” of being in state, $s$. Adjusting $\gamma$ allows you to define your preferences towards optimizing for short-term vs long-term gain. The expectation matters because, as stated in the previous paragraph, transitions can be non-deterministic, thus the expectation provides a normalized estimate of the discounted reward.

Value iteration is a computational algorithm that provides a means of finding the optimal policy, $\pi^*$. The algorithm works by iteratively determining the value of being in each state, $s,$ assuming that the agent takes the best possible action in that state under the current estimate of the value function. The value iteration algorithm is shown in the figure below.

Algorithm 1: Value Iteration

This algorithm iteratively updates the value, $V(s)$, of each state, $s$, until it reaches a point where the change in value is negligible, as defined by the threshold $\Delta_0$. Take note of Line 7 of this algorithm, in which we consider a state, $s$, an action, $a$, and the resulting state, $s'$, in a sequence, $\langle s, a, s' \rangle$. Line 7, called the value update, updates the value of the given state and is reminiscent of Bellman's optimality condition. This condition states $$ V(s) = \max_{a \in A} E [R(s,a,s') +  \gamma V(s')] = \max_{a \in A} \sum_{s'\in S} T(s,a,s')[R(s,a,s') +  \gamma V(s')]$$

When this condition holds for a given state, $s$, we find in Line 8 that $temp - V(s) = 0$. If the condition holds for all states, $s$, then the value function has converged to the optimal value function, $V^{*} = V^{\pi^*}$, and the optimal policy, $\pi^*$, can simply be extracted in Line 11. In a sense, value iteration works by iteratively enforcing (Line 7) that Bellman's optimality condition holds, measuring how far the condition is from holding (Line 8), and terminating once the condition holds within an acceptable margin of error (Line 3).

Upon convergence (i.e., reaching an acceptable level of error, $\Delta < \Delta_0$), value iteration returns a value function, $V^{*}$, that provides an estimate of the future expected reward while following the optimal policy, $\pi^*$, as well as the optimal policy itself. To be perfectly clear, we should note that the actual value function and policy returned are only approximately optimal, and it would be more technically accurate to refer to them as $V \approx V^*$ and $\pi \approx \pi^*$.
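To make Algorithm 1 concrete, here is a minimal Python sketch of tabular value iteration under the conventions above. The dictionary-based transition and reward models (`T` and `R`, keyed by $(s, a, s')$), the default threshold, and all variable names are illustrative assumptions rather than the exact contents of the algorithm figure.

```python
# Minimal sketch of tabular value iteration (illustrative, not the exact
# pseudocode from the algorithm figure above).
def value_iteration(states, actions, T, R, gamma=0.9, delta_0=1e-6):
    """
    states, actions : iterables of discrete states / actions
    T[(s, a, s2)]   : probability of landing in s2 after taking a in s
    R[(s, a, s2)]   : reward for that transition
    Returns an approximately optimal value function V and greedy policy pi.
    """
    V = {s: 0.0 for s in states}           # initialize V(s) arbitrarily (here: 0)
    while True:
        delta = 0.0                        # largest change in V during this sweep
        for s in states:
            # Bellman optimality backup: max over actions of the expected return
            temp = max(
                sum(T.get((s, a, s2), 0.0) * (R.get((s, a, s2), 0.0) + gamma * V[s2])
                    for s2 in states)
                for a in actions
            )
            delta = max(delta, abs(temp - V[s]))
            V[s] = temp
        if delta < delta_0:                # stop once the update is negligible
            break
    # Extract an (approximately) optimal greedy policy from V
    pi = {
        s: max(actions, key=lambda a: sum(
            T.get((s, a, s2), 0.0) * (R.get((s, a, s2), 0.0) + gamma * V[s2])
            for s2 in states))
        for s in states
    }
    return V, pi
```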

Nota bene: In some textbooks, the Bellman update equation may be alternatively defined by considering the reward function to be independent of the next state, $s'$, i.e., $$V(s) \leftarrow  \max_a \left[ R(s,a) + \gamma \sum_{s'} T(s,a,s') V(s') \right]$$ However, the update rule shown in the value iteration algorithm above is a more general representation of the value estimate, because you cannot always assume that, for any given MDP, your reward will not depend on the next state.

Let us now define two related terms: the Q-value and the Q-function. Like the value function, the Q-function is a measure of quality. Unlike the value function, however, the Q-function measures the expected discounted reward for taking a particular action in a given state, i.e., $Q: S \times A \rightarrow \rm I\!R$. We refer to the value returned by the Q-function as the Q-value.

Although Q-learning is the go-to algorithm for estimating Q-values and consequently optimal policies from these Q-values, you may also modify the value iteration algorithm to solve for the Q-values.

Algorithm 2: Value Iteration (to compute Q-values)

There are two primary differences between the two algorithms, both resulting from the introduction of the Q-function. First, since the Q-function depends on both states and actions, we need to loop over all possible actions (Line 6) as well as over the set of all states. Second, you'll notice that the update rule (Line 8) has changed to incorporate the Q-function. Since the estimate of the optimal Q-value, $Q(s',a')$, for the next state, $s'$, depends on the next action, $a'$, we need to replace "$V(s')$" from Line 7 of Algorithm 1 with "$\max_{a'} Q(s',a')$" in Line 8 of Algorithm 2.
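To highlight only what changes relative to the value-function sketch above, here is a minimal, illustrative Q-value variant of the backup; the data structures mirror the earlier sketch and are our own assumptions, not the exact contents of Algorithm 2.

```python
# Illustrative Q-value backup: the only substantive change from the V-based
# sketch is the extra loop over actions and bootstrapping with
# max_{a'} Q(s', a') instead of V(s').
def q_value_iteration(states, actions, T, R, gamma=0.9, delta_0=1e-6):
    Q = {(s, a): 0.0 for s in states for a in actions}
    while True:
        delta = 0.0
        for s in states:
            for a in actions:                                # extra loop over actions
                temp = sum(
                    T.get((s, a, s2), 0.0) * (
                        R.get((s, a, s2), 0.0)
                        + gamma * max(Q[(s2, a2)] for a2 in actions)  # max_{a'} Q(s', a')
                    )
                    for s2 in states
                )
                delta = max(delta, abs(temp - Q[(s, a)]))
                Q[(s, a)] = temp
        if delta < delta_0:
            break
    return Q
```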

We note that Algorithms 1 and 2 are often referred to as flavors of "exact RL" because, unlike in the function approximation algorithms commonly used in RL (e.g., deep learning-based RL), here we are guaranteed to be able to solve for the true, optimal Q-values within a margin of error, $\Delta$, given sufficient computational resources. "Approximate" RL techniques (e.g., deep RL) do not typically have such guarantees.

An important question to answer, then, is, "Why do we not use this in practice instead of Q-learning?" Unfortunately, most real-world applications of RL contain too many states to iterate over even once, let alone the multitude of times required for the algorithm to converge to an acceptable error threshold. Hence, function approximation algorithms, such as Deep Q-learning, are preferred.

Q-Learning and Deep Q-learning

Prior to discussing Q-learning itself, let us define a term called the Bellman Residual.

$$ \delta(s,a,s') = [R(s,a,s') +  \gamma \max_{a'}Q(s',a')] - Q(s,a) $$

The Bellman residual, denoted by $\delta$, measures how wrong the current estimate of the Q-function is for a given transition sequence, $\langle s,a,s' \rangle$. When the Q-function is optimal, we should find that the following modified form of the Bellman optimality condition holds: $$Q(s,a) = E[R(s,a,s') + \gamma \max_{a'\in A} Q(s',a')] = \sum_{s'} T(s,a,s')[R(s,a,s') + \gamma \max_{a'\in A} Q(s',a')]$$

When this condition holds, we should find that $ \delta(s,a,s') = 0 $ for all possible transition sequences. However, when we first initialize a Q-function, as we do in Line 2 of Algorithm 3, our Q-function will almost certainly be wrong, and $ \delta(s,a,s')$ will not equal zero. So, how do we improve our Q-function to be correct?

Q-learning works by "blending" our current estimate of the Q-function, $Q(s,a)$, with a "point estimate" of what the Q-function should be, i.e., $R(s,a,s') +  \gamma \max_{a'}Q(s',a')$. The degree of blending is controlled by a hyperparameter, $\alpha$, in the following equation: $$ Q(s,a) \leftarrow (1 - \alpha) Q(s,a) + \alpha (R(s,a,s') +  \gamma \max_{a'}Q(s',a')) $$

If $\alpha = 1$, then we completely ignore our current estimate of the Q-function. If $\alpha=0$, then we completely ignore the new information. A decent starting place for setting $\alpha$ in practice is to choose $\alpha = 0.99$, which puts most of the emphasis on the new point estimate rather than on the current estimate of the Q-function. However, we encourage you to experiment with this parameter!
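As a quick, illustrative aside (the function and variable names here are our own, not taken from the algorithm figure), the Bellman residual and the blended update for a single observed transition can be written as:

```python
# Illustrative tabular Q-learning update for one observed transition (s, a, r, s2).
# Q is a dict mapping (state, action) pairs to floats; `actions` is the action set.
def q_update(Q, s, a, r, s2, actions, alpha, gamma=0.9):
    target = r + gamma * max(Q[(s2, a2)] for a2 in actions)  # "point estimate" of Q(s, a)
    delta = target - Q[(s, a)]                                # Bellman residual
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target      # blend old estimate with new
    return delta                                              # handy for monitoring convergence
```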

The full algorithm for Q-learning is shown in the algorithm pictured below.

Algorithm 3: Q-Learning

It is important to note that, unlike in the version of the Bellman equation described earlier, Q-learning does not include a transition function in its Bellman update equation! In Q-learning, we instead simulate actions within the environment and use the states visited during the simulation (i.e., the trajectory) to apply the Bellman equation. The simulate function in the algorithm shows how trajectories are sampled in Q-learning. At each timestep of the trajectory, we use an $\epsilon$-greedy sampling strategy to select an action: if a randomly sampled probability is greater than a predefined $\epsilon$ value, then we greedily select the action with the highest current Q-value; otherwise, we randomly sample an action from the space of all actions. Upon selection, we simulate that action and observe the next state and the reward received for that action. Finally, we store this observed information in the memory buffer, update the current state, and repeat this process until we reach a terminal state. The trajectory computed from each simulation is then used to update the Q-values via the Bellman update equation (Line 6 of Algorithm 3).
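A compressed sketch of this sampling-and-updating loop is given below. The environment interface (`env.reset()` and `env.step()`), the episode count, and all variable names are illustrative assumptions; for brevity, the sketch applies the Bellman update to each transition as it is observed rather than first storing a full trajectory in a memory buffer.

```python
import random

# Illustrative Q-learning loop with epsilon-greedy action selection (not the
# exact pseudocode of Algorithm 3). `env` is assumed to expose
# reset() -> state and step(action) -> (next_state, reward, done).
def q_learning(env, states, actions, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = {(s, a): 0.0 for s in states for a in actions}       # initialize the Q-table
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:                                      # roll out one trajectory
            if random.random() > epsilon:                    # exploit: greedy action
                a = max(actions, key=lambda act: Q[(s, act)])
            else:                                            # explore: random action
                a = random.choice(actions)
            s2, r, done = env.step(a)                        # simulate and observe
            # Bellman update on the sampled transition; no transition function needed
            target = r if done else r + gamma * max(Q[(s2, a2)] for a2 in actions)
            Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
            s = s2
    return Q
```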

The absence of a transition function makes Q-learning a model-free RL algorithm, as it does not need any prior knowledge of "the world" to learn the optimal policy. This model-free characteristic is significant because, in practice, you rarely have access to the transition probabilities for all state-action pairs in real-world applications.

With enough time and sampled trajectories, Q-learning will be able to estimate the optimal Q-values for each state-action pair. However, as currently constructed, the Q-learning algorithm requires a lookup table over every possible state-action pair and requires filling in all of those values. The size of the state-action spaces in real-world tasks makes maintaining such a lookup table infeasible. One possible solution to this lack of computational resources is the use of function approximation techniques, e.g., Deep Q-learning. The term "deep" comes from parameterizing Q-values using neural networks. Therefore, instead of learning a literal table of Q-values, our goal shifts towards learning the weights of a neural network that can output the Q-value for any given state-action pair. Employing the square of the Bellman residual as our loss function, we can apply the backpropagation algorithm (covered in Week 2) to learn the weights of the Q-network.

Deep Q-learning
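As a rough sketch of this idea (and not of the exact architecture or training loop in the figure above), a Q-network can be trained by minimizing the squared Bellman residual on batches of sampled transitions. The use of PyTorch, the network sizes, and all names below are our own illustrative assumptions.

```python
import torch
import torch.nn as nn

# Illustrative Q-network: maps a state vector to one Q-value per discrete action.
q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))  # e.g., 4-D state, 2 actions
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
gamma = 0.99

def dqn_update(states, actions, rewards, next_states, dones):
    """One gradient step on the squared Bellman residual for a batch of transitions."""
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)    # Q(s, a) for the taken actions
    with torch.no_grad():                                              # treat the point estimate as a fixed target
        max_next_q = q_net(next_states).max(dim=1).values
        target = rewards + gamma * max_next_q * (1 - dones)            # no bootstrapping past terminal states
    loss = ((target - q_sa) ** 2).mean()                               # squared Bellman residual
    optimizer.zero_grad()
    loss.backward()                                                    # backpropagation through the Q-network
    optimizer.step()
    return loss.item()
```

In practice, Deep Q-learning implementations typically add further machinery, such as a replay buffer and a separate target network, which we leave for the future blog mentioned below.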

When compared to Q-learning, there are three primary reasons why Deep Q-learning is better suited for real-world applications:

  1. Not enough memory: Q-learning requires storing a lookup table of all states and actions, leading to an infeasible RAM requirement for most modern applications.
  2. Not enough time: Deep Q-learning is often significantly faster to train because of the ability to incorporate techniques like batched gradient descent and adaptive optimization. In addition, Deep Q-learning does not require convergence of the Q-value for each individual state-action pair.
  3. Interpolation: Deep Q-learning is a function approximation algorithm. Because the Q-network approximates Q-values for states and actions, we can assume that the network can interpolate Q-values for similar state-action pairs, i.e., $ Q(s,a) \approx Q(s',a')$ if $(s,a) \approx (s',a')$. In practice, this makes Deep Q-learning more efficient than seeking convergence for each individual state-action pair. However, there are pathological cases where this assumption does not hold, for example, when the slightest change in the state or action results in a drastically different outcome.

While we have not yet covered enough to fully implement Deep Q-learning and understand the tips and tricks involved in practically implementing a Deep Q-learning model, please stay tuned for a future blog on that topic!

Key Takeaways from this blog:

  1. A Markov Decision Process (MDP) is a tuple used to describe an environment and is made up of the components used in reinforcement learning algorithms.
  2. The goal of most reinforcement learning algorithms is to maximize the expected cumulative discounted reward, which is captured by the value function.
  3. Value iteration is an iterative algorithm that uses the Bellman equation to compute the optimal MDP policy and its value.
  4. Q-learning, and its deep learning counterpart Deep Q-learning, are model-free RL algorithms that learn the optimal MDP policy using Q-values, which estimate the "value" of taking an action in a given state.

