CORE Robotics Lab

Bootcamp Summer 2023 Week 1 – Projected & Mirror Gradient Descent

July 30, 2023 by Byeolyi Han, Arthur Scaquetti do Nascimento, Rohan Paleja and Matthew Gombolay


Gradient descent is a ubiquitous optimization algorithm used to train machine learning (ML) models, from simple linear regression models to sophisticated transformer architectures. ML researchers are typically introduced to unconstrained optimization problems (e.g., least-squares, LASSO, or Ridge regression). However, many real-world ML problems are constrained optimization problems. For example, in robotics, we may want to train a robot to maneuver around a crowd by imitating human motion (e.g., Behavior Cloning, where the robot minimizes the divergence between the distribution of actions the robot takes and the distribution of actions the human would take in the same state of the world, $x$) subject to the constraint that the robot will never crash into a person.

In this blog post, we will look at some powerful approaches for performing constrained optimization. First, we quickly recap the concept of gradient descent. Next, we’ll delve into Projected Gradient Descent, a technique that introduces constraints to the optimization process. Finally, we’ll uncover the magical world of Mirror Gradient Descent, a method that combines the power of duality and optimization. For an example of how Mirror Gradient Descent has been used, check out these papers:

  • Montgomery, W.H. and Levine, S., 2016. Guided policy search via approximate mirror descent. Advances in Neural Information Processing Systems, 29.
  • Zhang, M., Geng, X., Bruce, J., Caluwaerts, K., Vespignani, M., SunSpiral, V., Abbeel, P. and Levine, S., 2017, May. Deep reinforcement learning for tensegrity robot locomotion. In 2017 IEEE international conference on robotics and automation (ICRA) (pp. 634-641). IEEE.
  • Cheng, C.A., Yan, X., Wagener, N. and Boots, B., 2018. Fast policy learning through imitation and reinforcement. arXiv preprint arXiv:1805.10413.

Recap: Gradient descent
Imagine we are trying to train a neural network with a parameter \(\theta\). Given \(\theta\) and an input \(x\), the neural network produces the prediction \(f_\theta(x)\). We have a cost function \(\mathcal{L}(\theta)\). Our goal is to find the optimal parameter \(\theta^*\) that minimizes \(\mathcal{L}(\theta)\). Let’s consider the standard \(L^2\)-norm loss function with respect to our dataset \(D\), defined as follows:
\[\mathcal{L}(\theta) = \Sigma_{(x,y) \in D} \frac{1}{2} ||y-f_\theta (x)||^2 .\]

Starting from an initial point \(\theta_0\), Gradient Descent performs the following update at every timestep:
\[\theta_{t+1} \leftarrow \theta_t -\alpha \nabla_\theta \mathcal{L}(\theta_t).\]

Note that we are going opposite to the gradient direction (minus sign in the gradient) since we are “minimizing” the loss function. When the dataset is too large to compute the gradient at every timestep, we use Stochastic Gradient Descent which computes the gradient with respect to a single example, or Mini-batch Gradient Descent which computes the gradient with respect to a small batch sampled from the dataset at every timestep.
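To make the update rule concrete, here is a minimal NumPy sketch of mini-batch gradient descent on the squared-error loss above. The linear model \(f_\theta(x) = \theta^T x\), the synthetic dataset, and all hyperparameter values are illustrative assumptions for this post, not a prescribed implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic dataset (illustrative assumption): y = 3*x + noise.
X = rng.normal(size=(100, 1))
y = 3.0 * X[:, 0] + 0.1 * rng.normal(size=100)

def loss(theta, X, y):
    """L(theta) = sum over (x, y) in D of 1/2 ||y - f_theta(x)||^2, with f_theta(x) = theta^T x."""
    residual = y - X @ theta
    return 0.5 * np.sum(residual ** 2)

def grad(theta, X, y):
    """Gradient of the squared-error loss with respect to theta."""
    residual = y - X @ theta
    return -X.T @ residual

theta = np.zeros(1)
alpha = 0.01  # step size

for t in range(500):
    # Mini-batch gradient descent: estimate the gradient from a random batch of 16 examples.
    idx = rng.choice(len(X), size=16, replace=False)
    theta = theta - alpha * grad(theta, X[idx], y[idx])

print(theta, loss(theta, X, y))  # theta approaches 3.0
```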

For more information such as time complexity and convergence rate, please refer to our previous blog post: Bootcamp Summer 2020 Week 1 – Gradient Descent.

Constrained Optimization Problem
In many real-world scenarios, optimization problems come with constraints or boundaries that the resulting parameters must satisfy. These constraints can represent physical limitations, practical considerations, or desired properties of the solutions. For example, consider a manufacturing company that produces various products. The company aims to maximize the overall utility derived from these products, taking into account factors such as customer demand, quality, and profitability. However, the company faces a budget constraint, which limits the total amount of money available for production.

To formulate this as a constrained optimization problem, let’s first define the following variables:
$x_1, \cdots, x_n$: The quantities of each product to be manufactured.
$U(x_1, \cdots, x_n)$: The utility function that quantifies the overall satisfaction derived from the products.
$c_1, \cdots, c_n$: The respective costs of producing each product.

The objective is to maximize $U(x_1, x_2, \cdots, x_n)$ while satisfying the constraint of the total budget, $B$. Mathematically, the constrained optimization problem can be formulated as:

\begin{align*}
\max_{x_1, \cdots, x_n} \text{ } & U(x_1, \cdots, x_n)\notag\\
\text{subject to } & \Sigma_{i=1}^n c_i x_i \leq B.\notag
\end{align*}

This example highlights the need for algorithms to solve such constrained optimization problems that arise in various fields, such as resource allocation, portfolio optimization, and image reconstruction.

Projected Gradient Descent

How can we incorporate constraints into the optimization process so that we balance optimizing the objective function against keeping the parameters feasible? Projected Gradient Descent (PGD) addresses this challenge by ensuring that the iteratively updated parameters belong to the feasible region defined by the constraints.

The key idea behind PGD is to project the current solution onto the feasible region at each iteration. This projection step involves mapping the updated parameter onto the closest point within the feasible region in case it lands outside of it, effectively “projecting” it onto the constraint boundaries. By doing so, PGD guarantees that the resulting parameters remain feasible throughout the optimization process.

To illustrate, let’s consider a constraint \(A\theta = b\) for our optimization problem. Assume \(Q\) is the feasible set for the constraint (the red region in Fig. 1) and \(P\) is the projection function onto \(Q\):

\[Q = \{\theta|A\theta = b\},\]
\[P: \theta' \mapsto arg\,min_{\theta \in Q} \frac{1}{2} ||\theta - \theta'||^2 .\]

Projected gradient descent is defined as follows:

\[\theta_{t+1} \leftarrow P(\theta_t - \alpha_t \nabla_\theta \mathcal{L}(\theta_t)).\]

Projected gradient descent. The blue arrow denotes the usual gradient descent step on the loss function. The red dotted line denotes the projection onto the feasible set.

For the step size \(\alpha_t = \frac{R/G}{\sqrt{t}}\), it is known that if \(R \geq ||\theta_0 - \theta^*||_2\) and the (sub)gradients of \(\mathcal{L}\) are bounded by a constant \(G\), i.e., \(G \geq ||g||_2\), \(\forall g \in \frac{\partial \mathcal{L}}{\partial \theta}\), then the inequality \(\mathcal{L}(\theta_t) - \mathcal{L}(\theta^*) \leq C \frac{RG}{\sqrt{t}}\) holds for some constant \(C\).
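As a concrete (and entirely illustrative) example, the sketch below runs PGD on a small quadratic loss subject to an affine constraint \(A\theta = b\); for this constraint set, the projection \(P\) has the closed form \(P(\theta') = \theta' - A^T (A A^T)^{-1} (A\theta' - b)\). The specific \(A\), \(b\), loss, and step-size schedule are assumptions made only for this demo.

```python
import numpy as np

# Illustrative problem (assumption): minimize 1/2 ||theta - c||^2 subject to A theta = b.
A = np.array([[1.0, 1.0]])   # constraint: theta_1 + theta_2 = 1
b = np.array([1.0])
c = np.array([3.0, -1.0])    # unconstrained minimizer

def grad_L(theta):
    return theta - c

def project(theta):
    """Euclidean projection onto Q = {theta | A theta = b}:
    P(theta') = theta' - A^T (A A^T)^{-1} (A theta' - b)."""
    correction = A.T @ np.linalg.solve(A @ A.T, A @ theta - b)
    return theta - correction

theta = np.zeros(2)
for t in range(1, 1001):
    alpha_t = 0.1 / np.sqrt(t)                          # decaying step size
    theta = project(theta - alpha_t * grad_L(theta))    # gradient step, then projection

print(theta)  # converges toward the constrained optimum [2.5, -1.5]
```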

Mirror Gradient Descent

Mirror Gradient Descent (MGD) is a fascinating optimization technique that combines the power of optimization with the concept of duality. It draws inspiration from convex duality, a fundamental concept in convex optimization that establishes a relationship between primal and dual problems. MGD exploits this duality to transform the original optimization problem into a dual problem and solves it using a gradient descent-like approach.

In MGD, the optimization process involves iteratively updating the primal parameter in the primal space, while simultaneously updating the corresponding dual parameter in the dual space. The updates are performed by taking steps in the direction opposite to the gradients of the primal and dual objectives. This dual update is analogous to a mirror reflection of the primal update, which is why the algorithm is called “Mirror Gradient Descent.”

The motivation to work in the dual space is that it can often be simpler or have desirable geometric or algebraic properties, leading to more efficient optimization. MGD has been applied to various domains, including convex optimization, machine learning, and signal processing. It offers a powerful framework to tackle optimization problems with complex constraints or objectives, leveraging the interplay between primal and dual updates to achieve efficient convergence and desirable parameters.

Before we dig into MGD more, let’s first define the Bregman Divergence of a function \(h\), denoted \(D_h (x'||x)\).

\[D_h (x'||x):= h(x')-h(x)-\langle \nabla h(x), x'-x \rangle \]
(\(\langle \cdot,\cdot \rangle\) denotes the dot product.)

If we take the first-order Taylor expansion of a function \(h\), the tangent line at \(x\) passes through the point \((x', h(x) + \langle \nabla h(x), x'-x \rangle) \). The Bregman Divergence \(D_h (x'||x)\) is the gap between \(h(x')\) and \(h(x) + \langle \nabla h(x), x'-x \rangle \). It becomes larger when \(h\) curves away from its tangent more sharply (i.e., is more strongly convex) between \(x\) and \(x'\).

Bregman Divergence.
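As a quick numerical sanity check (our own illustrative sketch, not part of the original post), the snippet below evaluates the Bregman Divergence for two choices of \(h\) that appear in the examples later in this post: for \(h(x)=\frac{1}{2}||x||^2\) it reduces to the squared Euclidean distance, and for the negative entropy it reduces to the KL divergence between two distributions.

```python
import numpy as np

def bregman(h, grad_h, x_prime, x):
    """D_h(x' || x) = h(x') - h(x) - <grad h(x), x' - x>."""
    return h(x_prime) - h(x) - grad_h(x) @ (x_prime - x)

x  = np.array([0.2, 0.3, 0.5])   # arbitrary points on the probability simplex
xp = np.array([0.1, 0.6, 0.3])

# h(x) = 1/2 ||x||^2  ->  D_h(x'||x) = 1/2 ||x' - x||^2
h_sq, grad_h_sq = lambda v: 0.5 * v @ v, lambda v: v
print(bregman(h_sq, grad_h_sq, xp, x), 0.5 * np.sum((xp - x) ** 2))

# h(x) = sum_i x_i ln(x_i) - x_i  ->  D_h(x'||x) = KL(x' || x) when both sum to one
h_ent, grad_h_ent = lambda v: np.sum(v * np.log(v) - v), lambda v: np.log(v)
print(bregman(h_ent, grad_h_ent, xp, x), np.sum(xp * np.log(xp / x)))
```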

Mirror gradient descent is defined as follows:
\[\theta_{k+1} \leftarrow arg\,min_\theta \{\langle \alpha_k \nabla \mathcal{L}(\theta_k), (\theta - \theta_k) \rangle + D_h (\theta||\theta_k)\}. \]

The idea of mirror gradient descent is that we first transform (with \(\nabla h(\theta_k)\)) the primal space of \(\theta\) into the dual space of \(\phi\), which possibly has an easier geometry, update the parameter there, and then apply the inverse transform to come back to the primal space.

Mirror gradient descent. The blue dotted line denotes the projection onto the feasible set of the primal space.

Now, let’s derive the update: we take a mirrored gradient step in the dual space and then map the result back (with \((\nabla h)^{-1} (\phi_{k+1})\)) to the primal space. Setting the gradient of the objective above, with the Bregman Divergence expanded, to zero at \(\theta_{k+1}\) gives:

\begin{align*}
0 & = \frac{\partial}{\partial \theta} \{\alpha \langle \nabla \mathcal{L}(\theta_k), \theta_{k+1}-\theta_k \rangle + h(\theta_{k+1}) – h(\theta_k) – \langle \nabla h(\theta_k), \theta_{k+1} – \theta_k \rangle \} \\
& = \alpha \nabla \mathcal{L}(\theta_k) + \nabla h(\theta_{k+1}) – \nabla h(\theta_k)
\end{align*}

Finally, we get the update equation for mirror gradient descent:
\[\nabla h(\theta_{k+1}) = \nabla h(\theta_k) - \alpha \nabla \mathcal{L}(\theta_k) \]
\[ \theta_{k+1} \leftarrow (\nabla h)^{-1} [\nabla h(\theta_k) - \alpha \nabla \mathcal{L}(\theta_k)]. \]

Note that for constrained optimization problems, the projection onto the feasible set can be applied afterward.

Example 1. For \(h(\theta)=\frac{1}{2}||\theta||^2\), Bregman Divergence becomes \(D_h(\theta||\theta_k) = \frac{1}{2} ||\theta-\theta_k||^2 \). Substituting Bregman Divergence and rewriting the update equation, we get

\[\theta_{k+1} \leftarrow arg\,min_\theta \{ \langle \alpha_k \nabla_\theta \mathcal{L}(\theta_k), (\theta - \theta_k)\rangle + \frac{1}{2} ||\theta - \theta_k||^2 \}.\]
Then, denoting the objective inside the \(arg\,min\) by \(f\) and setting its gradient to zero, we have

\begin{align*}
0 & = \nabla_\theta f = \frac{\partial f}{ \partial \theta} \\
& = \frac{\partial}{\partial \theta} \{\langle \alpha_k \nabla_\theta \mathcal{L}(\theta_k), (\theta – \theta_k) \rangle + \frac{1}{2} ||\theta – \theta_k||^2\} \\
& = \frac{\partial}{\partial \theta} \langle \alpha_k \nabla_\theta \mathcal{L}(\theta_k), \theta \rangle – \frac{\partial}{\partial \theta} \langle \alpha_k \nabla_\theta \mathcal{L}(\theta_k), \theta_k \rangle + (\theta-\theta_k) \\
& = \alpha_k \nabla_\theta \mathcal{L}(\theta_k) + (\theta-\theta_k)
\end{align*}

which corresponds to the basic gradient descent \(\theta_{k+1} = \theta_k - \alpha_k \nabla_\theta \mathcal{L}(\theta_k) \). Indeed, gradient descent is a special case of mirror gradient descent with this \(h\). Another way to see this is that for \(h(\theta)=\frac{1}{2}||\theta||^2\), \(\nabla_\theta h(\theta) = \theta\) and \((\nabla_\theta h)^{-1}(\phi) = \phi \), i.e., the transform is the identity mapping. Hence, we get \(\theta_{k+1} = \theta_k - \alpha \nabla \mathcal{L}(\theta_k) \).

Example 2. \(h(\theta) = \Sigma_i \{\theta_i \ln(\theta_i) - \theta_i\} \) where $\theta$ satisfies \(\Sigma_i \theta_i = 1\).

\[\nabla h(\theta) = [\ln(\theta_1), \cdots, \ln(\theta_n)]^T, \quad (\nabla h)^{-1}(\phi) = \exp (\phi) \]

\[\langle \nabla h(\theta), \theta' - \theta \rangle = \Sigma_i \frac{\partial h}{\partial \theta_i} (\theta'_i - \theta_i) = \Sigma_i (\ln \theta_i) (\theta'_i - \theta_i)\]

\[\Rightarrow D_h (\theta'||\theta) = \Sigma_i \{\theta'_i \ln(\frac{\theta'_i}{\theta_i}) - \theta'_i +\theta_i\} = \Sigma_i \theta'_i \ln(\frac{\theta'_i}{\theta_i})\]
using $\Sigma_i \theta_i = \Sigma_i \theta'_i = 1$. Note that with this $h$, the Bregman Divergence becomes the KL Divergence.
The update equation for mirror gradient descent becomes
\[\theta^{i}_{k+1} \leftarrow \exp\left(\ln(\theta^{i}_k) - \alpha \frac{\partial \mathcal{L}(\theta_k)}{\partial \theta_i}\right) = \theta^{i}_k \exp\left(-\alpha \frac{\partial \mathcal{L}(\theta_k)}{\partial \theta_i}\right)\]
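This update is often called exponentiated gradient descent. Below is a small NumPy sketch of it on an illustrative linear loss over the probability simplex (the loss and step size are our own assumptions); following the note above about constrained problems, we renormalize after each step so that \(\theta\) stays on the simplex.

```python
import numpy as np

# Illustrative loss (assumption): L(theta) = <c, theta>, a linear cost over the simplex.
c = np.array([0.9, 0.2, 0.5])

def grad_L(theta):
    return c

theta = np.ones(3) / 3   # start from the uniform distribution
alpha = 0.5

for k in range(100):
    # Mirror descent with h(theta) = sum_i theta_i ln(theta_i) - theta_i:
    # each coordinate is scaled by exp(-alpha * dL/dtheta_i).
    theta = theta * np.exp(-alpha * grad_L(theta))
    theta = theta / theta.sum()   # project back onto the simplex (renormalize)

print(theta)  # probability mass concentrates on the lowest-cost coordinate (index 1)
```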

Typically, it’s good to use mirror descent if you are able to pick a good \(h\) for that class of problems.

There are also many other variants of gradient descent. Each algorithm possesses unique strengths, so you may want to explore which one suits your own optimization problem best. Understanding and utilizing these techniques expands our problem-solving capabilities across diverse fields. So, please feel free to embrace these algorithms and explore their potential in your own endeavors.

Conclusion

Projected and Mirror Gradient Descent are powerful tools for constrained optimization problems in ML. These algorithms are more computationally expensive than regular gradient descent because they must account for the constraints. The Bregman Divergence is an expressive way to represent many forms of optimization (e.g., Examples 1 and 2). Constrained optimization is key in safety-critical domains, such as ML for robotics and healthcare, where constraint satisfaction must be enforced. Note: check out our blog post on KKT conditions for an alternative approach to constrained optimization: The method of Lagrange Multipliers.

Acknowledgements

This blog post was developed in part with support from course materials available at http://www.cs.cmu.edu/afs/cs.cmu.edu/academic/class/15850-f20/www/notes/lec19.pdf

Filed Under: Bootcamp

Bootcamp Summer 2020 Week 7 – Variational Inference in Heterogeneous Apprenticeship Learning

November 6, 2021 by Rohan Paleja and rliu302

Imagine we have an autonomous car that would like to perform speedy highway driving. There are several approaches to program such an autonomous vehicle. The first would be to hand-craft a rule-based system consisting of what to do in different scenarios. An example rule could be “if the autonomous car would like to move forward and there is a car in front and to the left, switch to the right lane”. A programmer could write down hundreds of rules consisting of every possible scenario. However, this process is expensive and leaves much to be desired. The second approach would be to utilize a driving simulator and design a reward function that defines the proper autonomous driving behavior. Then, one could use reinforcement learning to learn an optimal driving policy. However, defining such a reward function is difficult. Furthermore, if the driving simulator has differences with the real world, the behavior of the autonomous car may degrade when deployed. Lastly, one can utilize learning from demonstration (LfD), where user demonstrations are used directly to learn autonomous driving behavior. LfD has many benefits, including that it can learn human-like behavior and does not require a simulator. It is also much easier for a human to give an example of driving on the highway than defining all the rules used during driving. However, what happens when two drivers present demonstrations with conflicting behaviors (e.g., one prefers to pass on the left versus the other prefers to pass on the right)?

In this post, we are going to discuss how robots can learn from heterogeneous human demonstrators, i.e., demonstrators who perform the same task in different ways.

This blog post will address the following questions in 3 parts:

  1. What challenges arise from heterogeneity in demonstrations?
  2. How can we automatically infer preferences within demonstrations?
  3. How can we utilize person-specific embeddings in policy learning?

1. Challenges from heterogeneity of demonstrations

We can imagine multiple real-life scenarios, ranging from driving to the setting of a dinner table, in which there are heterogeneous approaches that can be utilized to reach the same goal. Continuing the autonomous driving example from the introduction, Figure 1 displays the heterogeneity that can occur within driving demonstrations. Within the example displayed, some subset of the demonstrators preferred to pass on the left while others preferred to pass on the right. LfD approaches that assume homogeneous demonstrations will either fit to the mean (drive straight into the car in front) or fit a single mode (only pass on one side), both producing suboptimal or limited driving behavior. It is important to model the heterogeneity within the Learning from Demonstration framework. We will accomplish this via variational inference for person-specific embeddings.

Heterogeneity in Highway Driving Demonstrations

The decision-making we employ to select our approach can be explicitly defined as a policy: a function that maps states $S$ to actions $A$ (Equation 1).

\begin{align}
\hat{\pi}: S \rightarrow A
\end{align}

Depending on an implicit demonstrator preference, the demonstrator may find it advantageous to choose one policy over the other. For example, a demonstrator setting the table may be left-handed and prefer to set a drinking glass on the left-hand side. Without explicitly labeling demonstrations by preference, the context within this selection process is lost to a robot attempting to learn the task, making it very difficult to learn a policy representation across all demonstrators. Without access to latent information about demonstrators, current LfD approaches sacrifice optimal behavior for consistency, weaken the robustness of the robot to different scenarios, and reduce the number of examples the robot can use to learn from.

As we consider scenarios where demonstrations are numerous, it becomes advantageous to be able to autonomously infer distinct modalities without requiring someone at hand to label each demonstration. In addition, we would also like to learn a policy representing the user demonstrations within our dataset. How can one accomplish both objectives, inferring latent information about demonstrated trajectories and learning a policy conditioned upon contextual and latent information, simultaneously?

2. Inferring trajectory modalities

We seek to autonomously infer distinct strategies or modes $\omega$ from the real-world demonstrations we are able to acquire. Following our autonomous driving example, Figure 2 displays the end result we seek to achieve: mode 1 demonstrations are assigned a specific latent code and mode 2 demonstrations are assigned a different latent code. Visually, in Figure 2 it is clear that there are only two types of demonstrations. We would like our approach to discover, first, that there are different modes within the demonstration dataset and, second, how many modes are present.

Inferred Demonstration Modalities in Highway Driving Demonstrations

To do so, we would like to maximize the mutual information $I$ between our embedding mode $\omega$ and the actions taken by our policy $\pi_\theta(a|\omega,s)$, where our policy is conditioned on the state and latent code. We display a representation of the mutual information in Equation 2, where H is the entropy function.


\begin{gather}
I(\omega;\pi(a|\omega,s)) = H(\omega) - H(\omega|\pi(a|\omega,s))
\end{gather}

Maximizing mutual information encourages $\omega$ to correlate with semantic features within the data distribution (i.e., mode discovery), which is exactly what we want! However, maximizing the mutual information between the trajectories and latent code is intractable as it requires access to the true posterior, $P(\omega|s,a)$, representing the distribution of modalities given trajectories. Instead, we can utilize a lower bound of the mutual information and maximize the lower-bound via gradient descent, thereby maximizing the mutual information and incentivizing the policy to utilize $\omega$ as much as possible.


Recall that the KL divergence is a measure of the difference between two probability distributions. We can express the KL divergence between $p$ and $q$ with Equation 3.

\begin{align}
D_{KL} \left( p (\omega | s, a) \ || \ q(\omega | s, a) \right) = \mathop{\mathbb{E}}_{\omega \sim p(\omega | s, a)} \log \frac{p(\omega | s, a)}{ q(\omega|s,a)}
\end{align}
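For intuition, here is a tiny NumPy example (with made-up probabilities) computing the KL divergence between two discrete distributions over latent modes; note that the divergence is not symmetric in its arguments.

```python
import numpy as np

# Two discrete distributions over three latent modes (illustrative numbers).
p = np.array([0.7, 0.2, 0.1])
q = np.array([0.4, 0.4, 0.2])

# D_KL(p || q) = E_{w ~ p}[ log p(w) / q(w) ]
kl_pq = np.sum(p * np.log(p / q))
kl_qp = np.sum(q * np.log(q / p))
print(kl_pq, kl_qp)  # the two values differ: the KL divergence is asymmetric
```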

We display the derivation of the mutual information lower bound in Equation 4. Below, we provide a step-by-step walkthrough of the derivation.
\begin{align}
\begin{split}
I(\omega;\pi(a|\omega,s)) &= H(\omega) - H(\omega|\pi(a|\omega,s)) \\
&= H(\omega)-\mathop{\mathbb{E}}_{a \sim \pi(\cdot| \omega, s)}\big[\mathop{\mathbb{E}}_{\omega \sim p(\omega| s,a)} [- \log p(\omega|s,a)]\big] \\
&= H(\omega)+\mathop{\mathbb{E}}_{a \sim \pi(\cdot| \omega, s)}\big[\mathop{\mathbb{E}}_{\omega \sim p(\omega| s,a)} [\log p(\omega|s,a)]\big] \\
&= H(\omega)+D_{KL}(p(\omega|s,a)||q(\omega|s,a)) + \mathop{\mathbb{E}}_{a \sim \pi(\cdot| \omega, s)}\big[\mathop{\mathbb{E}}_{\omega \sim p(\omega| s,a)} [\log q(\omega|s,a)]\big] \\
&\geq H(\omega) + \mathop{\mathbb{E}}_{a \sim \pi(\cdot| \omega, s)}\big[\mathop{\mathbb{E}}_{\omega \sim p(\omega| s,a)} [\log q(\omega|s,a)]\big]
\end{split}
\end{align}

In line 1 of the derivation, we have Equation 2. To transition into line 2 of the derivation in Equation 4, we simplify the second entropy term. Line 3 of the derivation brings the negative sign outside of the expectation. Line 4 utilizes the identity $\mathop{\mathbb{E}}_{x \sim p_{X}(\cdot)}[\log p(x)] = D_{KL}(P||Q) + \mathop{\mathbb{E}}_{x \sim p_{X}(\cdot)}[\log q(x)]$ in transitioning from line 3. Here, $q$ is an approximate posterior for the true distribution of embeddings given trajectories. In transitioning to line 5, since the KL divergence between two distributions is always non-negative, we replace the equal sign with a greater-than-or-equal-to sign and remove the divergence term. We note that during optimization, we assume that the distribution of embeddings is a static, uniform distribution, and thus the entropy across latent codes, $H(\omega)$, is constant and can be removed. Line 5 displays the final result, the lower bound of the mutual information.

Notice that we have now described a substitute objective function that is expressed in terms of the latent mode variable $\omega$ and our approximate posterior $q$. Maximizing this objective should make different latent embeddings correspond with different policy behaviors. However, one problem remains: evaluating the lower bound requires sampling from the true posterior $p(\omega|s,a)$, which is still unknown. Therefore, we seek to gradually tease out the true distribution while training through sampling. We denote $\omega$, the discovered latent codes, as person-specific embeddings, as each embedding captures latent features of a specific demonstrator.

3. Simultaneously learning a policy while inferring person-specific embeddings

Our goal is to learn demonstrator policies that are conditioned on person-specific embeddings, $\omega$. Thus, even if heterogeneity is present within the demonstrations, the latent encodings will be able to represent the unique behaviors among demonstrators. The resultant policy is able to adapt to a person’s unique characteristics while simultaneously leveraging any homogeneity that exists within the data (e.g., uniform adherence to hard constraints).

Accordingly, we will want to maximize our ability to imitate demonstrators via a behavior cloning loss and the mutual information between person-specific embeddings and trajectories via a variational information maximization (VIM) loss. We display a general overview of the architecture used during learning from heterogeneous demonstration in Figure 3.

Architecture to Learn from Heterogeneous Demonstrations.

In this figure, we have a policy represented by $\pi$ and parameterized by $\theta$, and an approximate posterior represented by $q$ and parameterized by $\phi$. We start with a set of person-specific embeddings initialized uniformly. As training proceeds, we infer the person-specific embeddings ($\omega$ is a learned latent encoding via gradient descent) and learn a policy representing all demonstrators. We display a sample algorithm in Figure 4 for learning $\theta$, $\phi$, and $\omega$.

Algorithm for Learning from Heterogeneous Demonstrations.

In step 1 of the algorithm, we sample from the set of demonstrators and obtain the trajectories associated with the sampled demonstrator. We initialize a new embedding, $\omega^{(i)}$, for the demonstrator. In steps 2 and 3, we conduct a forward pass through the architecture displayed in Figure 3 and update parameters $\theta$ and $\omega$ via both the behavior cloning and VIM losses. In step 4, we utilize the VIM cost function to update $\phi$. We repeat this process until convergence. Once the algorithm has converged, every demonstrator’s person-specific embedding will accurately represent their modality. Utilizing this latent embedding within the learned policy will produce a representation of the demonstrator’s behavior.
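To make these steps concrete, here is a heavily simplified PyTorch sketch of the training loop; it is not the authors' implementation. It assumes continuous actions with a mean-squared-error behavior cloning loss and a fixed-variance Gaussian posterior \(q_\phi\) (so maximizing its log-likelihood reduces to minimizing a squared error); the network sizes, dimensions, and random placeholder demonstrations are all made up for illustration.

```python
import torch
import torch.nn as nn

# Dimensions and counts below are illustrative assumptions.
STATE_DIM, ACTION_DIM, EMBED_DIM, N_DEMONSTRATORS = 4, 2, 3, 5

# pi_theta(a | s, omega): policy conditioned on the state and a person-specific embedding.
policy = nn.Sequential(nn.Linear(STATE_DIM + EMBED_DIM, 32), nn.ReLU(),
                       nn.Linear(32, ACTION_DIM))
# q_phi(omega | s, a): approximate posterior; it predicts the mean of a fixed-variance
# Gaussian over omega, so maximizing log q amounts to minimizing a squared error.
posterior = nn.Sequential(nn.Linear(STATE_DIM + ACTION_DIM, 32), nn.ReLU(),
                          nn.Linear(32, EMBED_DIM))
# One learnable person-specific embedding per demonstrator (all initialized identically).
omegas = nn.Parameter(torch.zeros(N_DEMONSTRATORS, EMBED_DIM))

opt_policy = torch.optim.Adam(list(policy.parameters()) + [omegas], lr=1e-3)
opt_posterior = torch.optim.Adam(posterior.parameters(), lr=1e-3)

# Toy heterogeneous demonstrations: random (s, a) placeholders, one set per demonstrator.
demos = [(torch.randn(20, STATE_DIM), torch.randn(20, ACTION_DIM))
         for _ in range(N_DEMONSTRATORS)]

for step in range(1000):
    i = torch.randint(N_DEMONSTRATORS, (1,)).item()   # Step 1: sample a demonstrator
    s, a = demos[i]
    omega = omegas[i].expand(len(s), EMBED_DIM)

    # Steps 2-3: behavior cloning loss + VIM loss update theta (policy) and omega.
    a_hat = policy(torch.cat([s, omega], dim=-1))
    bc_loss = ((a_hat - a) ** 2).mean()
    vim_loss = ((posterior(torch.cat([s, a_hat], dim=-1)) - omega) ** 2).mean()
    opt_policy.zero_grad()
    (bc_loss + vim_loss).backward()
    opt_policy.step()

    # Step 4: update phi (posterior) to predict the current embedding from (s, a).
    pred = posterior(torch.cat([s, a], dim=-1))
    posterior_loss = ((pred - omegas[i].detach()) ** 2).mean()
    opt_posterior.zero_grad()
    posterior_loss.backward()
    opt_posterior.step()
```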

In conclusion, the main takeaways from our blog post are as follows:

  1. It is important to model the heterogeneity when Learning from Heterogeneous Demonstration (LfHD). We present a framework that can adapt to a person’s unique characteristics while simultaneously leveraging any homogeneity within the data.
  2. We infer person-specific latent embeddings with semantic meaning by maximizing the lower bound of the mutual information between an inferred latent variable and a policy that is conditioned upon the latent variable.

Further Details

If you want to explore more on this, feel free to check our recent NeurIPS paper.

Filed Under: Bootcamp

Bootcamp Summer 2020 Week 2 — Neural Networks and Backpropagation

January 18, 2021 by Rohan Paleja and Matthew Gombolay

Previously, we covered gradient descent, an optimization method that adjusts the values of a set of parameters (e.g., the parameters of a linear regression model) to minimize a loss function (e.g., the mean squared error of a linear regression model in predicting the value of a home given its zip code and other information). In this post, we introduce neural networks, which are essentially highly parameterized function approximators that learn to map a set of inputs to a set of outputs.

While we could optimize a neural network model with respect to a given loss function by directly applying gradient descent to adapt its parameters and learn a desired mapping of inputs to outputs, this process would be inefficient. Due to the large number of parameters, performing symbolic differentiation as introduced in our gradient descent lesson would require a lot of redundant computation and slow down the optimization process tremendously. Fortunately, there is a better way: the backpropagation algorithm. Backpropagation is a form of auto-differentiation that allows us to more efficiently compute the derivatives of the neural network (or other model’s) outputs with respect to each of its parameters. Backpropagation is often overlooked or misunderstood as a simple application of symbolic differentiation (i.e., the chain rule from Calculus), but it is much, much more.

This blog will cover

  1. Neural Networks
  2. Forward Propagation
  3. Backpropagation

We will not take the time to discuss the biological plausibility of neural networks, but feel free to check out this paper.

Neural Networks

We will start by understanding what a neural network is. At its core, a neural network is a function approximator. Functions from algebra, such as $y=x^2$ and $y=2x+1$, can be represented by neural networks. Neural networks can be used to learn a mapping given a set of data. For example, a neural network could be used to model the relationship of the age of a house, the number of bedrooms, and the square footage with the value of a house (displayed in Figure 1).

A sample relationship a neural network can be expected to learn.

While the example in Figure 1 may appear trivial, neural networks can solve much more challenging problems. For example, neural networks have been applied to take as input Magnetic Resonance Imaging (MRI) data and output diagnoses for the patient based upon what is present in the MRI. Neural networks have also had success in being applied to problems of “machine translation” in which one attempts to translate one language to another (e.g., English to Chinese).

Typically, neural networks are given a dataset $\mathcal{D}=\{(x_k,y_k),\ k\in [n]\}$, where $(x_k,y_k)$ is the   “example” number $k$, $x_k$ is the input to a model we want to train, and $y_k$ is the “label” we want our model to learn to predict from $x_k$. We generalize the notation to $x$ being the input data, and $y$ being the set of ground-truth labels. A neural network can be represented as a function $\hat{y} = f(x)$ (i.e., $f: X \rightarrow Y$). Here, we use “$\hat{y}$” as the output of the neural network instead of “$y$,” as $\hat{y}$ is our estimate of the ground-truth label, $y$. Later, we will use the difference (i.e., error) between $\hat{y}$ and $y$, to optimize our neural network’s parameters.

If we look inside the function $f$, we see nodes and edges. These nodes and edges make up a directed graph, in which an input is operated on from left to right.

A more colorful inside look into a neural network. The output node in this example neural network uses a linear activation, denoted by the diagonal line (“/”).

 

A closer look inside a smaller neural network is shown in Figure 2. Figure 2 displays how an input, $x$ (with cardinality $|x|=d$), is mapped to an output, $\hat{y}$, a scalar. Before we move into what is happening here, let’s first explain the notation.

We have neural network weights, denoted by $w$, that weight the incoming value through multiplication. The subscripts represent the (to, from) nodes respectively, and the superscript denotes the layer number, resulting in the notation $w^{(layer)}_{to,from}$. The nodes with the number “1” inside are termed bias nodes and are represented by the notation $b_{to}$.

As an example, we present a simple neural network describing $\hat{y} = 2x + 4z +3$ in Figure 4.

An example neural network representing $\hat{y}=2x + 4z +3$

When an input is passed into a neural network, the input $x$ is multiplied by the respective weights (denoted by $w$) and added to a bias term $b$, resulting in a new value represented within a hidden node. In our model, we set the number of nodes in each layer to $n$. Each node, along with its incoming edges, represents some inner function. As our neural network can have many nodes and many layers, neural networks can be thought of as a composition of these inner functions to produce a more complex, outer function $f$ described above. The diagonal line in the right-most node denotes a linear activation (effectively a weighted summation), resulting in a scalar output. Now we will discuss precisely how to produce an output $\hat{y}$ given an input $x$ through learning a mapping.

Forward Propagation

The final output $\hat{y}$ is computed from the node values and weights of the previous layer, in our case, layer 2. We can multiply the output of the hidden nodes in layer 2, denoted by $o^{(2)}_i$, where $i$ represents a node index from $1$ to $n$ (i.e., $i \in \{1,2,\ldots,n\}$), by the weights connecting these nodes with the output node and add the bias term $b^{(3)}_{1}$. We display this process in Equation 1.

\begin{align}
\hat{y} = \sum_{i=1}^n w_{1,i}^{(3)} o^{(2)}_i + b^{(3)}_{1}
\end{align}

But, what is $o_i^{(2)}$ equal to? We display how $o_i^{(2)}$ is computed in Equation 2.

\begin{align}
o_i^{(2)} = g(z_i^{(2)})
\end{align}

Here, $i$ is again the node index and $g$ is an activation function (i.e., $g: Z \rightarrow O$). Neural networks cannot represent nonlinear functions without utilizing nonlinear activation functions. Activation functions help neural networks represent more complex functions by transforming node values by a nonlinear function. Common activation functions used include ReLU, tanh, sigmoid, etc. Here, we will discuss and use the ReLU (Rectified Linear Unit) as our activation function.

For our example, we will use the ReLU activation function, as shown in Equation 3.

\begin{equation}g(z) =\bigg\{\begin{aligned} &z \text{ if }z \geq 0\\ &0 \text{ otherwise}\end{aligned}\bigg\}\end{equation}

It is also helpful to know the gradient of the ReLU function, displayed in Equation 4.

\begin{equation}g'(z) =\bigg\{\begin{aligned} &1 \text{ if }z \geq 0\\ &0 \text{ otherwise}\end{aligned}\bigg\}\end{equation}

Moving back to Equation 2, we solve for $z_i^{(2)}$ in Equation 5.

\begin{align}
z^{(2)}_i = \sum_{j=1}^n w_{i,j}^{(2)} o^{(1)}_j + b^{(2)}_{i}
\end{align}

This procedure can be used throughout the neural network, computing the current layer using the previous layer’s output. For simplicity of algorithmic notation, you can consider the neural network’s input, $x$, as $o^{(0)}$. Here, $o^{(0)}$ represents the hypothetical output of a hypothetical zeroth layer. In Algorithm 1, we display the complete forward propagation algorithm, expressing the layer number as a variable for generalizability. This algorithm computes a predicted output, represented by $\hat{y}$, given input features, $x$.
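The NumPy sketch below mirrors the structure of Algorithm 1 for a small fully connected network with ReLU hidden activations and a linear output node; the 2-3-1 architecture and random weights are illustrative assumptions rather than the exact network from the figures.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def forward(x, weights, biases):
    """Forward propagation: o^(0) = x, then z^(l) = W^(l) o^(l-1) + b^(l) and
    o^(l) = g(z^(l)) for hidden layers, with a linear activation at the output layer."""
    o = x
    outputs = [o]                                      # o^(0), o^(1), ..., o^(L)
    for l, (W, b) in enumerate(zip(weights, biases)):
        z = W @ o + b
        o = z if l == len(weights) - 1 else relu(z)    # last layer is linear
        outputs.append(o)
    return o, outputs

# Illustrative 2-3-1 network: weights[l][i, j] plays the role of w^(l+1)_{i,j}.
rng = np.random.default_rng(0)
weights = [rng.normal(size=(3, 2)), rng.normal(size=(1, 3))]
biases = [rng.normal(size=3), rng.normal(size=1)]

y_hat, _ = forward(np.array([0.5, -1.0]), weights, biases)
print(y_hat)
```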

We can now analyze the computational complexity of this algorithm. For the sake of simplifying the analysis, we assume the number of layers in the neural network, $l$, is equal to the number of nodes in each layer, $n$, which we also set as equal to the size of the input, $|x|=d$. Thus, $n=l=d$. In Algorithm 1, Steps 3 and 5 represent a matrix multiplication combined with addition. This process has a complexity of $O(n^2+n)$. As a reminder, for any matrix multiplication of two matrices of size $p \times q$ and $q \times r$, respectively, the computational complexity of this multiplication is $O(pqr)$. Thus, since we are multiplying weight matrices of size $n \times n$ to a vector of size $n \times 1$, the complexity of the matrix multiplication alone is $O(n \times n \times 1) = O(n^2)$. The addition operation adds complexity $O(n)$, resulting in $O(n^2+n)$ for Steps 2-6. Steps 8-10 have a complexity of $O(n)$, resulting in an overall complexity of $O(n^2 + 2n)$. Since Step 1 iterates $n=l$ times, we arrive at a total complexity of $O(n(n^2 + 2n))=O(n^3)$ for Algorithm 1.

Now that we can perform a forward pass and know its complexity, we can discuss how to update our weights and attempt to learn a model that fits the data.

Backpropagation

To optimize our neural network weights to more accurately represent our data, we first establish a loss function that quantifies our model’s error (i.e., how wrong its outputs are for a desired task). Given this loss function, we could then compute the gradients of our loss function with respect to each weight. We could then perform gradient descent to optimize our model parameters as we did in the gradient descent blog post. However, we will now do something a bit more sophisticated to make the optimization process more computationally efficient.

Let’s consider a common loss function, as shown in Equation 6. This loss function represents a mean-squared error loss between the true label (true y-value in the data) and the predicted value using our neural network ($\hat{y}$).

\begin{align}
L=\frac{1}{2} (y-\hat{y})^2 = \frac{1}{2} (y-f(x))^2
\end{align}

The derivative of this loss function with respect to $w_{1,1}^{(1)}$ is displayed in Equation 7 according to the chain rule.

\begin{align}
\frac{\partial L}{\partial w_{1,1}^{(1)}} = -(y-\hat{y}) \cdot \frac{\partial \hat{y}}{\partial w_{1,1}^{(1)}} \bigg|_x
\end{align}

So, how do we compute $\frac{\partial \hat{y}}{\partial w_{1,1}^{(1)}}$? There are 4 paths from the output node to $w_{1,1}^{(1)}$, which are highlighted in blue in Figure 4. The upper-most path has the gradient shown in Equation 8.

\begin{align}
\frac{\partial \hat{y}}{\partial w_{1,1}^{(1)}} =  \frac{\partial \hat{y}}{\partial o^{(3)}} \frac{\partial o^{(3)}}{\partial z^{(3)}} \frac{\partial z^{(3)}}{\partial o^{(2)}_1} \frac{\partial o^{(2)}_1}{\partial z^{(2)}_1} \frac{\partial z^{(2)}_1}{\partial o^{(1)}_1} \frac{\partial o^{(1)}_1}{\partial w_{1,1}^{(1)}}
\end{align}

The rule for computing the “total derivative” encompassing all of these paths can be seen through an example here.

A depiction of neural network dependencies on $w_{1,1}^{(1)}$.

Computing a single partial derivative is computationally expensive, and computing each weight’s partial derivative individually (e.g., $\frac{\partial \hat{y}}{\partial w_{1,1}^{(1)}}$, then $\frac{\partial \hat{y}}{\partial w_{1,2}^{(1)}}$, and so forth all the way to $\frac{\partial \hat{y}}{\partial w_{1,n}^{(l)}}$) would be terribly inefficient.

If we instead look at $\frac{\partial \hat{y}}{\partial w_{1,2}^{(1)}}$, you can see that it shares many of the same nodes as $\frac{\partial \hat{y}}{\partial w_{1,1}^{(1)}}$. In Figure 5, we display a depiction of gradient computations required for $w_{1,1}^{(1)}$, $w_{2,1}^{(1)}$, and the overlap in the computations.

A depiction of neural network dependencies on $w_{1,1}^{(1)}$ (left), neural network dependencies on $w_{2,1}^{(1)}$ (middle), and the overlap between dependencies.

We would like to take advantage of this overlap in computation, which is exactly where the Backpropagation algorithm comes in. Backpropagation is a form of reverse-mode automatic differentiation that allows us to eliminate redundant computation in applying the chain rule to compute the gradient of our neural network.

Applied to neural networks, the backpropagation algorithm starts from the last layer of the neural network, computes an error term (denoted as $\delta$), and multiplies this error term by the gradient. The algorithm computes each node’s respective gradient in memory and reuses these variables as the algorithm continues, backpropagating the information for efficiency. We display the algorithm for Backpropagation in Algorithm 2. The output of the Backpropagation algorithm is gradients for each set of weights and biases.
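The following NumPy sketch shows one forward pass followed by backpropagation for the same kind of small network (ReLU hidden layer, linear output) and the loss in Equation 6. It is a minimal illustration of reusing the error term \(\delta\) layer by layer, not a transcription of Algorithm 2, and the 2-3-1 architecture, weights, and data are made up.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def relu_grad(z):
    return (z >= 0).astype(float)

def forward_backward(x, y, weights, biases):
    """Forward pass, then backpropagation of dL/dW and dL/db for L = 1/2 (y - y_hat)^2,
    with ReLU hidden layers and a linear output layer."""
    # Forward pass, caching every layer's pre-activation z and output o.
    o, os, zs = x, [x], []
    for l, (W, b) in enumerate(zip(weights, biases)):
        z = W @ o + b
        o = z if l == len(weights) - 1 else relu(z)
        zs.append(z)
        os.append(o)
    y_hat = o

    # Backward pass: delta holds dL/dz for the current layer and is reused as we move
    # toward the input, which is what removes the redundant path-by-path computation.
    delta = -(y - y_hat)                                # linear output layer
    grads_W, grads_b = [], []
    for l in reversed(range(len(weights))):
        grads_W.insert(0, np.outer(delta, os[l]))       # dL/dW^(l)
        grads_b.insert(0, delta)                        # dL/db^(l)
        if l > 0:
            delta = (weights[l].T @ delta) * relu_grad(zs[l - 1])
    return y_hat, grads_W, grads_b

# Illustrative 2-3-1 network and a single training example (values made up).
rng = np.random.default_rng(0)
weights = [rng.normal(size=(3, 2)), rng.normal(size=(1, 3))]
biases = [rng.normal(size=3), rng.normal(size=1)]
x, y = np.array([0.5, -1.0]), np.array([2.0])

y_hat, grads_W, grads_b = forward_backward(x, y, weights, biases)

# One parameter update (Equations 9-10, with the step taken along the negative gradient).
alpha = 0.1
weights = [W - alpha * gW for W, gW in zip(weights, grads_W)]
biases = [b - alpha * gb for b, gb in zip(biases, grads_b)]
```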

Looking at the computational complexity of this algorithm, we see Step 4 has a complexity of $O(1)$. Step 6 has a complexity of $O(n^2 + n)$ or $O(n^2)$. Step 10 has a complexity of $O(n)$. Steps 12 and 14 both have complexities of $O(n^2)$. Since we iterate over the number of layers, equal to $n$ for our analysis, the total complexity is $O(n^3)$. If we did not use the Backpropagation algorithm, we would end up redundantly computing each component’s derivatives multiple times. Using the previous assumptions setting the cardinality of the input, $|x|$, the number of nodes, and the number of layers all to $n$, we would be iterating over $O(n^3 + n^2)$ weight derivatives, resulting in a total complexity of $O(n^6)$.

The last step in optimizing our neural network model to minimize our loss function is to update weights and biases. Equations 9-10 show one step of gradient descent using the derivatives from our backpropagation procedure in Algorithm 2. Note that Equations 9-10 represent only one step of gradient descent, so one would need to iteratively perform the forward pass, then backward pass, then parameter updates (Equations 9-10) until the model converges to a local minimum, as described in our previous post on gradient descent.

\begin{align}
w^{(l)} \to w^{(l)} + \alpha \Delta w^{(l)}
\end{align}
\begin{align}
b^{(l)} \to b^{(l)} + \alpha \Delta b^{(l)}
\end{align}

We now have the ability to train neural networks in a computationally efficient manner, which, in turn, allows us to do more training iterations per unit time, which should help us train more accurate neural networks!

Key Points throughout this Blog:

  1. Neural networks can be used to represent almost any function.
  2. Forward propagation is the process of taking input features, $x$, and computing an output in the form of a model’s estimate, $\hat{y}$.
  3. We can use the difference in the output, $\hat{y}$, of the neural network and our ground-truth data, $y$, to compute a loss, and then use this loss value to compute gradients in the backpropagation algorithm.
  4. Backpropagation is a special case of reverse-mode automatic differentiation and allows for efficient gradient computation. The result is that the training of extremely large neural networks is possible!

Filed Under: Bootcamp

Joint Goal and Strategy Inference across Heterogeneous Demonstrators via Reward Network Distillation

August 16, 2020 by Zac Chen, Rohan Paleja, Leng and Matthew Gombolay

“Learning from Demonstration” (LfD) techniques aim to enable robots to be taught by non-roboticists how to perform tasks, such as folding laundry, driving a car, or performing surgery. For decades, people have relied on the incorrect assumption that every teacher performs tasks similarly, i.e., everyone folds laundry or performs surgery in a similar fashion. However, this assumption is often not true. Humans (and even experts) typically adopt heuristics (i.e., “mental shortcuts”) to solve challenging optimization problems due to a human’s limited working memory. These highly-refined strategies present heterogeneity in behavior across human task demonstrators. Only recently have researchers sought to relax this assumption of demonstrator homogeneity. However, even these new approaches do not explicitly tease out precisely what was similar in all the demonstrations versus what was unique to each teacher. Rather, some previous methods divide the demonstrations into clusters beforehand and perform LfD for homogeneous demonstrations within one cluster. However, this limits the amount of data accessible to LfD algorithms, resulting in worse learning.

We take a new perspective in LfD by modeling every human teacher as having a shared notion of the desired task (a common goal) along with a teacher-specific notion (e.g., strategy, or style) for performing the task. Accordingly, we present Multi-Strategy Reward Distillation (MSRD), a novel Inverse Reinforcement Learning (IRL) technique that automatically finds and takes advantage of the common goal shared by all teachers (e.g., perform the task well) while teasing out and accounting for what makes each teacher unique! This blog illustrates how MSRD works through five parts:

  1. Preliminaries to Understand MSRD
  2. Method of MSRD
  3. Simulation Performance of MSRD
  4. Table Tennis Real-World Experiment of MSRD
  5. Conclusion

Preliminaries

In this section, we do a quick overview of preliminaries.

What is Reinforcement Learning?

We briefly recap the principles of Markov Decision Processes (MDPs) and Reinforcement Learning (RL) in this section via a robot navigation task example (shown in Figure 1).

Robot Maze Example
This figure depicts a robot navigation task example.



In this example, the task for the robot is to navigate to the star (right bottom corner) starting from the top left corner. Mathematically, we could model the process as an MDP described by a 6-tuple $(S,A,R,T,\gamma,\rho_0)$.

  • $S$ represents the state space. In our example, a state $s\in S$ could be which grid the agent is standing on.
  • $A$ represents the action space. In the example, an action $a\in A$ could be trying to move to one of the four adjacent grid cells.
  • $\gamma\in (0,1)$ is the temporal discount factor, representing the relative preference towards instant reward versus long-term reward.
  • $R(s,a)$ represents the reward after executing action $a$ in the state $s$. In some cases, $R(s,a)$ could be simplified as $R(s)$. In our example, $R(s)$ is $0$ everywhere except for $+1$ for the star position.
  • $T(s,a,s^\prime)$ is the transition probability to $s^\prime$ from state $s$ after taking action $a$. For example, for the initial state and action “go right”, the agent will move one step right with probability $1$ if the robot is perfect. But if the robot is imperfect (which is always the case!), it is possible that the robot will instead go down with a probability $0.1$.
  • $\rho_0(s)$ is the initial state distribution. In the example case, $\rho_0(s)$ has probability $1$ to be in the top left corner.

The standard goal of reinforcement learning is to find the optimal policy $\pi^*(a|s)$ that maximizes the discounted future reward, shown in Equation \ref{eq:rl_objecive}.
\begin{align}
J(\pi)=\mathbb{E}_{\tau\sim \pi}\left[\sum_{t=0}^T{\gamma^tR(s_t,a_t)}\right]
\label{eq:rl_objecive}
\end{align}
$\tau=(s_0,a_0,\cdots,s_T,a_T)$ denotes a sequence of states and actions induced by the policy and dynamics: $s_0\sim\rho_0(s)$, $a_t\sim\pi(\cdot |s_t)$, $s_{t+1}\sim T(s_t,a_t,\cdot)$, $t\in\{0,1,\cdots,T\}$.
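As a tiny worked example (our own, with an assumed 3x3 grid), the sketch below encodes the navigation task’s reward, $R(s) = 0$ everywhere except $+1$ at the goal, and evaluates the discounted return of two demonstrated trajectories that reach the goal by different routes.

```python
# Toy 3x3 gridworld (illustrative): states are (row, col); the goal is the bottom-right cell.
GOAL = (2, 2)
gamma = 0.9

def reward(state):
    """R(s): 0 everywhere except +1 at the goal, as in the example above."""
    return 1.0 if state == GOAL else 0.0

def discounted_return(trajectory):
    """sum_t gamma^t * R(s_t) over the visited states of a trajectory."""
    return sum(gamma ** t * reward(s) for t, s in enumerate(trajectory))

# Two near-optimal demonstrations that reach the goal along different routes.
tau_1 = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2)]   # go right, then down
tau_2 = [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2)]   # go down, then right
print(discounted_return(tau_1), discounted_return(tau_2))   # equal returns: 0.9**4
```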

We instead consider a more general maximum entropy objective, which augments the standard objective with an entropy bonus to favor stochastic policies and to encourage exploration during optimization [Ziebart 2008], shown in Equation \ref{eq:rl_me_objective}.
\begin{align}
J(\pi)=\mathbb{E}_{\tau\sim \pi}\left[\sum_{t=0}^T{\gamma^tR(s_t,a_t)}+\alpha H(\pi(\cdot | s_t))\right]
\label{eq:rl_me_objective}
\end{align}

What is Inverse Reinforcement Learning?

Inverse Reinforcement Learning (IRL) considers an MDP sans reward function ($M\backslash R$) with the goal of inferring the reward function $R(s,a)$ given a set of demonstrated trajectories $\mathcal{D}=\{\tau_1, \tau_2, \cdots, \tau_N\}$. For example, in Figure 1, we draw two demonstrated trajectories $\tau_1$ and $\tau_2$. A typical assumption for IRL is that demonstrated trajectories are optimal, or at least near-optimal or stochastically optimal. In the case of our example ($R(s)$ is $0$ everywhere except for $+1$ for the star position), both trajectories are optimal, i.e., each obtains the $+1$ reward in the least time.

Adversarial Inverse Reinforcement Learning (AIRL) casts the problem of LfD in a generative adversarial framework, learning a discriminator, $D$, to distinguish between experts and a deceiving generator, $\pi$, that learns to imitate the expert. This framework follows Maximum-Entropy IRL’s assumption that trajectory likelihood is proportional to the exponential of trajectory rewards, i.e., $p(\tau)\propto \exp\left(\sum_{t=1}^T{\gamma^tr_t}\right)$. The Discriminator $D$ is defined in Equation \ref{eqn:D_and_f}, where $f_\theta(s,a)$ is the learnable reward function and $\pi$ is the current policy. $D$ is updated by minimizing its cross entropy loss to distinguish expert trajectories from generator policy rollouts (Equation \ref{eqn:airl_loss}). The generator $\pi$ is trained via RL (Equation \ref{eq:rl_me_objective}) to maximize the pseudo-reward function given by $f_\theta(s,a)$.
\begin{equation}
D_\theta(s,a)=\frac{\exp(f_\theta(s,a))}{\exp(f_\theta(s,a))+\pi(a|s)}
\label{eqn:D_and_f}
\end{equation}

\begin{align}
L_D=&-\mathbb{E}_{\tau\sim\mathcal{D}, (s,a)\sim\tau}[\log D(s,a)]-\mathbb{E}_{\tau\sim\pi, (s,a)\sim\tau}[\log(1- D(s,a))]
\label{eqn:airl_loss}
\end{align}

TL;DR: IRL algorithms (including AIRL) learn a reward function that could explain demonstration behavior. Further, IRL trains a policy that maximizes the learned reward, which should behave similarly to the demonstration.
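Here is a minimal PyTorch sketch of the AIRL-style discriminator and its cross-entropy loss, written in log space for numerical stability; the network architecture, batch sizes, and the random placeholder samples (including the policy log-probabilities) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM = 4, 2   # illustrative dimensions

# f_theta(s, a): the learnable reward function inside the discriminator.
f_theta = nn.Sequential(nn.Linear(STATE_DIM + ACTION_DIM, 32), nn.ReLU(),
                        nn.Linear(32, 1))

def discriminator_logs(s, a, log_pi_a):
    """D_theta(s,a) = exp(f_theta(s,a)) / (exp(f_theta(s,a)) + pi(a|s)).
    Returns (log D, log(1 - D)), computed stably in log space."""
    f = f_theta(torch.cat([s, a], dim=-1)).squeeze(-1)
    log_denom = torch.logaddexp(f, log_pi_a)
    return f - log_denom, log_pi_a - log_denom

# Placeholder batches (random, for illustration). log_pi_* would come from the
# current policy's log-probability of each sampled action.
s_exp, a_exp = torch.randn(64, STATE_DIM), torch.randn(64, ACTION_DIM)
s_gen, a_gen = torch.randn(64, STATE_DIM), torch.randn(64, ACTION_DIM)
log_pi_exp, log_pi_gen = -torch.rand(64), -torch.rand(64)

log_D_exp, _ = discriminator_logs(s_exp, a_exp, log_pi_exp)
_, log_1mD_gen = discriminator_logs(s_gen, a_gen, log_pi_gen)

# Cross-entropy loss: push D -> 1 on expert samples and D -> 0 on generator samples.
loss_D = -(log_D_exp.mean() + log_1mD_gen.mean())
loss_D.backward()   # gradients for f_theta; an optimizer step would follow
```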

Multi-Strategy Reward Distillation Method

In MSRD, we consider a heterogeneous demonstration dataset, $\mathcal{D}^{(i)}=\{\tau_1^{(i)},\tau_2^{(i)}, \cdots, \tau_M^{(i)}\}$, where $i\in\{1,2,\cdots, N\}$ is the strategy index, $M$ is the number of demonstration trajectories for one strategy, and $N$ is the number of strategies. In our robot navigation example, $\mathcal{D}^{(1)}=\{\tau_1^{(1)}\}, \mathcal{D}^{(2)}=\{\tau_1^{(2)}\}$, i.e., there are two different strategies, and each strategy has one demonstration.

The core assumption of MSRD is fairly straightforward: the reward function each demonstration is optimizing is a linear combination of the task reward ($R^{(0)}$) and a strategy-only reward, as given by Equation \ref{eq:msrd_assumption}. We call $\widetilde{R}^{(i)}$ the strategy-only reward. When the strategy-only reward is combined with the task reward, it becomes the strategy-combined reward ($R^{(i)}$), meaning one wants to finish the task while also maintaining the strategy.
\begin{align}
R^{(i)}(\cdot) &= R^{(0)}(\cdot) + \alpha_i \widetilde{R}^{(i)}(\cdot)
\label{eq:msrd_assumption}
\end{align}

Our approach is unique in that it shares the task reward with all the demonstrations while having the ability to tease out demonstrator-specific nuances (i.e., heterogeneity).

This diagram shows how MSRD learns.


To infer the shared task reward function, we propose utilizing network distillation to distill common knowledge from each separately learned strategy-combined reward $R^{(i)}$ to the task reward function $R^{(0)}$. The learning diagram is shown in Figure 2. We also want to regularize $R^{(i)}$ to be close to $R^{(0)}$, since we have the prior knowledge that despite heterogeneous strategic preferences, all experts should still be prioritizing optimizing the task reward $R^{(0)}$ to achieve at least near-optimal performance. We propose to regularize the expected L2-norm of the difference between the reward functions, as shown in Equation \ref{eqn:l2}, in which $\pi^{(i)}$ is the optimal policy under reward function $R_{\theta_i}^{(i)}$.
\begin{align}
\label{eqn:l2}
L_\text{reg}=\mathbb{E}_{(s,a)\sim \pi^{(i)}}\left({\left\lvert\left\lvert R_{\theta_i}^{(i)}(s,a)-R_{\theta_0}^{(0)}(s,a)\right\rvert\right\rvert}_2\right)
\end{align}
Note that we are using an index both on $\theta$ and $R$ to denote that each strategy-combined reward $R^{(i)}$ has its own reward parameters, and that these are approximated by separate neural networks with parameters $\theta_i$ for each strategy and $\theta_0$ for the task reward. There is no parameter sharing between strategy and task reward.

To avoid the computational cost of fully optimizing $\pi^{(i)}$ and $R^{(i)}$ at each step, we apply an iterative reward-function and policy training schedule similar to AIRL. Combining AIRL’s objective (Equation \ref{eqn:airl_loss}) with the distillation objective, we want to maximize $L_D$ in Equation \ref{eqn:vanilla_distillation}.
\begin{align}
\label{eqn:vanilla_distillation}
L_D=\sum_{i=1}^N&\left[{\mathbb{E}_{(s,a)\sim \tau_j^{(i)}\sim \mathcal{D}^{(i)}}{\log D_{\theta_i}(s,a)}}
+ {\mathbb{E}_{(s,a)\sim \pi^{(i)}}{\log (1-D_{\theta_i}(s,a))}}
- {\mathbb{E}_{(s,a)}\left({\left\lvert\left\lvert R_{\theta_i}^{(i)}(s,a)-R_{\theta_0}^{(0)}(s,a)\right\rvert\right\rvert}_2\right)}\right]
\end{align}
Yet, while Equation \ref{eqn:vanilla_distillation} should be able to distill the shared reward into $R_{\theta_0}^{(0)}$, the distillation is inefficient as $R_{\theta_0}^{(0)}$ will work as a strong regularization for $R_{\theta_i}^{(i)}$ before successful distillation.

Instead, our structure in Equation \ref{eq:msrd_assumption} allows for a two-column (reference) re-parameterization, speeding up knowledge transfer and making the learning process easier. Combining Equation \ref{eq:msrd_assumption} and Equation \ref{eqn:vanilla_distillation}, we arrive at Equation \ref{eqn:reparameterized_distillation}.

\begin{align}
\label{eqn:reparameterized_distillation}
L_D&=\sum_{i=1}^N\left[{\mathbb{E}_{(s,a)\sim \tau_j^{(i)}\sim \mathcal{D}^{(i)}}{\log D_{\theta_i,\theta_0}(s,a)}}
+ {\mathbb{E}_{(s,a)\sim \pi^{(i)}}{\log \left(1-D_{\theta_i,\theta_0}(s,a)\right)}}
- \alpha_i{\mathbb{E}_{(s,a)}\left({\left\lvert\left\lvert \widetilde{R}_{\theta_i}^{(i)}(s,a)\right\rvert\right\rvert}_2\right)}\right]
\end{align}

The key difference between Equation \ref{eqn:reparameterized_distillation} and Equation \ref{eqn:vanilla_distillation} is that $D$ depends on both $R_{\theta_0}^{(0)}$ and $\widetilde{R}_{\theta_i}^{(i)}$ instead of separate $R_{\theta_i}^{(i)}$. Thus, $R_{\theta_0}^{(0)}$ directly updates from the discriminator’s loss rather than waiting for knowledge to be learned by a strategy-combined reward and subsequently distilled into a task reward.

Further, the last term of Equation \ref{eqn:reparameterized_distillation} reduces to a simple L2-regularization on the strategy-only reward’s output, weighted by $\alpha_i$. This formulation provides us with a new view to interpret the relative weights of the strategy-only reward $\alpha_i$: the larger $\alpha_i$ is, the more the strategy-only reward will influence the strategy-combined reward. Therefore, we will have higher regularization to account for possible overwhelming of the task reward function. Comparing Equations \ref{eqn:reparameterized_distillation} and \ref{eqn:airl_loss}, we could interpret MSRD in another view: optimizing $\theta_i$ only via the IRL objective results in a combination of task and strategy reward, and adding regularization on the strategy reward encourages encoding only the necessary information in $\theta_i$ and sharing more knowledge in $\theta_0$.
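The sketch below illustrates this re-parameterization in PyTorch: a shared task-reward network $R_{\theta_0}^{(0)}$, separate strategy-only networks $\widetilde{R}_{\theta_i}^{(i)}$, the strategy-combined reward from Equation \ref{eq:msrd_assumption}, and the regularizer on the strategy-only reward's output (written here as a mean squared penalty for simplicity). The network sizes, the $\alpha_i$ values, and the random rollout batch are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM, N_STRATEGIES = 4, 2, 3       # illustrative sizes
alphas = [0.1, 0.1, 0.1]                            # per-strategy weights alpha_i (assumed)

def make_reward_net():
    return nn.Sequential(nn.Linear(STATE_DIM + ACTION_DIM, 32), nn.ReLU(),
                         nn.Linear(32, 1))

# Separate networks with no parameter sharing: theta_0 for the shared task reward,
# theta_i for each strategy-only reward.
task_reward = make_reward_net()
strategy_rewards = [make_reward_net() for _ in range(N_STRATEGIES)]

def strategy_combined_reward(i, s, a):
    """R^(i)(s,a) = R^(0)(s,a) + alpha_i * R~^(i)(s,a)."""
    sa = torch.cat([s, a], dim=-1)
    return task_reward(sa) + alphas[i] * strategy_rewards[i](sa)

# The re-parameterized regularizer penalizes the strategy-only reward's output,
# pushing shared structure into the task reward network.
s, a = torch.randn(64, STATE_DIM), torch.randn(64, ACTION_DIM)   # placeholder rollout batch
i = 0
reg = alphas[i] * strategy_rewards[i](torch.cat([s, a], dim=-1)).pow(2).mean()
print(strategy_combined_reward(i, s, a).shape, reg.item())
```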

Simulation Results

We tested MSRD on two simulated environments: inverted pendulum and hopper. The goal of the inverted pendulum task is to balance a pendulum on a cart by moving the cart left/right. The reward for the inverted pendulum is defined as the negative absolute value of the pendulum angle from the upright position. The objective of hopper is to control its 3-DoF joints to move forward. The reward for hopper is defined as the speed at which it moves forward. We removed termination conditions to gain flexibility in behaviors.

For our virtual experiments, we first generated a variety of demonstrations to emulate the heterogeneous strategies humans apply in solving problems, via “DIAYN + Extrinsic Reward” and a novel method called “KL-Encouraged + Extrinsic Reward”. The two methods essentially encourage the trained policies to be different from each other while still accomplishing the task goal. We generated 20 different strategies for Inverted Pendulum and the 2 most distinct strategies for Hopper, shown in Figure 3 and Figure 4.

These GIFs show different strategies for Inverted Pendulum.

These GIFs show different strategies for Hopper.

Hypotheses

We explore two hypotheses to elucidate MSRD’s advantage in recovering both the latent task reward (the essential goal of the demonstrators) and the means by which the task is accomplished (i.e., the strategy):

H1: The task reward learned by MSRD has a higher correlation with the true task reward than AIRL.
H2: Strategy-only reward learned by MSRD has a higher correlation with true strategic preferences than AIRL.

We assessed both hypotheses quantitatively and qualitatively for the simulation environments as the ground-truth reward functions are available.

To test H1, we constructed a dataset of trajectories that have various task performances. We then evaluated the reward function learned by AIRL and MSRD on the trajectories, comparing estimated vs. ground-truth rewards. We show a correlation of estimated rewards and ground-truth task rewards in Figure 5. The task reward function learned through MSRD has a higher correlation with the ground-truth reward function (0.99 and 0.94) versus AIRL (0.51 and 0.89) for each domain, respectively. AIRL’s reward function overfits to some strategies and mixes the task reward with that strategy-only reward, making its estimation unreliable for other strategies’ trajectories.

This figure shows the correlation between ground-truth and estimated task reward, normalized to $[0,1]$, for inverted pendulum (left) and hopper (right). $r$ is the correlation coefficient.

To test H2, we calculated the correlations of MSRD’s strategy-only rewards with the true strategic preferences and compared that with the correlation of AIRL’s rewards when AIRL is trained on each individual strategy. In simulated domains, true strategic preferences are available. Correlations of both methods for all strategy rewards in inverted pendulum are shown in Figure 6. A paired t-test shows that MSRD achieves a statistically significantly higher correlation (M = 0.779, SD = 0.239) for the strategy rewards versus AIRL (M = 0.635, SD = 0.324) trained separately for each strategy, $t(19) = 1.813$, $p = 0.0428$ (one-tailed). A Shapiro-Wilk test showed the residuals were normally distributed ($p = 0.877$). For the hopper domain, MSRD achieved $0.85$ and $0.93$ correlation coefficient for the hop and crawl strategy, compared with AIRL’s $0.80$ and $0.82$ respectively.

This figure shows the correlation between ground-truth and estimated strategy reward on Inverted Pendulum.

We further show qualitative results for the learned strategy reward. For Inverted Pendulum, we fix three dimensions (cart position, cart velocity and pendulum angular velocity) to zero and investigate the reward change within the one remaining dimension (pendulum angle). The relationships between reward and pendulum angle in the task and strategy reward functions are illustrated in Figure 7, in which the task reward function reaches its peak when the angle of the pendulum is near zero. This precisely recovers the task reward. For strategy-only reward functions, strategy 13 encourages the pendulum to lean left, while strategy 4 encourages the policy to tilt the pendulum right. Therefore, the strategy-only rewards learned by MSRD capture specific preferences within the demonstrations. Figure 7 also shows the magnitude of the task reward is larger than the strategy reward, which affirms our expectation that an emphasis is being put towards accomplishing the task.

This figure illustrates the learned strategy reward on Inverted Pendulum.

In the hopper environment, it is harder to visualize the reward landscape due to a high-dimensional observation space and a lack of interpretability of states. Therefore, instead of visualizing a reward curve, we evaluate the estimated strategy-only reward on trajectories from both strategies. Figure 8 shows that when given a hopping trajectory, the hop strategy-only reward function gives higher reward for that behavior than crawl strategy-only reward function. Similarly, in the crawl trajectory case, the crawling strategy-only reward gives a higher value than the hop strategy-only reward. Therefore, the strategy-only reward function recovered by MSRD gives a higher reward to the corresponding behavior than the other strategy-only reward function, thus providing encouragement to the policy towards the intended behavior.

This figure shows the strategy rewards evaluated on different strategy demonstrations on Hopper.

These results across our simulated environments show our algorithms’ success in both task reward recovery and strategy reward decomposition. This capability is a novel contribution to the field of LfD in that we are able to tease out strategies from the underlying task and effectively learn policies that can reproduce both the strategy and a well-performed policy for the underlying task.

Robot Table Tennis Results

We tested MSRD on its capability to learn various table tennis strokes from expert human demonstration. Participants were asked to kinesthetically teach a robot arm to hit an incoming ping pong ball using three different stroke strategies: push, slice, and topspin. The change in angle and the upward/downward motion of the paddle throughout the policy trajectory are key factors in the display of different strategies (as these induce spin). Push is associated with a small change in angle as it is not attempting to add any spin onto the ball. Slice places a backspin on the ball, and thus the angle of the paddle will quickly tilt up. Conversely, to achieve a topspin on the ball, the associated trajectory has a quick upward motion.

After just 30 minutes of training on each strategy, the robot was able to learn to strike $83\%$ of the fed balls, shown in Figure 9. The robot learned to perform all strategies, and the robot’s best strategy, topspin, resulted in $90\%$ of balls returned successfully.

These GIFs show three strategies of table tennis strikes (from top to bottom: push, slice, topspin).

Conclusion

We explored the important case of IRL when demonstrators provide heterogeneous, strategy-based demonstrations while still seeking to maximize a latent task reward. By explicitly modeling the task and strategy rewards separately, and jointly learning both, our algorithm is able to recover more robust task rewards, discover unique strategy rewards, and imitate strategy-specific behaviors.

Filed Under: Research Summaries
