
CORE Robotics Lab


AAMAS 2022 Tutorial – Graph Neural Networks: Learning Representations of Robot Team Coordination Problems

January 18, 2022 by Matthew Gombolay and Zheyuan Wang

Tutorial at the International Conference on Autonomous Agents and Multi-Agent Systems 2022

Overview

Robot teams are increasingly being deployed in environments such as manufacturing facilities and warehouses to save cost and improve productivity. To efficiently coordinate multi-robot teams, fast, high-quality scheduling algorithms are essential for satisfying the temporal and spatial constraints imposed by dynamic task specifications and part and robot availability. Traditional solutions include exact methods, which are intractable for large-scale problems, and application-specific heuristics, which require expert domain knowledge. What is critically needed is a new, automated approach that learns lightweight, application-specific coordination policies without the need for hand-engineered features.

This tutorial is an introduction to graph neural networks and a showcase of their power in solving multi-robot coordination problems. We survey various frameworks of graph neural networks in the recent literature, with a focus on their application in modeling multi-agent systems. We will introduce multi-robot coordination (MRC) problems and review the most relevant methods available for solving them. We will discuss several successful applications of graph neural networks to MRC problems, with hands-on tutorials in the form of Python example code. With this tutorial, we aim to provide an experience of employing graph neural networks in modeling multi-robot systems, from algorithm development to code implementation, thus opening up future opportunities for designing graph-based learning algorithms in broader multi-agent research.

Content

In this tutorial, we will discover the power of graph neural networks for learning effective representations of multi-robot team coordination problems. The tutorial will feature two 90-minute online sessions.

Date and Time: May 8, 11:00 – 14:00 (EDT)

  • 3:00 – 6:00, May 9, Auckland Time
  • 11:00 – 14:00, May 8, New York Time
  • 17:00 – 20:00, May 8, Paris Time
  • 20:30 – 23:30, May 8, Kolkata Time
  • 23:00 – 2:00 (+1 day) May 8, Beijing Time

Zoom link: https://gatech.zoom.us/j/93966169541 [Video recordings are posted below]

Session #1 by Matthew Gombolay (11:00 – 12:30)

The first session will address the following: (a) How graph neural networks work – we will present a comprehensive overview of various graph neural networks proposed in the prior literature, including both homogeneous and heterogeneous graphs and attention mechanisms; (b) How to model team coordination problems with graph neural networks – we will discuss which applications can be suitably modeled with graph neural networks, with a focus on MRC problems; (c) How to optimize the parameters of graph neural networks for team coordination problems – we will discuss which learning methods can be used for training a graph neural network-based solver. We will conclude the session with a discussion of recurring challenges and open questions.

Tutorial notes: [Slides]

Video:

Session #2 by Zheyuan Wang (12:30 – 14:00)

The second session will provide a hands-on tutorial on how to get up and running with graph neural networks for coordination problems, with coding examples in a Python Jupyter notebook. In particular, we will look into the ScheduleNet architecture [6], a heterogeneous graph neural network-based solver for MRC problems under temporal and spatial constraints. The Jupyter notebook will work through the model implementation, training, and evaluation of ScheduleNet models on a synthetic dataset.

Python Jupyter notebook: [Open in Google Colab] [Github]

Video:

Target Audience

The target audience is students and researchers who have a strong grasp of machine learning but who may not yet be familiar with graph neural networks. Prior knowledge of Multi-agent Reinforcement Learning (MARL) or Planning & Scheduling would be helpful but is not necessary. While the tutorial showcases the application of graph neural networks in solving multi-robot coordination problems, the methodology introduced can be applied by the audience to broader research problems in learning representations of multi-agent systems.

Presenters

Dr. Matthew Gombolay is an Assistant Professor of Interactive Computing at the Georgia Institute of Technology. He received a B.S. in Mechanical Engineering from the Johns Hopkins University in 2011, an S.M. in Aeronautics and Astronautics from MIT in 2013, and a Ph.D. in Autonomous Systems from MIT in 2017. Gombolay's research interests span robotics, AI/ML, human–robot interaction, and operations research. Between defending his dissertation and joining the faculty at Georgia Tech, Dr. Gombolay served as a technical staff member at MIT Lincoln Laboratory, transitioning his research to the U.S. Navy and earning an R&D 100 Award. His publication record includes a best paper award from the American Institute of Aeronautics and Astronautics, a finalist for best student paper at the American Control Conference, and a finalist for best paper at the Conference on Robot Learning. Dr. Gombolay was selected as a DARPA Riser in 2018, received 1st place for the Early Career Award from the National Fire Control Symposium, and was awarded a NASA Early Career Fellowship for increasing science autonomy in space.

Zheyuan Wang is a Ph.D. candidate in the School of Electrical and Computer Engineering (ECE) at the Georgia Institute of Technology. He received the B.S. and M.E. degrees, both in Electrical Engineering, from Shanghai Jiao Tong University. He also received the M.S. degree from ECE, Georgia Tech. He is currently a Graduate Research Assistant in the Cognitive Optimization and Relational (CORE) Robotics Lab, led by Prof. Matthew Gombolay. His current research interests focus on graph-based policy learning that utilizes graph neural networks for representation learning and reinforcement learning for decision-making, with applications to human-robot team collaboration, multi-agent reinforcement learning, and stochastic resource optimization.

Reading Materials

  1. Ernesto Nunes, Marie Manner, Hakim Mitiche, and Maria Gini. 2017. A taxonomy for task allocation problems with temporal and ordering constraints. Robotics and Autonomous Systems 90 (2017), 55–70.
  2. Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. 2018. Graph attention networks. International Conference on Learning Representations (2018).
  3. Xiao Wang, Houye Ji, Chuan Shi, Bai Wang, Yanfang Ye, Peng Cui, and Philip S Yu. 2019. Heterogeneous graph attention network. The World Wide Web Conference (2019), 2022–2032.
  4. Jie Zhou, Ganqu Cui, Shengding Hu, Zhengyan Zhang, Cheng Yang, Zhiyuan Liu, Lifeng Wang, Changcheng Li, and Maosong Sun. 2020. Graph neural networks: A review of methods and applications. AI Open 1 (2020), 57-81.
  5. Zheyuan Wang and Matthew Gombolay. 2020. Learning scheduling policies for multi-robot coordination with graph attention networks. IEEE Robotics and Automation Letters 5, 3 (2020), 4509–4516.
  6. Zheyuan Wang, Chen Liu, and Matthew Gombolay. 2021. Heterogeneous graph attention networks for scalable multi-robot scheduling with temporospatial constraints. Autonomous Robots (2021), 1–20.

 

Filed Under: Bootcamp

Bootcamp Summer 2020 Week 3 – Value Iteration and Q-learning

January 19, 2021 by Matthew Gombolay and Pradyumna Tambwekar

Value iteration and Q-learning make up two fundamental algorithms of Reinforcement Learning (RL). Many of the amazing feats in RL over the past decade, such as Deep Q-Learning for Atari, or AlphaGo, were rooted in these foundations. In this blog, we will cover the underlying model RL uses to describe the world, i.e. a Markov decision process (MDP), as well as two algorithms for performing RL: value iteration and Q-learning. By the end of this blog post, you should be able to understand the connection between value iteration and Q-learning and how to employ either of these algorithms to learn the optimal policy.

This post is divided into three parts:

  1. Markov decision processes
  2. Value functions and Value iteration
  3. Q-learning and Deep Q-learning

Prerequisites:

The only prerequisites you will need to understand this blog are gradient descent and backpropagation. Both of these topics were covered in our first two blog posts (week 1 and week 2).

Markov Decision Processes

An MDP is a 5-tuple describing an environment consisting of the following five components:

  1. States: A state, $s$, is any predefined momentary instance in the world that an agent can exist in. For the rest of this post, we will use the variable $S$ to represent the set of all possible states in the world, with $s \in S$ referring to an individual state.
  2. Actions: An action, $a$, is an event facilitated by the agent that can transition the agent from one state to another, provided that such a transition is possible in the MDP. We will use $A$ to represent the set of all possible actions in the world, with $a \in A$ referring to an individual action. We note that actions may not have deterministic consequences. For example, flipping a coin may not give you the same result each time! The degree to which actions have deterministic effects is described by the transition function.
  3. Transition function: The transition function is a function that defines the probability of moving to a specific next state, given your current state and a valid action. The transition function, $T$, is mathematically defined as follows, $ T: S \times A \times S’ \rightarrow [0,1]$.
  4. Reward: The reward function specifies a real number value that defines the efficacy or a measure of “goodness” for being in a state, taking an action and landing in the next state. Similarly to the transition function, the reward is defined as follows, $R: S \times A \times S’ \rightarrow \rm I\!R $.  Note that the state you end up in may be uncontrollable since state transitions can be dynamic.
  5. Discount Factor: The discount factor can be specified using $\gamma$, where $\gamma \in [0,1)$. Note the non-inclusive upper bound for the discount factor (i.e., $\gamma \neq 1$). Disallowing $\gamma = 1$ makes an MDP more mathematically robust. Specifically, the goal for RL algorithms is often to maximize discounted reward across time. Consider the case of an infinite-horizon MDP (i.e., the MDP never ends) in which the rewards are always positive. If the discount factor, $\gamma$, is equal to 1, then the sum of future discounted rewards will be infinite, making it difficult for RL algorithms to converge (i.e., to know when they can stop determining which actions should be taken in each state). Thus, the choice of $\gamma$ is critical for the success of RL algorithms, e.g., Q-learning and value iteration.
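
To make these five components concrete, the short snippet below writes out a toy two-state MDP in plain Python. This is a minimal sketch for illustration only: the state names, rewards, and transition probabilities are invented, and a real problem would substitute its own.

```python
# A toy two-state MDP, written out explicitly.  All names and numbers here
# are invented for illustration.
states = ["home", "work"]
actions = ["stay", "move"]

# T[s][a][s'] is the transition probability T(s, a, s').
T = {
    "home": {"stay": {"home": 1.0},
             "move": {"work": 0.9, "home": 0.1}},   # "move" sometimes fails
    "work": {"stay": {"work": 1.0},
             "move": {"home": 0.9, "work": 0.1}},
}

# R[s][a][s'] is the reward R(s, a, s') for that transition.
R = {
    "home": {"stay": {"home": 0.0},
             "move": {"work": 1.0, "home": 0.0}},   # reaching "work" pays +1
    "work": {"stay": {"work": 0.0},
             "move": {"home": 0.0, "work": 0.0}},
}

gamma = 0.95  # discount factor, strictly less than 1

# Sanity check: every T(s, a, .) must be a valid probability distribution.
for s in states:
    for a in actions:
        assert abs(sum(T[s][a].values()) - 1.0) < 1e-9
```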

You may have read about a concept known as Markov chains. While Markov chains will not be covered in this post, it is important to understand the distinction between Markov chains and Markov decision processes, as both concepts share the Markov property. Fundamentally, a Markov chain consists of the components of a Markov decision process except actions, rewards, and a discount factor. An agent in a Markov chain is not in control of its movements; the world is in control of the agent's movements. In a Markov decision process, the agent has an influence on the outcomes. Thus, the transition function of a Markov chain is simply defined as $T: S \times S' \rightarrow [0,1]$.

Markov Property: The Markov property holds when the next state depends only on the current state and current action and is independent of the history of states and actions prior to that.

Nota bene: In some MDPs, you may see the initial state distribution included as a part of the MDP. While that is still a valid representation, in this blog, we are going to remain agnostic of the initial state distribution.

Value functions and Value Iteration

The value function, $V(s)$, is a measure of the expected reward you can receive from any given state $s$ given an MDP and a policy describing which actions an agent takes in each state. Formally, a policy, $\pi: S \rightarrow [0,1]^{|A|}$, is a probability distribution over actions, $a \in A$, conditioned on a state, $s$. For the sake of this blog, we will consider a policy to be deterministic (i.e., $\pi(s,a) = 1$ for a single action, $a$, and $\pi(s,a’)=0$ for all other $a’ \neq a$). Having defined a policy, we can now mathematically define the value function for a policy: $$V^{\pi}(s) = E[\sum_{t=0}^{\infty} \gamma^t r_t | \pi, s_0 = s]$$ In a deterministic world, this expectation could be ignored; however, in general, we include the expectation as both the policy (e.g., an agent that doesn’t always like the same thing for breakfast) and the transition function (e.g., how bad traffic is) could describe nondeterministic attributes of the problem we want to model.

The goal of Value Iteration is to find a policy that maximizes the value function: $$\pi^*(s) = argmax_{\pi \in \Pi} V^{\pi}(s), \forall s \in S $$ In this equation $\Pi$ represents the set of all possible policies within the MDP, and $\pi^*(s)$ is the optimal policy. What the value function equation (i.e., $V^{\pi}(s) = E[\sum_{t=0}^{\infty} \gamma^t r_t | \pi, s_0 = s]$) states is that you are going to follow a policy, $\pi$, starting at state, $s$, for up to an infinite number of time steps, $t$, proceeding forward from $t=0$ to $t = \infty$ along a trajectory (i.e., a sequence of state-action pairs, $(\langle s_1,a_1 \rangle , \langle s_2,a_2 \rangle,\ldots)$), where the accumulated reward, $r_t$, is discounted via multiplication with $\gamma^t$. This process of discounting rewards means that, the further you go into the future, rewards have a diminishing impact on value or “goodness” of being in state, $s$. Adjusting $\gamma$ allows you to define your preferences towards optimizing for short-term vs long-term gain. The expectation matters because, as stated in the previous paragraph, transitions can be non-deterministic, thus the expectation provides a normalized estimate of the discounted reward.

Value iteration is a computational algorithm that provides a means of finding the optimal policy, $\pi^*$. The algorithm works by iteratively determining the value of being in each state, $s,$ assuming that the agent takes the best possible action in that state under the current estimate of the value function. The value iteration algorithm is shown in the figure below.

Value Iteration

This algorithm iteratively updates the value, $V(s)$, for each state, $s$, until it reaches a point where the change in value is negligible, as defined by the threshold $\Delta_0$. Take note of Line 7 in this algorithm, in which we consider a state, $s$, an action, $a$, and the resulting state, $s'$, in a sequence, $\langle s, a, s' \rangle$. Line 7, called the value update, works by updating the value for the given state and is reminiscent of Bellman's optimality condition. This condition says $$ V(s) = \max_{a \in A} E [R(s,a,s') +  \gamma V(s')] = \max_{a \in A} \sum_{s'\in S} T(s,a,s')[R(s,a,s') +  \gamma V(s')]$$

When this condition holds for a given state, $s$, we would find in Line 8 that $temp – V(s) = 0$. If the condition holds for all states, $s$, then the value function has converged to the optimal value function, $V^{*} = V^{\pi^*}$, and the optimal policy, $\pi^*$, can simply be extracted by Line 11. In a sense, Value Iteration works by iteratively pretending or enforcing (Line 7) that Bellman's optimality condition holds, measuring whether the condition does not hold (Line 8), and terminating once the condition holds within an acceptable margin of error (Line 3).

Upon convergence (i.e., reaching an acceptable level of error, $\Delta < \Delta_0$), Value Iteration returns a value function, $V^{*}$, that provides an estimate of the future expected reward while following the optimal policy, $\pi^*$, as well as the optimal policy itself. To be perfectly clear, we should note that the actual value function and policy returned are only approximately optimal, and it would be more technically accurate to refer to them as $V \approx V^*$ and $\pi \approx \pi^*$.

Nota Bene: In some textbooks, the Bellman update equation may be alternatively defined, considering the reward function to be independent of the next state $s'$, i.e., $$V(s) \leftarrow  \max_a R(s,a) + \sum_{s'} T(s,a,s')\gamma V(s')$$ However, the update rule shown in the value iteration algorithm above is a more complete representation of the value estimate because you cannot always assume that, for any given MDP, your reward will not depend on the next state.
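
To make the pseudocode concrete, here is a minimal Python sketch of the value update and the final policy extraction, assuming the MDP is stored in the nested-dictionary form used in the toy example earlier in this post; the variable names are ours, and the loop structure is a simplification of the algorithm in the figure rather than a line-by-line transcription.

```python
def value_iteration(states, actions, T, R, gamma=0.95, delta_0=1e-6):
    """Tabular value iteration; T[s][a][s'] and R[s][a][s'] are nested dicts."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            # Bellman backup: assume the best action is taken in state s.
            temp = max(
                sum(T[s][a][s2] * (R[s][a][s2] + gamma * V[s2]) for s2 in T[s][a])
                for a in actions
            )
            delta = max(delta, abs(temp - V[s]))
            V[s] = temp
        if delta < delta_0:   # change in value is negligible -> stop
            break
    # Extract the (approximately) optimal policy from the converged values.
    pi = {
        s: max(actions,
               key=lambda a: sum(T[s][a][s2] * (R[s][a][s2] + gamma * V[s2])
                                 for s2 in T[s][a]))
        for s in states
    }
    return V, pi
```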

Let us now define two surrogate terms, the Q-value and the Q-function. Similar to the value function, the Q-function is a quality measure. However, unlike the value function, the Q-function measures the expected discounted reward for taking a particular action at a given state, i.e., $Q: S \times A \rightarrow \rm I\!R$. We refer to the Q-value as what the Q-function returns.

Although Q-learning is the go-to algorithm for estimating Q-values and consequently optimal policies from these Q-values, you may also modify the value iteration algorithm to solve for the Q-values.

Value Iteration (To compute Q-values)

There are two primary differences between the two algorithms resulting from the introduction of a Q-function. First, since the Q-function is dependent on both states and actions, we need to loop through all possible actions (Line 6) as well as the set of all states. Second, you'll notice that the update rule (Line 8) has changed upon incorporating the Q-function. Since the estimate for the optimal Q-value, $Q(s',a')$, for the next state, $s'$, depends on the next action, $a'$, we need to replace "$V(s')$" from Line 7 of Algorithm 1 with "$\max_{a'} Q(s',a')$" in Line 8 of Algorithm 2.
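
The sketch below shows how the value-iteration code from earlier changes when we store Q-values instead, again assuming the nested-dictionary MDP layout; note the extra loop over actions and the max over the next action inside the backup.

```python
def q_value_iteration(states, actions, T, R, gamma=0.95, delta_0=1e-6):
    """Value iteration over Q-values (Algorithm 2), as a minimal sketch."""
    Q = {s: {a: 0.0 for a in actions} for s in states}
    while True:
        delta = 0.0
        for s in states:
            for a in actions:  # extra loop over actions (Line 6)
                temp = sum(
                    T[s][a][s2] * (R[s][a][s2]
                                   + gamma * max(Q[s2][a2] for a2 in actions))
                    for s2 in T[s][a]
                )
                delta = max(delta, abs(temp - Q[s][a]))
                Q[s][a] = temp  # update rule with the max over a' (Line 8)
        if delta < delta_0:
            break
    return Q
```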

We note that Algorithms 1 and 2 are often referred to as flavors of "exact RL" because, unlike in the function approximation algorithms commonly used in RL (i.e., deep learning-based RL), here, we are guaranteed to be able to solve for the true, optimal Q-values within a margin of error, $\Delta$, given sufficient computational resources. "Approximate" RL techniques (e.g., deep RL) do not typically have such guarantees.

An important question to answer is, then, "Why do we not do this in practice instead of Q-learning?" Unfortunately, most real-world applications of RL contain too many states to iterate over even once, let alone the multitude of times required for the algorithm to converge to an acceptable error threshold. Hence, function approximation algorithms, such as Deep Q-learning, are preferred.

Q-Learning and Deep Q-learning

Prior to discussing Q-learning itself, let us define a term called the Bellman Residual.

$$ \delta(s,a,s') = [R(s,a,s') +  \gamma \max_{a'}Q(s',a')] - Q(s,a) $$

The Bellman residual, denoted by $\delta$, computes a measurement error for Q-learning to describe how wrong the current estimate of the Q-function is given a transition sequence, $\langle s,a,s' \rangle$. When the Q-function is optimal, we should find that the modified form of the Bellman optimality condition holds: $$Q(s,a) = E[R(s,a,s') + \gamma \max_{a'\in A} Q(s',a')] = \sum_{s'} T(s,a,s')[R(s,a,s') + \gamma \max_{a'\in A} Q(s',a')]$$

When this condition holds, we should find that $ \delta(s,a,s’) = 0 $ for all possible transition sequences. However, when we first initialize a Q-function, as we do in Line 2 of Algorithm 3, our Q-function will almost certainly be wrong, and $ \delta(s,a,s’)$ will not equal zero. So, how do we improve our Q-function to be correct?

The way Q-learning works is by "blending" our current estimate of the Q-function, $Q(s,a)$, with a "point-estimate" of what the Q-function should be, i.e., $R(s,a,s') +  \gamma \max_{a'}Q(s',a')$. The degree of blending is controlled by a hyperparameter, $\alpha$, in the following equation: $$ Q(s,a) = (1 - \alpha) Q(s,a) + \alpha (R(s,a,s') +  \gamma \max_{a'}Q(s',a')) $$

If $\alpha = 1$, then we completely ignore our current estimate of the Q-function. If $\alpha=0$, then we completely ignore new information. A decent starting place for setting $\alpha$ in practice is to choose $\alpha = 0.99$, which puts most of the emphasis on the newly observed point estimate. However, we encourage you to experiment with this parameter!

The full algorithm for Q-learning is shown in the algorithm pictured below.

Q-Learning Algorithm

It is important to note that, unlike in the version of the Bellman equation described earlier, Q-learning does not include a transition function in its Bellman update equation! In Q-learning, we instead simulate actions within the environment and utilize the states visited during the simulation (i.e., the trajectory) to apply the Bellman equation. The simulate function in the algorithm shows how trajectories are sampled in Q-learning. In each timestep of the trajectory, we use an epsilon-greedy sampling strategy to select an action. If a randomly sampled probability is greater than a predefined $\epsilon$ value, then we greedily sample the action; otherwise, we randomly sample an action from the space of all actions. Upon selection, we simulate that action and observe the next state and the reward received for that action. Finally, we store all of this observed information in the memory buffer, update the current state, and repeat this process until we reach a terminal state. The trajectory computed from each simulation is then used to update the Q-values via the Bellman update equation (Line 6 of the Q-learning algorithm).

The absence of a transition function makes Q-learning a model-free RL algorithm, as it does not need any prior knowledge of "the world" to learn the optimal policy. This model-free characteristic is significant because, in practice, you rarely have access to the transition probabilities for all state-action pairs in real-world applications.
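
Putting the blending equation and the epsilon-greedy sampling strategy together, here is a minimal sketch of tabular Q-learning. The `env` object and its `reset()`/`step()` interface are assumptions made for illustration, not part of any particular library, and the default $\alpha$ follows the suggestion above.

```python
import random

def q_learning(env, states, actions, episodes=500,
               alpha=0.99, gamma=0.95, epsilon=0.1):
    """Tabular Q-learning sketch.  We assume env.reset() returns a start state
    and env.step(a) returns (next_state, reward, done); this interface is
    illustrative only."""
    Q = {s: {a: 0.0 for a in actions} for s in states}
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Epsilon-greedy action selection.
            if random.random() < epsilon:
                a = random.choice(actions)                 # explore
            else:
                a = max(actions, key=lambda a_: Q[s][a_])  # exploit
            s2, r, done = env.step(a)
            # Blend the old estimate with the new point estimate.
            target = r + gamma * max(Q[s2][a2] for a2 in actions)
            Q[s][a] = (1 - alpha) * Q[s][a] + alpha * target
            s = s2
    return Q
```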

With enough time and sampled trajectories, Q-learning will be able to estimate the optimal Q-values for each state-action pair. However, as currently constructed, the Q-learning algorithm requires maintaining a lookup table of Q-values for every possible state-action pair and filling in each of those values. The size of the state-action spaces in real-world tasks makes maintaining such a lookup table infeasible. One possible solution to this intractability is the use of function approximation techniques, e.g., Deep Q-learning. The term "deep" comes from parameterizing Q-values using neural networks. Therefore, instead of learning a literal table of Q-values, our goal shifts towards learning the weights of a neural network that can output the Q-value for any given state-action pair. Employing the square of the Bellman residual as our loss function, we can apply the backpropagation algorithm (covered in Week 2) to learn the weights of the Q-network.

Deep Q-learning
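
As a sketch of what one such update looks like in code, the snippet below computes the squared Bellman residual for a batch of stored transitions and backpropagates it through a small Q-network. We assume PyTorch here purely for convenience; the network sizes, hyperparameters, and the shape of `batch` are illustrative assumptions, not prescriptions.

```python
import torch
import torch.nn as nn

state_dim, n_actions, gamma = 4, 2, 0.99          # made-up sizes
q_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                      nn.Linear(64, n_actions))
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

def dqn_update(batch):
    """One gradient step on the squared Bellman residual.

    `batch` is assumed to be tensors (s, a, r, s2, done) sampled from a memory
    buffer of simulated transitions, with `a` holding integer action indices."""
    s, a, r, s2, done = batch
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)        # Q(s, a)
    with torch.no_grad():
        # Point estimate of what Q(s, a) should be (no future value if done).
        target = r + gamma * (1 - done) * q_net(s2).max(dim=1).values
    loss = ((target - q_sa) ** 2).mean()          # squared Bellman residual
    optimizer.zero_grad()
    loss.backward()                               # backpropagation (Week 2)
    optimizer.step()
    return loss.item()
```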

When compared to Q-learning, there are three primary reasons why Deep Q-learning is better suited for real-world applications:

  1. Not enough memory: Q-learning requires storing a lookup table of all states and actions, leading to an infeasible RAM requirement for most modern applications.
  2. Not enough time: Deep Q-learning is often significantly faster to train because of the ability to incorporate techniques like batched gradient descent and adaptive optimization. In addition, Deep Q-learning does not require convergence of the Q-value for each state-action pair.
  3. Interpolation: Deep Q-Learning is a function approximation algorithm. Because the Q-network approximates Q-values for states and actions, we can assume that the network can interpolate Q-values for similar state-action pairs, i.e., $ Q(s,a) \approx Q(s',a')$ if $(s,a) \approx (s',a')$. In practice, Deep Q-learning is more efficient than seeking convergence for each individual state-action pair. However, there are pathological cases where this may not be true, for example, when the slightest change in the state or action may result in drastically different outcomes.

While we have not yet covered enough to fully implement Deep Q-learning and understand the tips and tricks of implementing a Deep Q-learning model in practice, please stay tuned for a future blog on that topic!

Key Takeaways from this blog:

  1. A Markov Decision Process (MDP) is a tuple used to describe an environment and is made up of components used in reinforcement learning algorithms.
  2. The goal of most reinforcement learning algorithms is to maximize the cumulative expected reward, which is also known as the value function.
  3. Value iteration is an iterative algorithm that uses the Bellman equation to compute the optimal MDP policy and its value.
  4. Q-learning, and its deep-learning counterpart Deep Q-learning, is a model-free RL algorithm that learns the optimal MDP policy using Q-values, which estimate the "value" of taking an action at a given state.

Filed Under: Bootcamp Tagged With: bellman equation, Markov Decision Process, MDP, Q-Learning, Reinforcement learning, Value Iteration

Bootcamp Summer 2020 Week 2 — Neural Networks and Backpropagation

January 18, 2021 by Rohan Paleja and Matthew Gombolay

Previously, we covered gradient descent, an optimization method that adjusts the values of a set of parameters (e.g., the parameters of a linear regression model) to minimize a loss function (e.g., the mean squared error of a linear regression model in predicting the value of a home given its zip code and other information). In this post, we introduce neural networks, which are highly parameterized function approximators that learn to map a set of inputs to a set of outputs.

To optimize a neural network model with respect to a given loss function, we could directly apply gradient descent to adapt the parameters of the network to learn a desired mapping of inputs to outputs; however, this process would be inefficient. Due to the large number of parameters, performing symbolic differentiation as introduced in our gradient descent lesson would require a lot of redundant computation and slow down the optimization process tremendously. Fortunately, there is a better way: the backpropagation algorithm. Backpropagation is a form of auto-differentiation that allows us to more efficiently compute the derivatives of the neural network's (or other model's) outputs with respect to each of its parameters. Backpropagation is often overlooked or misunderstood as a simple application of symbolic differentiation (i.e., the chain rule from Calculus), but it is much, much more.

This blog will cover

  1. Neural Networks
  2. Forward Propagation
  3. Backpropagation

We will not take the time to discuss the biological plausibility of neural networks, but feel free to check out this paper.

Neural Networks

We will start by understanding what a neural network is. At its core, a neural network is a function approximator. Functions from algebra, such as $y=x^2$ and $y=2x+1$, can be represented by neural networks. Neural networks can be used to learn a mapping given a set of data. For example, a neural network could be used to model the relationship of the age of a house, the number of bedrooms, and the square footage with the value of a house (displayed in Figure 1).

A sample relationship a neural network can be expected to learn.

While the example in Figure 1 may appear trivial, neural networks can solve much more challenging problems. For example, neural networks have been applied to take as input Magnetic Resonance Imaging (MRI) data and output diagnoses for the patient based upon what is present in the MRI. Neural networks have also had success in being applied to problems of “machine translation” in which one attempts to translate one language to another (e.g., English to Chinese).

Typically, neural networks are given a dataset $\mathcal{D}=\{(x_k,y_k),\ k\in [n]\}$, where $(x_k,y_k)$ is "example" number $k$, $x_k$ is the input to a model we want to train, and $y_k$ is the "label" we want our model to learn to predict from $x_k$. We generalize the notation to $x$ being the input data and $y$ being the set of ground-truth labels. A neural network can be represented as a function $\hat{y} = f(x)$ (i.e., $f: X \rightarrow Y$). Here, we use "$\hat{y}$" as the output of the neural network instead of "$y$," as $\hat{y}$ is our estimate of the ground-truth label, $y$. Later, we will use the difference (i.e., error) between $\hat{y}$ and $y$ to optimize our neural network's parameters.

If we look inside the function $f$, we see nodes and edges. These nodes and edges make up a directed graph, in which an input is operated on from left to right.

A more colorful inside look into a neural network. The output node in this example neural network uses a linear activation, denoted by the diagonal line (“/”).

 

A closer look inside a smaller neural network is shown in Figure 2. Figure 2 displays how an input, $x$ (with cardinality $|x|=d$), is mapped to an output, $\hat{y}$, a scalar. Before we move into what is happening here, let's first explain the notation.

We have neural network weights, denoted by $w$, that weight the incoming value through multiplication. The subscripts represent the (to, from) nodes, respectively, and the superscript denotes the layer number, resulting in the notation $w^{(layer)}_{to,from}$. The terms with the number "1" inside are termed bias nodes and are represented in the notation $b_{to}$.

As an example, we present a simple neural network describing $\hat{y} = 2x + 4z +3$ in Figure 3.

An example neural network representing $\hat{y}=2x + 4z +3$

When an input is passed into a neural network, the input $x$ is multiplied by respective weights (denoted by $w$) and added to by a bias term $b$, resulting in a new value represented within a hidden node. In our model, we set the number of nodes in each layer to $n$. Each node, along with its incoming edges, represents some inner function. As our neural network can have many nodes and many layers, neural networks can be thought of as a composition of these inner functions to produce a more complex, outer function $f$ described above. The diagonal line in the right-most node represents a linear activation applied to the weighted sum, resulting in a scalar-valued output. Now we will discuss precisely how to produce an output $\hat{y}$ given an input $x$ through learning a mapping.

Forward Propagation

The final output $\hat{y}$ is computed from the node values and weights of the previous layer, in our case, layer 2. We can multiply the output of the hidden nodes in layer 2, denoted by $o^{(2)}_i$, where $i$ represents a node index from $1$ to $n$ (i.e., $i \in \{1,2,\ldots,n\}$), by the weights connecting these nodes with the output node and add the bias term $b^{(3)}_{1}$. We display this process in Equation 1.

\begin{align}
\hat{y} = \sum_{i=1}^n w_{1,i}^{(3)} o^{(2)}_i + b^{(3)}_{1}
\end{align}

But, what is $o_i^{(2)}$ equal to? We display how $o_i^{(2)}$ is computed in Equation 2.

\begin{align}
o_i^{(2)} = g(z_i^{(2)})
\end{align}

Here, $i$ is again the node index and $g$ is an activation function (i.e., $g: Z \rightarrow O$). Neural networks cannot represent nonlinear functions without utilizing nonlinear activation functions. Activation functions help neural networks represent more complex functions by transforming node values by a nonlinear function. Common activation functions used include ReLU, tanh, sigmoid, etc. Here, we will discuss and use the ReLU (Rectified Linear Unit) as our activation function.

For our example, we will use the ReLU activation function, as shown in Equation 3.

\begin{equation}g(z) =\bigg\{\begin{aligned} &z \text{ if }z \geq 0\\ &0 \text{ otherwise}\end{aligned}\bigg\}\end{equation}

It is also helpful to know the gradient of the ReLU function, displayed in Equation 4.

\begin{equation}g'(z) =\bigg\{\begin{aligned} &1 \text{ if }z \geq 0\\ &0 \text{ otherwise}\end{aligned}\bigg\}\end{equation}

Moving back to Equation 2, we solve for $z_i^{(2)}$ in Equation 5.

\begin{align}
z^{(2)}_i = \sum_{j=1}^n w_{i,j}^{(2)} o^{(1)}_j + b^{(2)}_{i}
\end{align}

This procedure can be used throughout the neural network, computing the current layer using the previous layer’s output. For simplicity of algorithmic notation, you can consider the neural network’s input, $x$, as $o^{(0)}$. Here, $o^{(0)}$ represents the hypothetical output of a hypothetical zeroth layer. In Algorithm 1, we display the complete forward propagation algorithm, expressing the layer number as a variable for generalizability. This algorithm computes a predicted output, represented by $\hat{y}$, given input features, $x$.

We can now analyze the computational complexity of this algorithm. For the sake of simplifying the analysis, we assume the number of layers in the neural network, $l$, is equal to the number of nodes in each layer, $n$, which we also set as equal to the size of the input, $|x|=d$. Thus, $n=l=d$. In Algorithm 1, Steps 3 and 5 represent a matrix multiplication combined with addition. This process has a complexity of $O(n^2+n)$. As a reminder, for any matrix multiplication of two matrices of size $p \times q$ and $q \times r$, respectively, the computational complexity of this multiplication is $O(pqr)$. Thus, since we are multiplying weight matrices of size $n \times n$ by a vector of size $n \times 1$, the complexity of the matrix multiplication alone is $O(n \times n \times 1) = O(n^2)$. The addition operation adds complexity $O(n)$, resulting in $O(n^2+n)$ for Steps 2-6. Steps 8-10 have a complexity of $O(n)$, resulting in a complexity of $O(n^2 + 2n)$. Since Step 1 iterates $n=l$ times, we arrive at a total complexity of $O(n(n^2 + 2n))=O(n^3)$ for Algorithm 1.
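
A minimal NumPy sketch of this forward pass is shown below. The weight shapes, the random initialization, and the linear activation at the output layer mirror the running example above, but the specific sizes and numbers are invented for illustration.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def forward(x, weights, biases):
    """Forward propagation: weights[l] and biases[l] map o^(l) to z^(l+1)."""
    o = x                                  # treat the input as o^(0)
    for l, (W, b) in enumerate(zip(weights, biases)):
        z = W @ o + b                      # z^(l+1) = W^(l+1) o^(l) + b^(l+1)
        last_layer = (l == len(weights) - 1)
        o = z if last_layer else relu(z)   # linear activation at the output
    return o                               # y_hat

# Tiny usage example with made-up sizes (d inputs, n hidden nodes, 1 output).
rng = np.random.default_rng(0)
d, n = 3, 4
weights = [rng.normal(size=(n, d)), rng.normal(size=(n, n)), rng.normal(size=(1, n))]
biases  = [rng.normal(size=n), rng.normal(size=n), rng.normal(size=1)]
y_hat = forward(rng.normal(size=d), weights, biases)
```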

Now that we can perform a forward pass and know its complexity, we can discuss how to update our weights and attempt to learn a model that fits the data.

Backpropagation

To optimize our neural network weights to more accurately represent our data, we first establish a loss function that quantifies our model’s error (i.e., how wrong its outputs are for a desired task). Given this loss function, we could then compute the gradients of our loss function with respect to each weight. We could then perform gradient descent to optimize our model parameters as we did in the gradient descent blog post. However, we will now do something a bit more sophisticated to make the optimization process more computationally efficient.

Let’s consider a common loss function, as shown in Equation 6. This loss function represents a mean-squared error loss between the true label (true y-value in the data) and the predicted value using our neural network ($\hat{y}$).

\begin{align}
L=\frac{1}{2} (y-\hat{y})^2 = \frac{1}{2} (y-f(x))^2
\end{align}

The derivative of this loss function with respect to $w_{1,1}^{(1)}$ is displayed in Equation 7 according to the chain rule.

\begin{align}
\frac{\partial L}{\partial w_{1,1}^{(1)}} = -(y-\hat{y}) \frac{\partial \hat{y}}{\partial w_{1,1}^{(1)}} \bigg|_x
\end{align}

So, how do we compute $\frac{\partial \hat{y}}{\partial w_{1,1}^{(1)}}$? There are 4 paths from the output node to $w_{1,1}^{(1)}$, which are highlighted in blue in Figure 4. The upper-most path has the gradient shown in Equation 8.

\begin{align}
\frac{\partial \hat{y}}{\partial w_{1,1}^{(1)}} =  \frac{\partial \hat{y}}{\partial o^{(3)}} \frac{\partial o^{(3)}}{\partial z^{(3)}} \frac{\partial z^{(3)}}{\partial o^{(2)}_1} \frac{\partial o^{(2)}_1}{\partial z^{(2)}_1} \frac{\partial z^{(2)}_1}{\partial o^{(1)}_1} \frac{\partial o^{(1)}_1}{\partial w_{1,1}^{(1)}}
\end{align}

The rule for computing the “total derivative” encompassing all of these paths can be seen through an example here.

A depiction of neural network dependencies on $w_{1,1}^{(1)}$.

Computing a single partial derivative is computationally expensive, and computing each weight's partial derivative individually (e.g., $\frac{\partial \hat{y}}{\partial w_{1,1}^{(1)}}$, then $\frac{\partial \hat{y}}{\partial w_{1,2}^{(1)}}$, and so forth all the way to $\frac{\partial \hat{y}}{\partial w_{1,n}^{(l)}}$) would be terribly inefficient.

If we instead look at $\frac{\partial \hat{y}}{\partial w_{1,2}^{(1)}}$, you can see that it shares many of the same nodes as $\frac{\partial \hat{y}}{\partial w_{1,1}^{(1)}}$. In Figure 5, we display a depiction of gradient computations required for $w_{1,1}^{(1)}$, $w_{2,1}^{(1)}$, and the overlap in the computations.

A depiction of neural network dependencies on $w_{1,1}^{(1)}$ (left), neural network dependencies on $w_{2,1}^{(1)}$ (middle), and the overlap between dependencies.

We would like to take advantage of this overlap in computation, which is exactly where the Backpropagation algorithm comes in. Backpropagation is a form of reverse-mode automatic differentiation that allows us to eliminate redundant computation in applying the chain rule to compute the gradient of our neural network.

Applied to neural networks, the backpropagation algorithm starts from the last layer of the neural network, computes an error term (denoted as $\delta$), and multiplies this error term by the gradient. The algorithm computes each node’s respective gradient in memory and reuses these variables as the algorithm continues, backpropagating the information for efficiency. We display the algorithm for Backpropagation in Algorithm 2. The output of the Backpropagation algorithm is gradients for each set of weights and biases.

Looking at the computational complexity of this algorithm, we see Step 4 has a complexity of $O(1)$. Step 6 has a complexity of $O(n^2 + n)$ or $O(n^2)$. Step 10 has a complexity of $O(n)$. Steps 12 and 14 both have complexities of $O(n^2)$. Since we iterate over the number of layers, equal to $n$ for our analysis, the total complexity is $O(n^3)$. If we did not use the Backpropagation algorithm, we would end up redundantly computing each component’s derivatives multiple times. Using the previous assumptions setting the cardinality of $|x|$, the number of nodes, and the number of layers to $n$, we would be iterating over $O(n^3 + n^2)$ weight derivatives resulting in a total complexity of $O(n^6)$.

The last step in optimizing our neural network model to minimize our loss function is to update weights and biases. Equations 9-10 show one step of gradient descent using the derivatives from our backpropagation procedure in Algorithm 2. Note that Equations 9-10 represent only one step of gradient descent, so one would need to iteratively perform the forward pass, then backward pass, then parameter updates (Equations 9-10) until the model converges to a local minimum, as described in our previous post on gradient descent.

\begin{align}
w^{(l)} \to w^{(l)} - \alpha \Delta w^{(l)}
\end{align}
\begin{align}
b^{(l)} \to b^{(l)} - \alpha \Delta b^{(l)}
\end{align}

We now have the ability to train neural networks in a computationally efficient manner, which, in turn, allows us to do more training iterations per unit time, which should help us train more accurate neural networks!
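
For concreteness, the sketch below runs one forward pass, one backward pass, and one parameter update for the squared-error loss, reusing the NumPy conventions from the forward-propagation sketch above. It mirrors the structure of Algorithm 2 (an error term $\delta$ propagated backwards while reusing cached activations), but the exact bookkeeping here is ours rather than a line-by-line transcription of the figure.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def relu_grad(z):
    return (z >= 0).astype(float)

def backprop_step(x, y, weights, biases, alpha=0.01):
    """One forward pass, backward pass, and gradient-descent update for
    L = 1/2 (y - y_hat)^2, with a linear activation at the output layer."""
    # Forward pass, caching pre-activations z and activations o per layer.
    o, zs, os = x, [], [x]
    for l, (W, b) in enumerate(zip(weights, biases)):
        z = W @ o + b
        o = z if l == len(weights) - 1 else relu(z)
        zs.append(z)
        os.append(o)
    y_hat = o

    # Backward pass: delta starts as dL/dz at the (linear) output layer.
    delta = y_hat - y
    for l in reversed(range(len(weights))):
        grad_W = np.outer(delta, os[l])               # dL/dW^(l+1)
        grad_b = delta                                # dL/db^(l+1)
        if l > 0:                                     # push delta back a layer
            delta = (weights[l].T @ delta) * relu_grad(zs[l - 1])
        weights[l] -= alpha * grad_W                  # gradient-descent update
        biases[l]  -= alpha * grad_b
    return float(0.5 * np.sum((y - y_hat) ** 2))      # loss before the update
```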

Key Points throughout this Blog:

  1. Neural networks can be used to represent almost any function.
  2. Forward propagation is the process of taking input features, $x$, and computing an output in the form of a model’s estimate, $\hat{y}$.
  3. We can use the difference in the output, $\hat{y}$, of the neural network and our ground-truth data, $y$, to compute a loss, and then use this loss value to compute gradients in the backpropagation algorithm.
  4. Backpropagation is a special case of reverse-mode automatic differentiation and allows for efficient gradient computation. The result is that the training of extremely large neural networks is possible!

Filed Under: Bootcamp

Bootcamp Summer 2020 Week 4 – Policy Iteration and Policy Gradient

December 16, 2020 by Manisha Natarajan and Matthew Gombolay

In our previous blog post on Value Iteration & Q-learning, we introduced the Markov Decision Process (MDP) as a helpful model for describing the world, and we described how a robot could apply Reinforcement Learning (RL) techniques to learn a policy, $\pi: S \rightarrow A$, that maps the state of the world, $s \in S$, to the action the robot should take, $a \in A$. The goal of RL is to learn the optimal policy, $\pi^*$, that maximizes the expected, future, discounted reward (or return), $V^{\pi}(s)$, the agent receives when starting in state, $s$, and following policy, $\pi$ as given by the equation below.

\begin{align}
\pi^* = \text{arg}\max\limits_{\pi \in \Pi} V^{\pi}(s) = \text{arg}\max\limits_{\pi \in \Pi} \sum\limits_{s'}T(s,\pi(s),s')[R(s,\pi(s),s') + \gamma V^{\pi}(s')]
\end{align}

We note that, in the tabular setting (e.g., for value and policy iteration, where one uses a look-up table to store information), the policy is deterministic (i.e., $\pi: S \rightarrow A$). However, when we move to deep learning-based representations of the policy (see the section on Policy Gradient below), the policy typically represents a probability distribution over actions that the robot samples from when acting in the world (i.e., $\pi: S \rightarrow [0,1]^{|A|}$).

To find the optimal policy, we described two approaches in our previous post (i.e., Value Iteration & Q-learning). Value iteration computes the optimal value function, $V^{*}(s)$, from which one can find the optimal policy given by $\pi^*(s) = \text{arg}\max\limits_{a \in A} \sum\limits_{s'}T(s,a,s')[R(s,a,s') + \gamma V^{*}(s')]$. We also introduced Q-learning, in which one can extract the optimal policy by $\pi^*(s) = \text{arg}\max\limits_{a \in A} Q^*(s,a)$, where $Q^*$ is the optimal Q-function. Finally, we extended Q-learning to Deep Q-learning, where the Q-function is represented as a neural network rather than storing the Q-values for each state-action pair in a look-up table.

In this blog post, we will follow a similar procedure for two new concepts: (1) Policy Iteration and (2) Policy Gradients (and REINFORCE). Like value iteration, policy iteration is a tabular method for reinforcement learning. Similar to Deep Q-learning, policy gradients are a function approximation-based RL method.

Policy Iteration

The pseudocode for policy iteration is given in Figure 1. At a high level, this algorithm first initializes a value function, $V(s)$, and a policy, $\pi(s)$, which are almost surely incorrect. That’s okay! The next two steps, (2) Policy Evaluation and (3) Policy Improvement, work by iteratively correcting the value function given the policy, correcting the policy given the value function, correcting the value function given the policy, and so forth. Let’s break down the policy evaluation and policy improvement steps.

Policy Iteration Algorithm.

Policy Evaluation

Policy evaluation involves computing the value function, $V^{\pi}(s)$, which we know how to do from our previous lesson on value iteration.
\begin{align}
V^\pi(s) = \sum\limits_{s'}T(s,\pi(s),s')[R(s,\pi(s),s') + \gamma V^{\pi}(s')]
\end{align}

In policy iteration, the initial value function is chosen arbitrarily (e.g., all zeros, except at the terminal state, whose value is the episode reward), and each successive approximation is computed using the following update rule:
\begin{align}
V_{k+1}(s) \: = \: \sum\limits_{s'}T(s,\pi(s),s')[R(s,\pi(s),s') + \gamma V_{k}(s')]
\label{Bellman}
\end{align}

We keep updating the value function for the current policy using Equation \ref{Bellman} until it converges (i.e., no state value is updated by more than $\Delta$ during the previous iteration).

A key benefit of the policy evaluation step is that the DO-WHILE loop can be solved exactly as a linear program (LP)! We do not actually need to use dynamic programming to iteratively estimate the value function! Linear programs are incredibly fast, enabling us to quickly and efficiently find the value of our current policy. We contrast this ability with Value Iteration, in which a $\max$ operator is required in the equation $V(s) = \max\limits_{a \in A} \sum\limits_{s'}T(s,a,s')[R(s,a,s') + \gamma V(s')]$, which is nonlinear. Thus, we have identified one nice benefit of policy iteration already!
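
As a concrete companion, here is a minimal sketch of the iterative (dynamic-programming) version of policy evaluation, using the same nested-dictionary MDP layout as in the Week 3 post; the exact LP solution mentioned above is not shown here.

```python
def policy_evaluation(states, T, R, pi, gamma=0.95, delta_0=1e-6):
    """Iteratively evaluate a deterministic policy pi: state -> action.
    T[s][a][s'] and R[s][a][s'] are nested dictionaries, as in the Week 3 sketches."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            a = pi[s]  # the action dictated by the current policy
            v_new = sum(T[s][a][s2] * (R[s][a][s2] + gamma * V[s2])
                        for s2 in T[s][a])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < delta_0:
            break
    return V
```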

Policy Improvement

In the policy evaluation step, we determine the value of our current policy. With this value, we can then improve our policy.

Suppose we have determined the value function $V^{\pi}$ for a suboptimal, deterministic policy, $\pi$. By definition, then, the value of taking the optimal action in a given state, $s$, would be at least as good, if not better, than the value of taking the action dictated by $\pi$. This inequality is given by:

\begin{align}
V^{\pi}(s) = \sum\limits_{s'}T(s,\pi(s),s')[R(s,\pi(s),s') + \gamma V^{\pi} (s')] \leq \max\limits_{a \in A} \sum\limits_{s'}T(s,a,s')[R(s,a,s') + \gamma V^{\pi} (s')], \forall s \in S
\end{align}

Thus, we should be able to find a better policy, $\pi'(s)$, by simply choosing the best action as given by the value function, $V^{\pi}(s)$, for our suboptimal policy, $\pi$, as given by:

\begin{align}
\pi'(s)  \leftarrow \text{arg}\max\limits_{a \in A} \sum\limits_{s’} T(s’,a,s) [R(s,a,s’) + \gamma V(s’)]
\end{align}

Thus, we can replace $\pi$ with $\pi’$, and we are done improving our policy until we have a new value function estimate (i.e., by repeating the Policy Evaluation Step). The process of repeatedly computing the value function and improving policies starting from an initial policy $\pi$ (until convergence) describes policy iteration.
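
Combining the two steps, a minimal sketch of the full policy-iteration loop is shown below; it reuses the `policy_evaluation()` sketch above and alternates evaluation and greedy improvement until the policy stops changing.

```python
def policy_iteration(states, actions, T, R, gamma=0.95):
    """Alternate policy evaluation and greedy policy improvement."""
    pi = {s: actions[0] for s in states}       # arbitrary initial policy
    while True:
        V = policy_evaluation(states, T, R, pi, gamma)   # evaluation step
        stable = True
        for s in states:                        # improvement step
            best_a = max(actions,
                         key=lambda a: sum(T[s][a][s2] * (R[s][a][s2] + gamma * V[s2])
                                           for s2 in T[s][a]))
            if best_a != pi[s]:
                pi[s] = best_a
                stable = False
        if stable:                              # policy unchanged -> converged
            return pi, V
```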

Just like how Value Iteration & Q-learning have their deep learning counterparts, so does our Policy Iteration algorithm.

Policy Gradient

As explained above, RL seeks to find a policy, $\pi$, that maximizes the expected, discounted, future reward given by following the actions dictated by policy, $\pi(s)$ in each state, $s$. In Policy Iteration, we first compute the value, $V^{\pi}(s)$, of each state and use these value estimates to improve the policy, $\pi$. Let $\theta$ denote the policy parameters of a neural network. We denote this policy as $\pi_{\theta}(s)$. Unlike the policy iteration method, policy gradient methods learn a parameterized policy (commonly parameterized by a neural network) to choose actions without having to rely on a table to store state-action pairs (and, arguably, without having to rely on an explicit value function; however, that depends on your formulation, which we will return to later).

With policy gradients, we seek to find the parameters, $\theta$, that maximize the expected future reward. Hence, policy gradient methods can be formulated as a maximization problem with the objective function being the expected future reward, as depicted in Equation \ref{PG_obj}, where $V^{\pi_\theta}$ is the value function for the policy parameterized by $\theta$.
\begin{align}
J(\theta) \: =  \mathbb{E}_{s \sim \rho(\cdot)} V^{\pi_\theta}(s) = \mathbb{E}_{s \sim \rho(\cdot), a \sim \pi(s)} \left[\sum\limits_{t=0}^\infty \gamma^t r_{t} | s_0 = s\right]
\label{PG_obj}
\end{align}

To maximize $J(\theta)$, we perform gradient ascent as shown in Equation \ref{gradient ascent}. For more details on solving optimization problems with gradient-based approaches, please see the blog post on Gradient Descent.
\begin{align}
\theta_{k+1} = \theta_k + \alpha \nabla_\theta J(\theta)
\label{gradient ascent}
\end{align}

In the case of policy gradient methods, the action probabilities change smoothly as a function of the learned parameters, $\theta$, whereas in tabular methods (e.g., Policy Iteration), the actions may change drastically even for small changes in action value function estimates. Thus, policy gradient-based methods may be more sample-efficient when this assumption holds by essentially interpolating the right action when in a new state (or interpolating the right Q-value in the case of Q-learning). However, if this smoothness property does not hold, neural network function approximation-based RL methods, e.g., Deep Q-learning or Policy Gradients, will struggle. Nonetheless, this smoothness assumption does commonly hold — at least enough — that deep learning-based methods are the mainstay of modern RL.

The Policy Gradient Theorem

(This derivation is adapted from Reinforcement Learning by Sutton & Barto)

We will now compute the gradient of the objective function $J(\theta)$ with respect to the policy parameter $\theta$. Henceforth, we will assume that the discount factor, $\gamma=1$, and $\pi$ in the derivation below represents a policy parameterized by $\theta$.
\begin{align}
\begin{split}
\nabla_\theta J(\theta) &= \nabla_\theta(V_{\pi_\theta}) \\
&= \nabla_\theta \left[\sum\limits_a \pi_\theta(a|s) Q_{\pi}(s,a)\right] \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \text{(writing $V_{\pi_\theta}$ in terms of $Q_{\pi}$)}\\
&= \sum\limits_a \Bigg[\nabla_\theta \pi_\theta(a|s) Q_{\pi}(s,a) + \pi_\theta(a|s) \nabla_\theta Q_{\pi}(s,a) \Bigg] \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \text{(by product rule)}\\
&= \sum\limits_a \Bigg[\nabla_\theta \pi_\theta(a|s) Q_{\pi}(s,a) + \pi_\theta(a|s) \nabla_\theta \left[\sum\limits_{s’} T(s,a,s’)[R(s,a,s’) + V_{\pi}(s’)] \right]\Bigg]\\
&= \sum\limits_a \Bigg[\nabla_\theta \pi_\theta(a|s) Q_{\pi}(s,a) + \pi_\theta(a|s) \sum\limits_{s’} T(s,a,s’)\nabla_\theta V_{\pi}(s’) \Bigg]\\
\text{Unrolling $V_{\pi}(s’)$, we have… }\\
&= \sum\limits_a \Bigg[\nabla_\theta \pi_\theta(a|s) Q_{\pi}(s,a) + \pi_\theta(a|s) \sum\limits_{s’} T(s,a,s’) \sum\limits_{a’} \Bigg[\nabla_\theta \pi(a’|s’)Q_{\pi}(s’,a’) + \pi(a’|s’)\sum\limits_{s'{}’} T(s’,a’,s'{}’) \nabla_\theta V_{\pi}(s'{}’) \Bigg] \Bigg]\\
\end{split}
\end{align}
We can continue to unroll $V_{\pi}(s'{}')$ and so on. Unrolling ad infinitum, we can see:
\begin{align}
\nabla_\theta J(\theta) \: = \: \sum\limits_{s \in \mathcal{S}} \sum\limits_{k=0}^{\infty}Pr(s_0 \rightarrow s, k, \pi)\sum\limits_a \nabla_\theta \pi_\theta(a|s) Q_\pi(s,a)
\label{PG_1}
\end{align}
Here, $Pr(s_0 \rightarrow s, k, \pi)$ is the probability of transitioning from state $s_0$ to $s$ in $k$ steps while following the policy $\pi$. Rewriting $Pr(s_0 \rightarrow s, k, \pi)$ as $\eta(s)$ in Equation \ref{PG_1}, we have:
\begin{align}
\begin{split}
\nabla_\theta J(\theta) \: &= \: \sum\limits_{s} \eta(s)\sum\limits_a \nabla_\theta \pi_\theta(a|s) Q_\pi(s,a)\\
&= \: \sum\limits_{s’} \eta(s’) \sum\limits_{s} \frac{\eta(s)}{\sum\limits_{s’}\eta(s’)} \sum\limits_a \nabla_\theta \pi_\theta(a|s) Q_\pi(s,a)\\
&= \: \sum\limits_{s’} \eta(s’) \sum\limits_{s} \mu(s) \sum\limits_a \nabla_\theta \pi_\theta(a|s) Q_\pi(s,a)\\
&\propto \: \sum\limits_{s} \mu(s) \sum\limits_a \nabla_\theta \pi_\theta(a|s) Q_\pi(s,a)\\
\end{split}
\label{PG_final}
\end{align}

We note that we set $\mu(s) = \frac{\eta(s)}{\sum\limits_{s’}\eta(s’)}$ for convenience. Now, one might ask, what about the derivation for $\gamma \neq 1$? Well, that is a great question! Candidly, we have not seen a clean derivation when $\gamma \in (0,1)$, and we would welcome any readers out there to email us should such a derivation exist that we could link to!

REINFORCE – Monte Carlo Policy Gradient

From Equation \ref{PG_final}, we find an expression proportional to the gradient. Taking a closer look at the right-hand side of Equation \ref{PG_final}, we note that it is a summation over all states, weighted by how often these states are encountered under policy $\pi$. Thus, we can re-write Equation \ref{PG_final} as an expectation:
\begin{align}
\begin{split}
\nabla_\theta J(\theta) \: &\propto \: \sum\limits_{s} \mu(s) \sum\limits_a \nabla_\theta \pi_\theta(a|s) Q_\pi(s,a)\\
&= \: \mathbb{E}_{s \sim \mu(\cdot)}\left[\sum\limits_a \nabla_\theta \pi_\theta(a|s) Q_\pi(s,a) \right] \\
\end{split}
\label{PG_grad}
\end{align}
We modify the gradient expression in Equation \ref{PG_grad} by (1) introducing an additional weighting factor $\pi_\theta(a|s)$ and dividing by the same without changing the equality, and (2) sampling an action from the distribution instead of summing over all actions. The modified update rule for the policy gradient algorithm is shown in Equation \ref{REINFORCE}.
\begin{align}
\begin{split}
\nabla_\theta J(\theta)\: &= \: \mathbb{E}_{s \sim \mu(\cdot)}\left[\sum\limits_a \pi_\theta(a|s) Q_\pi(s,a) \frac{\nabla_\theta \pi(a|s)}{\pi(a|s)} \right] \\
&= \: \mathbb{E}_{s \sim \mu(\cdot), a\sim \pi(\cdot|s)}\left[Q_\pi(s,a) \frac{\nabla_\theta \pi_\theta(a|s)}{\pi_\theta(a|s)} \right] \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \text{($a$ here is sampled from $\pi$)}\\
&= \: \mathbb{E}_{s \sim \mu(\cdot), a\sim \pi(\cdot|s)}\left[Q_\pi(s,a) \nabla_\theta \: \text{log}\: \: \pi_\theta(a|s) \right] \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \left(\nabla_\theta \: \text{log}\: \: \pi_\theta(a|s) = \frac{\nabla_\theta \pi_\theta(a|s)}{\pi_\theta(a|s)} \right)\\
&= \: \mathbb{E}_{s \sim \mu(\cdot), a\sim \pi(\cdot|s)}\left[G_t \nabla_\theta \: \text{log}\: \: \pi_\theta(a|s) \right] \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \; \text{($G_t$ is the return, and $\mathbb{E}_{\pi}[G_t|S_t=s, A_t=a] = Q_\pi(s,a)$)}\\
\end{split}
\label{REINFORCE}
\end{align}
Including the logarithm creates a weighting factor based on the probability of occurrence of different state-action pairs, allowing us to leverage the expected gradient update over actions without numerically estimating the expectation! The final expression within the expectation in Equation \ref{REINFORCE} is the quantity that can be sampled at each timestep to update the gradient. The REINFORCE update using gradient ascent is described in Equation \ref{grad_update}.
\begin{align}
\theta_{t+1} \: \dot{=} \: \theta_t + \alpha G_t \: \nabla_{\theta} \: \text{log}\: \: \pi_{\theta_t}(a|s)
\label{grad_update}
\end{align}

Fig. 2 shows the complete REINFORCE algorithm.

REINFORCE Algorithm.
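
For readers who want to see the update in code, here is a minimal REINFORCE sketch. We assume PyTorch and an environment exposing `reset()` and `step(a)` that returns `(next_state, reward, done)`; those choices, the network sizes, and the hyperparameters are illustrative assumptions rather than part of the derivation above.

```python
import torch
import torch.nn as nn

state_dim, n_actions, gamma = 4, 2, 0.99       # made-up sizes
policy = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh(),
                       nn.Linear(64, n_actions))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def reinforce_episode(env):
    """Sample one trajectory with pi_theta, then take one gradient step on the
    (negated) REINFORCE objective sum_t G_t * log pi_theta(a_t | s_t)."""
    log_probs, rewards = [], []
    s, done = env.reset(), False
    while not done:
        logits = policy(torch.as_tensor(s, dtype=torch.float32))
        dist = torch.distributions.Categorical(logits=logits)
        a = dist.sample()
        log_probs.append(dist.log_prob(a))
        s, r, done = env.step(a.item())
        rewards.append(r)

    # Compute the return G_t for every timestep of the trajectory.
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    returns.reverse()
    returns = torch.as_tensor(returns, dtype=torch.float32)

    # Gradient ascent on J(theta) == gradient descent on -J(theta).
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return sum(rewards)
```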

Returning to our aside from earlier about whether policy gradients rely on an estimate of the value function, our derivation here, as depicted in Fig. 2, does not rely on the value function. However, as our derivation alluded to, one could use the value function for $G_t = V^{\pi}(s)$. One could also use the Q-function, $G_t = Q^{\pi}(s,a)$. When setting $G_t$ equal to the value or Q-function, we refer to the update as an actor-critic (AC) method. When we use the advantage function formulation, $G_t = Q^{\pi}(s,a) - V^{\pi}(s)$, the update is known as an advantage actor-critic (A2C) method. For AC and A2C, we need neural network function approximators to estimate the Q- and/or value functions. While adding on and learning from these additional neural networks adds computational complexity, AC and A2C often work better in practice than REINFORCE. However, that is a point that we will leave for a future blog!

Takeaways

  • A policy is a mapping from states to probabilities of selecting every possible action in that state. A policy $\pi^*$ is said to be optimal if its expected return is greater than or equal to any other policy $\pi$ for all states.
  • Policy Iteration is a non-parametric (i.e., tabular, exact, not deep learning-based) approach to compute the optimal policy. Policy iteration alternates between policy evaluation (computing the value function given a policy) and policy improvement (given a value function) to converge to the optimal policy.
  • Policy gradients are a powerful, deep learning-based approach to learning an optimal policy with neural network function approximation for the policy.

References

Richard S. Sutton and Andrew G. Barto. 2018. Reinforcement Learning: An Introduction. A Bradford Book, Cambridge, MA, USA.

Filed Under: Bootcamp

Joint Goal and Strategy Inference across Heterogeneous Demonstrators via Reward Network Distillation

August 16, 2020 by Zac Chen, Rohan Paleja, Leng and Matthew Gombolay

“Learning from Demonstration” (LfD) techniques aim to enable robots to be taught by non-roboticists how to perform tasks, such as folding laundry, driving a car, or performing surgery. For decades, people have relied on the incorrect assumption that every teacher performs tasks similarly, i.e., that everyone folds laundry or performs surgery in a similar fashion. However, this assumption is often not true. Humans (and even experts) typically adopt heuristics (i.e., “mental shortcuts”) to solve challenging optimization problems due to a human’s limited working memory. These highly refined strategies produce heterogeneity in behavior across human task demonstrators. Only recently have researchers sought to relax this assumption of demonstrator homogeneity. However, even these new approaches do not explicitly tease out precisely what was similar in all the demonstrations versus what was unique to each teacher. Rather, some previous methods divide the demonstrations into clusters beforehand and perform LfD on the homogeneous demonstrations within each cluster. However, this limits the amount of data accessible to the LfD algorithm, resulting in worse learning.

We take a new perspective in LfD by modeling every human teacher as having a shared notion of the desired task (a common goal) along with a teacher-specific notion (e.g., strategy or style) for performing the task. Accordingly, we present Multi-Strategy Reward Distillation (MSRD), a novel Inverse Reinforcement Learning (IRL) technique that automatically finds and takes advantage of the common goal shared by all teachers (e.g., performing the task well) while teasing out and accounting for what makes each teacher unique! This blog illustrates how MSRD works through five parts:

  1. Preliminaries to Understand MSRD
  2. Method of MSRD
  3. Simulation Performance of MSRD
  4. Table Tennis Real-World Experiment of MSRD
  5. Conclusion

Preliminaries

In this section, we do a quick overview of preliminaries.

What is Reinforcement Learning?

We briefly recap the principles of Markov Decision Processes (MDPs) and Reinforcement Learning (RL) in this section via a robot navigation task example (shown in Figure 1).

Robot Maze Example
This figure depicts a robot navigation task example.



In this example, the task for the robot is to navigate to the star (right bottom corner) starting from the top left corner. Mathematically, we could model the process as an MDP described by a 6-tuple $(S,A,R,T,\gamma,\rho_0)$.

  • $S$ represents the state space. In our example, a state $s\in S$ could be which grid the agent is standing on.
  • $A$ represents the action space. In the example, an action $a\in A$ could be trying to move to one of the four adjacent grid cells.
  • $\gamma\in (0,1)$ is the temporal discount factor, representing the relative preference towards instant reward versus long-term reward.
  • $R(s,a)$ represents the reward after executing action $a$ in the state $s$. In some cases, $R(s,a)$ could be simplified as $R(s)$. In our example, $R(s)$ is $0$ everywhere except for $+1$ for the star position.
  • $T(s,a,s^\prime)$ is the probability of transitioning to state $s^\prime$ from state $s$ after taking action $a$. For example, for the initial state and the action “go right”, the agent will move one step right with probability $1$ if the robot is perfect. But if the robot is imperfect (which is always the case!), it is possible that the robot will instead go down with probability $0.1$.
  • $\rho_0(s)$ is the initial state distribution. In the example case, $\rho_0(s)$ has probability $1$ to be in the top left corner.
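To make this 6-tuple concrete, here is a minimal, purely illustrative sketch of such a grid-world MDP in Python. The grid size and discount value are assumptions for the example; only the $+1$ reward at the star, the top-left start, and the $0.1$ “slip down” probability come from the text above.

```python
import numpy as np

N = 4                                  # assumed grid size (the figure's exact dimensions are not given)
GOAL = (N - 1, N - 1)                  # star at the bottom-right corner
START = (0, 0)                         # rho_0: the agent starts in the top-left corner with probability 1
ACTIONS = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}   # up, down, left, right
GAMMA, SLIP = 0.95, 0.1                # assumed discount factor; probability of slipping downward

def reward(state):
    """R(s): 0 everywhere except +1 at the star position."""
    return 1.0 if state == GOAL else 0.0

def transition(state, action, rng=np.random.default_rng(0)):
    """T(s, a, .): move as intended with probability 1 - SLIP, otherwise slip one cell down."""
    dr, dc = ACTIONS[action] if rng.random() > SLIP else ACTIONS[1]
    r, c = state[0] + dr, state[1] + dc
    return (min(max(r, 0), N - 1), min(max(c, 0), N - 1))   # clip to stay on the grid
```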

The standard goal of reinforcement learning is to find the optimal policy $\pi^*(a|s)$ that maximizes the discounted future reward, shown in Equation \ref{eq:rl_objective}.
\begin{align}
J(\pi)=\mathbb{E}_{\tau\sim \pi}\left[\sum_{t=0}^T{\gamma^t R(s_t,a_t)}\right]
\label{eq:rl_objective}
\end{align}
$\tau=(s_0,a_0,\cdots,s_T,a_T)$ denotes a sequence of states and actions induced by the policy and dynamics: $s_0\sim\rho_0(\cdot)$, $a_t\sim\pi(\cdot |s_t)$, and $s_{t+1}\sim T(s_t,a_t,\cdot)$ for $t\in\{0,1,\cdots,T\}$.

We instead consider a more general maximum-entropy objective [Ziebart 2008], which augments the standard objective with an entropy bonus to favor stochastic policies and encourage exploration during optimization, shown in Equation \ref{eq:rl_me_objective}.
\begin{align}
J(\pi)=\mathbb{E}_{\tau\sim \pi}\left[\sum_{t=0}^T{\gamma^t\left(R(s_t,a_t)+\alpha H(\pi(\cdot | s_t))\right)}\right]
\label{eq:rl_me_objective}
\end{align}
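As a small, hedged illustration of how the entropy bonus enters the objective, the sketch below folds $\alpha H(\pi(\cdot|s_t))$ into the discounted return; the discount and temperature values are placeholders, not taken from this post.

```python
def entropy_augmented_return(rewards, entropies, gamma=0.99, alpha=0.1):
    """Discounted return with the per-step entropy bonus alpha * H(pi(.|s_t)) added to each reward."""
    G = 0.0
    for r, H in zip(reversed(rewards), reversed(entropies)):
        G = (r + alpha * H) + gamma * G
    return G

# e.g., entropy_augmented_return(rewards=[0.0, 0.0, 1.0], entropies=[0.69, 0.50, 0.10])
# The per-step entropies could come from the policy distribution (e.g., a Categorical's .entropy()).
```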

What is Inverse Reinforcement Learning?

Inverse Reinforcement Learning (IRL) considers an MDP sans reward function ($M\backslash R$), with the goal of inferring the reward function $R(s,a)$ given a set of demonstrated trajectories $\mathcal{D}=\{\tau_1, \tau_2, \cdots, \tau_N\}$. For example, in Figure 1, we draw two demonstrated trajectories, $\tau_1$ and $\tau_2$. A typical assumption for IRL is that demonstrated trajectories are optimal, or at least near-optimal or stochastically optimal. In the case of our example ($R(s)$ is $0$ everywhere except for $+1$ at the star position), both trajectories are optimal, i.e., they obtain the $+1$ reward in the least time.

Adversarial Inverse Reinforcement Learning (AIRL) casts the problem of LfD in a generative adversarial framework, learning a discriminator, $D$, that distinguishes between experts and a deceiving generator, $\pi$, that learns to imitate the expert. This framework follows Maximum-Entropy IRL’s assumption that trajectory likelihood is proportional to the exponential of trajectory reward, i.e., $p(\tau)\propto \exp\left(\sum_{t=1}^T{\gamma^t r_t}\right)$. The discriminator $D$ is defined in Equation \ref{eqn:D_and_f}, where $f_\theta$ is the learnable reward function and $\pi$ is the current policy. $D$ is updated by minimizing its cross-entropy loss for distinguishing expert trajectories from generator policy rollouts (Equation \ref{eqn:airl_loss}). The generator $\pi$ is trained via RL (Equation \ref{eq:rl_me_objective}) to maximize the pseudo-reward given by $f_\theta(s,a)$.
\begin{equation}
D_\theta(s,a)=\frac{\exp(f_\theta(s,a))}{\exp(f_\theta(s,a))+\pi(a|s)}
\label{eqn:D_and_f}
\end{equation}

\begin{align}
L_D&=-\mathbb{E}_{\tau\sim\mathcal{D},\, (s,a)\sim\tau}\left[\log D(s,a)\right]-\mathbb{E}_{\tau\sim\pi,\, (s,a)\sim\tau}\left[\log\left(1-D(s,a)\right)\right]
\label{eqn:airl_loss}
\end{align}
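Below is a rough, hedged sketch of how the discriminator in Equation \ref{eqn:D_and_f} and the loss in Equation \ref{eqn:airl_loss} could be implemented; the network architecture, the float encoding of actions, and the way $\log\pi(a|s)$ is obtained are assumptions, not the AIRL authors' exact code.

```python
import torch
import torch.nn as nn

class AIRLDiscriminator(nn.Module):
    """D(s,a) = exp(f(s,a)) / (exp(f(s,a)) + pi(a|s)), with f a learnable reward network."""

    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.f = nn.Sequential(                      # learnable reward f_theta(s, a)
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, obs, act, log_pi):
        # Return log D(s,a) and log(1 - D(s,a)) in a numerically stable way.
        f = self.f(torch.cat([obs, act], dim=-1)).squeeze(-1)
        denom = torch.logaddexp(f, log_pi)           # log(exp(f) + pi(a|s))
        return f - denom, log_pi - denom

def airl_loss(disc, expert_batch, policy_batch):
    """Cross-entropy loss: expert samples labeled real (D -> 1), policy rollouts fake (D -> 0)."""
    log_D_exp, _ = disc(*expert_batch)       # (obs, act, log_pi) tensors for expert samples
    _, log_1mD_pol = disc(*policy_batch)     # (obs, act, log_pi) tensors for generator samples
    return -(log_D_exp.mean() + log_1mD_pol.mean())
```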

TL;DR: IRL algorithms (including AIRL) learn a reward function that can explain the demonstrated behavior. Further, IRL trains a policy that maximizes the learned reward, which should behave similarly to the demonstrations.

Multi-Strategy Reward Distillation Method

In MSRD, we consider a heterogeneous demonstration dataset, $\mathcal{D}^{(i)}=\{\tau_1^{(i)},\tau_2^{(i)}, \cdots, \tau_M^{(i)}\}$, where $i\in\{1,2,\cdots, N\}$ is the strategy index, $M$ is the number of demonstration trajectories for each strategy, and $N$ is the number of strategies. In our robot navigation example, $\mathcal{D}^{(1)}=\{\tau_1^{(1)}\}$ and $\mathcal{D}^{(2)}=\{\tau_1^{(2)}\}$, i.e., there are two different strategies, and each strategy has one demonstration.

The core assumption of MSRD is fairly straightforward: the reward function each demonstrator is optimizing is a linear combination of the task reward ($R^{(0)}$) and a strategy-only reward ($\widetilde{R}^{(i)}$), as given by Equation \ref{eq:msrd_assumption}. When the strategy-only reward is combined with the task reward, it becomes the strategy-combined reward ($R^{(i)}$), meaning one wants to finish the task while also maintaining the strategy.
\begin{align}
R^{(i)}(\cdot) &= R^{(0)}(\cdot) + \alpha_i \widetilde{R}^{(i)}(\cdot)
\label{eq:msrd_assumption}
\end{align}
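A minimal sketch of this decomposition, assuming one shared task-reward network and one strategy-only network per strategy with no parameter sharing between them (module names, sizes, and the fixed $\alpha_i$ values are illustrative):

```python
import torch
import torch.nn as nn

def reward_mlp(in_dim, hidden=64):
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

class MSRDReward(nn.Module):
    """R^(i)(s,a) = R^(0)(s,a) + alpha_i * R_tilde^(i)(s,a)."""

    def __init__(self, in_dim, n_strategies, alphas):
        super().__init__()
        self.task = reward_mlp(in_dim)                                   # shared R^(0), parameters theta_0
        self.strategy = nn.ModuleList(                                   # R_tilde^(i), parameters theta_i
            [reward_mlp(in_dim) for _ in range(n_strategies)])
        self.alphas = alphas                                             # per-strategy weights alpha_i

    def forward(self, sa, i):
        # sa: concatenated (state, action) features; i: strategy index.
        return self.task(sa) + self.alphas[i] * self.strategy[i](sa)

# Hypothetical construction:
# msrd = MSRDReward(in_dim=obs_dim + act_dim, n_strategies=N, alphas=[0.01] * N)
```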

Our approach is unique in that it shares the task reward across all demonstrations while still being able to tease out demonstrator-specific nuances (i.e., heterogeneity).

msrd_diagram
This diagram shows how MSRD learns.


To infer the shared task reward function, we propose utilizing network distillation to distill common knowledge from each separately learned strategy-combined reward $R^{(i)}$ into the task reward function $R^{(0)}$. The learning diagram is shown in Figure 2. We also want to regularize $R^{(i)}$ to be close to $R^{(0)}$, since we have the prior knowledge that, despite heterogeneous strategic preferences, all experts should still prioritize optimizing the task reward $R^{(0)}$ to achieve at least near-optimal performance. We propose to regularize the expected L2-norm of the difference between the reward functions, as shown in Equation \ref{eqn:l2}, in which $\pi^{(i)}$ is the optimal policy under reward function $R_{\theta_i}^{(i)}$.
\begin{align}
\label{eqn:l2}
L_\text{reg}=\mathbb{E}_{(s,a)\sim \pi^{(i)}}\left({\left\lvert\left\lvert R_{\theta_i}^{(i)}(s,a)-R_{\theta_0}^{(0)}(s,a)\right\rvert\right\rvert}_2\right)
\end{align}
Note that we are using an index both on $\theta$ and $R$ to denote that each strategy-combined reward $R^{(i)}$ has its own reward parameters, and that these are approximated by separate neural networks with parameters $\theta_i$ for each strategy and $\theta_0$ for the task reward. There is no parameter sharing between strategy and task reward.

To avoid the computational cost of fully optimizing $\pi^{(i)}$ and $R^{(i)}$ at each step, we apply an iterative reward-function and policy training schedule similar to AIRL. Combining AIRL’s objective (Equation \ref{eqn:airl_loss}) with the distillation objective, we want to maximize $L_D$ in Equation \ref{eqn:vanilla_distillation}.
\begin{align}
\label{eqn:vanilla_distillation}
L_D=\sum_{i=1}^N&\left[{\mathbb{E}_{(s,a)\sim \tau_j^{(i)}\sim \mathcal{D}^{(i)}}{\log D_{\theta_i}(s,a)}}
+ {\mathbb{E}_{(s,a)\sim \pi^{(i)}}{\log (1-D_{\theta_i}(s,a))}}
- {\mathbb{E}_{(s,a)}\left({\left\lvert\left\lvert R_{\theta_i}^{(i)}(s,a)-R_{\theta_0}^{(0)}(s,a)\right\rvert\right\rvert}_2\right)}\right]
\end{align}
Yet, while Equation \ref{eqn:vanilla_distillation} should be able to distill the shared reward into $R_{\theta_0}^{(0)}$, the distillation is inefficient: $R_{\theta_0}^{(0)}$ acts as a strong regularizer on $R_{\theta_i}^{(i)}$ before the distillation has succeeded.

Instead, our structure in Equation \ref{eq:msrd_assumption} allows for a two-column (reference) re-parameterization, speeding up knowledge transfer and making the learning process easier. Combining Equation \ref{eq:msrd_assumption} and Equation \ref{eqn:vanilla_distillation}, we arrive at Equation \ref{eqn:reparameterized_distillation}.

\begin{align}
\label{eqn:reparameterized_distillation}
L_D&=\sum_{i=1}^N\left[{\mathbb{E}_{(s,a)\sim \tau_j^{(i)}\sim \mathcal{D}^{(i)}}{\log D_{\theta_i,\theta_0}(s,a)}}
+ {\mathbb{E}_{(s,a)\sim \pi^{(i)}}{\log \left(1-D_{\theta_i,\theta_0}(s,a)\right)}}
- \alpha_i{\mathbb{E}_{(s,a)}\left({\left\lvert\left\lvert \widetilde{R}_{\theta_i}^{(i)}(s,a)\right\rvert\right\rvert}_2\right)}\right]
\end{align}

The key difference between Equation \ref{eqn:reparameterized_distillation} and Equation \ref{eqn:vanilla_distillation} is that $D$ now depends on both $R_{\theta_0}^{(0)}$ and $\widetilde{R}_{\theta_i}^{(i)}$ instead of a separate $R_{\theta_i}^{(i)}$. Thus, $R_{\theta_0}^{(0)}$ is updated directly from the discriminator’s loss rather than waiting for knowledge to be learned by a strategy-combined reward and subsequently distilled into the task reward.

Further, the last term of Equation \ref{eqn:reparameterized_distillation} reduces to a simple L2 regularization on the strategy-only reward’s output, weighted by $\alpha_i$. This formulation provides a new way to interpret the relative weight $\alpha_i$ of the strategy-only reward: the larger $\alpha_i$ is, the more the strategy-only reward influences the strategy-combined reward, and therefore the stronger the regularization we apply to keep it from overwhelming the task reward function. Comparing Equations \ref{eqn:reparameterized_distillation} and \ref{eqn:airl_loss}, we can also interpret MSRD from another view: optimizing $\theta_i$ via the IRL objective alone results in a combination of task and strategy reward, and adding regularization on the strategy-only reward encourages the model to encode only strategy-specific information in $\theta_i$ and to share as much knowledge as possible in $\theta_0$.
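Putting the pieces together, here is a hedged sketch of the re-parameterized objective in Equation \ref{eqn:reparameterized_distillation}. It assumes the decomposed reward module sketched earlier, treats the L2 norm of a scalar reward as its absolute value, and is an illustration of the structure rather than the released implementation.

```python
import torch

def log_D_terms(f, log_pi):
    """AIRL-style discriminator: log D = f - log(e^f + pi), log(1 - D) = log pi - log(e^f + pi)."""
    denom = torch.logaddexp(f, log_pi)
    return f - denom, log_pi - denom

def msrd_discriminator_objective(msrd_reward, expert_batches, policy_batches, alphas):
    """L_D = sum_i E_expert[log D_i] + E_policy[log(1 - D_i)] - alpha_i * E[|R_tilde_i(s,a)|]."""
    total = torch.zeros(())
    for i, ((sa_e, log_pi_e), (sa_p, log_pi_p)) in enumerate(zip(expert_batches, policy_batches)):
        f_e = msrd_reward(sa_e, i).squeeze(-1)          # strategy-combined reward on expert samples
        f_p = msrd_reward(sa_p, i).squeeze(-1)          # strategy-combined reward on policy samples
        log_D_e, _ = log_D_terms(f_e, log_pi_e)
        _, log_1mD_p = log_D_terms(f_p, log_pi_p)
        strat_only = msrd_reward.strategy[i](sa_p).abs()    # regularize the strategy-only head
        total = total + log_D_e.mean() + log_1mD_p.mean() - alphas[i] * strat_only.mean()
    return -total    # negate so a standard (minimizing) optimizer maximizes L_D
```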

Simulation Results

We tested MSRD on two simulated environments: inverted pendulum and hopper. The goal of the inverted pendulum task is to balance a pendulum on a cart by moving the cart left and right. The reward for the inverted pendulum is defined as the negative absolute value of the pendulum angle from the upright position. The objective of hopper is to control its 3-DoF joints to move forward, and its reward is defined as its forward speed. We removed the environments’ termination conditions to allow a wider range of behaviors.

We first generated a variety of demonstrations to emulate the heterogeneous strategies that humans may apply when solving problems. For our virtual experiments, we generated these demonstrations via “DIAYN + Extrinsic Reward” and a novel method called “KL-Encouraged + Extrinsic Reward”; both methods encourage the trained policies to differ from each other while still accomplishing the task goal. We generated 20 different strategies for Inverted Pendulum and the 2 most distinct strategies for Hopper, shown in Figure 3 and Figure 4.

inverted_pendulum_skill_17, inverted_pendulum_skill_6, inverted_pendulum_skill_4, inverted_pendulum_skill_3
These GIFs show different strategies for Inverted Pendulum.

hopper_hop, hopper_crawl
These GIFs show different strategies for Hopper.

Hypotheses

We explore two hypotheses to elucidate MSRD’s advantage on recovering both the latent task reward (the essential goal of the demonstrators) and the means by which the task is accomplished (i.e., the strategy):

H1: The task reward learned by MSRD has a higher correlation with the true task reward than the reward learned by AIRL.
H2: The strategy-only rewards learned by MSRD have a higher correlation with the true strategic preferences than the rewards learned by AIRL.

We assessed both hypotheses quantitatively and qualitatively for the simulation environments as the ground-truth reward functions are available.

To test H1, we constructed a dataset of trajectories with varied task performance. We then evaluated the reward functions learned by AIRL and MSRD on these trajectories, comparing estimated versus ground-truth rewards. We show the correlation of estimated rewards with ground-truth task rewards in Figure 5. The task reward function learned through MSRD has a higher correlation with the ground-truth reward function (0.99 and 0.94) than AIRL (0.51 and 0.89) for each domain, respectively. AIRL’s reward function overfits to some strategies and mixes the task reward with those strategies’ strategy-only rewards, making its estimates unreliable for trajectories from other strategies.

msrd_task_reward_correlation_vs_airl
This figure shows the correlation between ground-truth and estimated task reward, normalized to $[0,1]$, for inverted pendulum (left) and hopper (right). $r$ is the correlation coefficient.

To test H2, we calculated the correlations of MSRD’s strategy-only rewards with the true strategic preferences and compared them with the correlations of AIRL’s rewards when AIRL is trained on each individual strategy. In the simulated domains, the true strategic preferences are available. Correlations of both methods for all strategy rewards in inverted pendulum are shown in Figure 6. A paired t-test shows that MSRD achieves a statistically significantly higher correlation (M = 0.779, SD = 0.239) for the strategy rewards than AIRL trained separately on each strategy (M = 0.635, SD = 0.324), $t(19) = 1.813$, $p = 0.0428$ (one-tailed). A Shapiro-Wilk test showed the residuals were normally distributed ($p = 0.877$). For the hopper domain, MSRD achieved correlation coefficients of $0.85$ and $0.93$ for the hop and crawl strategies, compared with AIRL’s $0.80$ and $0.82$, respectively.

msrd_strategy_reward_comparison_vs_airl
This figure shows the correlation between ground-truth and estimated strategy reward on Inverted Pendulum.

We further show qualitative results for the learned strategy rewards. For Inverted Pendulum, we fix three state dimensions (cart position, cart velocity, and pendulum angular velocity) to zero and investigate how the reward changes along the one remaining dimension (pendulum angle). The relationship between reward and pendulum angle for the task and strategy reward functions is illustrated in Figure 7, in which the task reward function reaches its peak when the angle of the pendulum is near zero. This precisely recovers the task reward. For the strategy-only reward functions, strategy 13 encourages the pendulum to lean left, while strategy 4 encourages the policy to tilt the pendulum right. Therefore, the strategy-only rewards learned by MSRD capture specific preferences within the demonstrations. Figure 7 also shows that the magnitude of the task reward is larger than that of the strategy reward, which affirms our expectation that the emphasis is on accomplishing the task.

inverted_pendulum_reward_illustration
This figure illustrates the learned strategy reward on Inverted Pendulum.

In the hopper environment, it is harder to visualize the reward landscape due to the high-dimensional observation space and the lack of interpretability of individual state dimensions. Therefore, instead of visualizing a reward curve, we evaluate the estimated strategy-only rewards on trajectories from both strategies. Figure 8 shows that, when given a hopping trajectory, the hop strategy-only reward function gives a higher reward for that behavior than the crawl strategy-only reward function. Similarly, for a crawling trajectory, the crawl strategy-only reward gives a higher value than the hop strategy-only reward. Therefore, the strategy-only reward function recovered by MSRD gives a higher reward to its corresponding behavior than the other strategy-only reward function does, thus encouraging the policy towards the intended behavior.

hopper_strategy_reward_illustration
This figure shows the strategy rewards evaluated on different strategy demonstrations on Hopper.

These results across our simulated environments show our algorithm’s success in both task reward recovery and strategy reward decomposition. This capability is a novel contribution to the field of LfD: we are able to tease out strategies from the underlying task and effectively learn policies that reproduce both the strategy and strong performance on the underlying task.

Robot Table Tennis Results

We tested MSRD on its capability to learn various table tennis strokes from expert human demonstrations. Participants were asked to kinesthetically teach a robot arm to hit an incoming ping pong ball using three different stroke strategies: push, slice, and topspin. The change in paddle angle and the upward/downward motion of the paddle throughout the trajectory are the key factors that distinguish the strategies (as these induce spin). Push involves only a small change in angle, as it does not attempt to add any spin to the ball. Slice places backspin on the ball, so the paddle angle quickly tilts up. Conversely, to achieve topspin on the ball, the associated trajectory has a quick upward motion.

After just 30 minutes of training on each strategy, the robot was able to learn to strike $83\%$ of the fed balls, as shown in Figure 9. The robot learned to perform all three strategies, and the robot’s best strategy, topspin, resulted in $90\%$ of balls being returned successfully.

push_1, push_2
slice_1, slice_2
topspin_1, topspin_2
These GIFs show the three table tennis stroke strategies (from top to bottom: push, slice, topspin).

Conclusion

We explored the important case of IRL in which demonstrators provide heterogeneous, strategy-based demonstrations while still seeking to maximize a latent task reward. By explicitly modeling the task and strategy rewards separately and jointly learning both, our algorithm is able to recover more robust task rewards, discover unique strategy rewards, and imitate strategy-specific behaviors.

Filed Under: Research Summaries

