Joint Goal and Strategy Inference across Heterogeneous Demonstrators via Reward Network Distillation

August 16, 2020 by Zac Chen, Rohan Paleja, Leng, and Matthew Gombolay

“Learning from Demonstration” (LfD) techniques aim to enable robots to be taught by non-roboticists how to perform tasks, such as folding laundry, driving a car, or performing surgery. For decades, researchers have relied on the incorrect assumption that every teacher performs a task similarly, i.e., that everyone folds laundry or performs surgery in the same fashion. However, this assumption is often not true. Humans (even experts) typically adopt heuristics (i.e., “mental shortcuts”) to solve challenging optimization problems due to limited working memory. These highly refined strategies manifest as heterogeneity in behavior across human task demonstrators. Only recently have researchers sought to relax this assumption of demonstrator homogeneity. However, even these newer approaches do not explicitly tease out precisely what was similar across all the demonstrations versus what was unique to each teacher. Rather, some previous methods divide the demonstrations into clusters beforehand and perform LfD on the homogeneous demonstrations within each cluster. This limits the amount of data accessible to the LfD algorithm, resulting in worse learning.

We take a new perspective on LfD by modeling every human teacher as having a shared notion of the desired task (a common goal) along with a teacher-specific notion (e.g., a strategy or style) for performing the task. Accordingly, we present Multi-Strategy Reward Distillation (MSRD), a novel Inverse Reinforcement Learning (IRL) technique that automatically finds and takes advantage of the common goal shared by all teachers (e.g., perform the task well) while teasing out and accounting for what makes each teacher unique! This blog illustrates how MSRD works through five parts:

  1. Preliminaries to Understand MSRD
  2. Method of MSRD
  3. Simulation Performance of MSRD
  4. Table Tennis Real-World Experiment of MSRD
  5. Conclusion

Preliminaries

In this section, we give a quick overview of the preliminaries.

What is Reinforcement Learning?

We briefly recap the principles of Markov Decision Processes (MDPs) and Reinforcement Learning (RL) in this section via a robot navigation task example (shown in Figure 1).

Figure 1: This figure depicts a robot navigation task example.

In this example, the robot's task is to navigate from the top left corner to the star in the bottom right corner. Mathematically, we can model the process as an MDP described by a 6-tuple $(S,A,R,T,\gamma,\rho_0)$.

  • $S$ represents the state space. In our example, a state $s\in S$ could be which grid the agent is standing on.
  • $A$ represents the action space. In the example, an action $a\in A$ could be trying to move to one of the four adjacent grid cells.
  • $\gamma\in (0,1)$ is the temporal discount factor, representing the relative preference towards instant reward versus long-term reward.
  • $R(s,a)$ represents the reward after executing action $a$ in the state $s$. In some cases, $R(s,a)$ could be simplified as $R(s)$. In our example, $R(s)$ is $0$ everywhere except for $+1$ for the star position.
  • $T(s,a,s^\prime)$ is the probability of transitioning to state $s^\prime$ from state $s$ after taking action $a$. For example, from the initial state with the action “go right”, the agent will move one step right with probability $1$ if the robot is perfect. But if the robot is imperfect (which is always the case!), it might instead go down with probability $0.1$.
  • $\rho_0(s)$ is the initial state distribution. In our example, $\rho_0$ assigns probability $1$ to the top left corner.
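
To make these definitions concrete, below is a minimal Python sketch (ours, not from the original post) of this 6-tuple for the navigation example. The 4×4 grid size is an assumption for illustration; the slip model follows the “imperfect robot” example above.

```python
import numpy as np

# A hypothetical 4x4 grid: start at the top-left corner (state 0),
# star at the bottom-right corner (state 15).
n_rows, n_cols = 4, 4
S = list(range(n_rows * n_cols))        # state space: grid cells, row-major
A = ["up", "down", "left", "right"]     # action space: move to an adjacent cell
gamma = 0.99                            # discount factor

R = np.zeros(len(S))                    # R(s) is 0 everywhere ...
R[15] = 1.0                             # ... except +1 at the star

rho_0 = np.zeros(len(S))                # initial state distribution:
rho_0[0] = 1.0                          # probability 1 of starting at top-left

def T(s, a):
    """Transition model T(s, a, .); returns {s': probability}.
    The move succeeds with probability 0.9; with probability 0.1 the
    imperfect robot slips one cell down instead (an assumed slip model)."""
    row, col = divmod(s, n_cols)
    moves = {"up": (row - 1, col), "down": (row + 1, col),
             "left": (row, col - 1), "right": (row, col + 1)}
    def clip(rc):                       # bumping into a wall keeps the state
        r, c = rc
        return r * n_cols + c if 0 <= r < n_rows and 0 <= c < n_cols else s
    probs = {}
    for s_next, p in [(clip(moves[a]), 0.9), (clip((row + 1, col)), 0.1)]:
        probs[s_next] = probs.get(s_next, 0.0) + p
    return probs
```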

The standard goal of reinforcement learning is to find the optimal policy $\pi^*(a|s)$ that maximizes the discounted future reward, shown in Equation \ref{eq:rl_objective}.
\begin{align}
J(\pi)=\mathbb{E}_{\tau\sim \pi}\left[\sum_{t=0}^T{\gamma^tR(s_t,a_t)}\right] \label{eq:rl_objective}
\end{align}
$\tau=(s_0,a_0,\cdots,s_T,a_T)$ denotes a sequence of states and actions induced by the policy and dynamics: $s_0\sim\rho_0(\cdot)$, $a_t\sim\pi(\cdot |s_t)$, $s_{t+1}\sim T(s_t,a_t,\cdot)$, $t\in\{0,1,\cdots,T\}$.

We instead consider a more general maximum-entropy objective [Ziebart 2008], which augments the standard objective with an entropy bonus to favor stochastic policies and to encourage exploration during optimization, shown in Equation \ref{eq:rl_me_objective}.
\begin{align}
J(\pi)=\mathbb{E}_{\tau\sim \pi}\left[\sum_{t=0}^T{\gamma^tR(s_t,a_t)}+\alpha H(\pi(\cdot | s_t))\right] \label{eq:rl_me_objective}
\end{align}
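
As a quick illustration of Equations \ref{eq:rl_objective} and \ref{eq:rl_me_objective}, here is a small Python sketch (ours) that estimates the objective from sampled trajectories, using $-\log\pi(a_t|s_t)$ as a one-sample estimate of the entropy bonus; the trajectory format is hypothetical.

```python
import numpy as np

def estimate_objective(trajectories, gamma=0.99, alpha=0.1):
    """Monte-Carlo estimate of J(pi). Each trajectory is a list of
    (reward, log_prob) pairs with log_prob = log pi(a_t | s_t)."""
    returns = []
    for traj in trajectories:
        g = 0.0
        for t, (reward, log_prob) in enumerate(traj):
            # gamma^t * R(s_t, a_t) plus the entropy bonus; -log pi(a_t|s_t)
            # is a single-sample estimate of H(pi(. | s_t))
            g += gamma**t * reward + alpha * (-log_prob)
        returns.append(g)
    return np.mean(returns)

# Example with two fake 3-step trajectories:
trajs = [[(0.0, -1.2), (0.0, -0.8), (1.0, -0.5)],
         [(0.0, -1.0), (1.0, -0.9), (1.0, -0.4)]]
print(estimate_objective(trajs))
```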

What is Inverse Reinforcement Learning?

Inverse Reinforcement Learning (IRL) considers an MDP sans reward function ($M\backslash R$), with the goal of inferring the reward function $R(s,a)$ from a set of demonstrated trajectories $\mathcal{D}=\{\tau_1, \tau_2, \cdots, \tau_N\}$. For example, in Figure 1, we draw two demonstrated trajectories $\tau_1$ and $\tau_2$. A typical assumption for IRL is that demonstrated trajectories are optimal, or at least near-optimal or stochastically optimal. In the case of our example ($R(s)$ is $0$ everywhere except for $+1$ at the star position), both trajectories are optimal, i.e., they reach the $+1$ reward in the least time.

Adversarial Inverse Reinforcement Learning (AIRL) casts the problem of LfD in a generative adversarial framework, learning a discriminator, $D$, that distinguishes between experts and a deceiving generator, $\pi$, that learns to imitate the expert. This framework follows Maximum-Entropy IRL's assumption that trajectory likelihood is proportional to the exponential of trajectory reward, i.e., $p(\tau)\propto \exp\left(\sum_{t=1}^T{\gamma^tr_t}\right)$. The discriminator $D$ is defined in Equation \ref{eqn:D_and_f}, where $f_\theta(s,a)$ is the learnable reward function and $\pi$ is the current policy. $D$ is updated by minimizing its cross-entropy loss for distinguishing expert trajectories from generator policy rollouts (Equation \ref{eqn:airl_loss}). The generator $\pi$ is trained via RL (Equation \ref{eq:rl_me_objective}) to maximize the pseudo-reward function given by $f(s,a)$.
\begin{equation}
D_\theta(s,a)=\frac{\exp(f_\theta(s,a))}{\exp(f_\theta(s,a))+\pi(a|s)}
\label{eqn:D_and_f}
\end{equation}

\begin{align}
L_D=-\mathbb{E}_{\tau\sim\mathcal{D}, (s,a)\sim\tau}[&\log D(s,a)]-\mathbb{E}_{\tau\sim\pi, (s,a)\sim\tau}[\log (1-D(s,a))] \label{eqn:airl_loss}
\end{align}

TL;DR: IRL algorithms (including AIRL) learn a reward function that explains the demonstrated behavior. Further, IRL trains a policy that maximizes the learned reward, which should then behave similarly to the demonstrations.
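
To make this concrete, here is a minimal PyTorch sketch (ours, not the AIRL authors' code) of the discriminator update. It uses the fact that, by Equation \ref{eqn:D_and_f}, the logit of $D$ is $f_\theta(s,a)-\log\pi(a|s)$, so Equation \ref{eqn:airl_loss} becomes a standard binary cross-entropy loss; the network architecture and dimensions are assumptions.

```python
import torch
import torch.nn as nn

state_dim, action_dim = 4, 2
f_theta = nn.Sequential(                 # learnable reward function f_theta(s, a)
    nn.Linear(state_dim + action_dim, 64), nn.Tanh(), nn.Linear(64, 1))

def discriminator_loss(expert_sa, policy_sa, expert_logp, policy_logp):
    """expert_sa / policy_sa: (batch, state_dim + action_dim) state-action pairs;
    *_logp: (batch, 1) log pi(a|s) under the current generator policy."""
    logit_expert = f_theta(expert_sa) - expert_logp   # logit of D(s, a)
    logit_policy = f_theta(policy_sa) - policy_logp
    bce = nn.BCEWithLogitsLoss()
    # expert pairs are labeled 1, generator rollouts are labeled 0
    return (bce(logit_expert, torch.ones_like(logit_expert)) +
            bce(logit_policy, torch.zeros_like(logit_policy)))
```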

Multi-Strategy Reward Distillation Method

In MSRD, we consider a heterogeneous demonstration dataset, $\mathcal{D}^{(i)}=\{\tau_1^{(i)},\tau_2^{(i)}, \cdots, \tau_M^{(i)}\}$, where $i\in\{1,2,\cdots, N\}$ is the strategy index, $M$ is the number of demonstration trajectories for each strategy, and $N$ is the number of strategies. In our robot navigation example, $\mathcal{D}^{(1)}=\{\tau_1^{(1)}\}$ and $\mathcal{D}^{(2)}=\{\tau_1^{(2)}\}$, i.e., there are two different strategies, and each strategy has one demonstration.

The core assumption of MSRD is fairly straightforward: the reward function each demonstrator is optimizing is a linear combination of the task reward ($R^{(0)}$) and a strategy-only reward ($\widetilde{R}^{(i)}$), as given by Equation \ref{eq:msrd_assumption}. When the strategy-only reward is combined with the task reward, it becomes the strategy-combined reward ($R^{(i)}$), meaning one wants to finish the task while also maintaining the strategy.
\begin{align}
R^{(i)}(\cdot) &= R^{(0)}(\cdot) + \alpha_i \widetilde{R}^{(i)}(\cdot)
\label{eq:msrd_assumption}
\end{align}

Our approach is unique in that it shares the task reward with all the demonstrations while having the ability to tease out demonstrator-specific nuances (i.e., heterogeneity).

Figure 2: This diagram shows how MSRD learns.

To infer the shared task reward function, we propose utilizing network distillation to distill common knowledge from each separately learned strategy-combined reward $R^{(i)}$ into the task reward function $R^{(0)}$. The learning diagram is shown in Figure 2. We also want to regularize $R^{(i)}$ to be close to $R^{(0)}$, since we have the prior knowledge that, despite heterogeneous strategic preferences, all experts should still prioritize optimizing the task reward $R^{(0)}$ to achieve at least near-optimal performance. We propose to regularize the expected L2-norm of the difference between the reward functions, as shown in Equation \ref{eqn:l2}, in which $\pi^{(i)}$ is the optimal policy under the reward function $R_{\theta_i}^{(i)}$.
\begin{align}
\label{eqn:l2}
L_\text{reg}=\mathbb{E}_{(s,a)\sim \pi^{(i)}}\left({\left\lvert\left\lvert R_{\theta_i}^{(i)}(s,a)-R_{\theta_0}^{(0)}(s,a)\right\rvert\right\rvert}_2\right)
\end{align}
Note that we are using an index both on $\theta$ and $R$ to denote that each strategy-combined reward $R^{(i)}$ has its own reward parameters, and that these are approximated by separate neural networks with parameters $\theta_i$ for each strategy and $\theta_0$ for the task reward. There is no parameter sharing between strategy and task reward.

To avoid the computational cost of fully optimizing $\pi^{(i)}$ and $R^{(i)}$, we apply an iterative reward-function and policy training schedule similar to AIRL. Combining AIRL's objective (Equation \ref{eqn:airl_loss}) with the distillation objective, we want to maximize $L_D$ in Equation \ref{eqn:vanilla_distillation}.
\begin{align}
\label{eqn:vanilla_distillation}
L_D=\sum_{i=1}^N&\left[\mathbb{E}_{(s,a)\sim \tau_j^{(i)}\sim \mathcal{D}^{(i)}}\left[\log D_{\theta_i}(s,a)\right]
+ \mathbb{E}_{(s,a)\sim \pi^{(i)}}\left[\log (1-D_{\theta_i}(s,a))\right]
- \mathbb{E}_{(s,a)}\left[{\left\lvert\left\lvert R_{\theta_i}^{(i)}(s,a)-R_{\theta_0}^{(0)}(s,a)\right\rvert\right\rvert}_2\right]\right] \end{align}
Yet, while Equation \ref{eqn:vanilla_distillation} should be able to distill the shared reward into $R_{\theta_0}^{(0)}$, the distillation is inefficient: before distillation succeeds, $R_{\theta_0}^{(0)}$ acts as a strong regularizer on $R_{\theta_i}^{(i)}$.

Instead, our structure in Equation \ref{eq:msrd_assumption} allows for a two-column (reference) re-parameterization, speeding up knowledge transfer and making the learning process easier. Combining Equation \ref{eq:msrd_assumption} and Equation \ref{eqn:vanilla_distillation}, we arrive at Equation \ref{eqn:reparameterized_distillation}.

\begin{align}
\label{eqn:reparameterized_distillation}
L_D&=\sum_{i=1}^N\left[\mathbb{E}_{(s,a)\sim \tau_j^{(i)}\sim \mathcal{D}^{(i)}}\left[\log D_{\theta_i,\theta_0}(s,a)\right]
+ \mathbb{E}_{(s,a)\sim \pi^{(i)}}\left[\log \left(1-D_{\theta_i,\theta_0}(s,a)\right)\right]
- \alpha_i\mathbb{E}_{(s,a)}\left[{\left\lvert\left\lvert \widetilde{R}_{\theta_i}^{(i)}(s,a)\right\rvert\right\rvert}_2\right]\right] \end{align}

The key difference between Equation \ref{eqn:reparameterized_distillation} and Equation \ref{eqn:vanilla_distillation} is that $D$ now depends on both $R_{\theta_0}^{(0)}$ and $\widetilde{R}_{\theta_i}^{(i)}$ rather than on a separate $R_{\theta_i}^{(i)}$. Thus, $R_{\theta_0}^{(0)}$ updates directly from the discriminator's loss rather than waiting for knowledge to be learned by a strategy-combined reward and subsequently distilled into the task reward.

Further, the last term of Equation \ref{eqn:reparameterized_distillation} reduces to a simple L2-regularization on the strategy-only reward's output, weighted by $\alpha_i$. This formulation provides a new view for interpreting the relative weight $\alpha_i$ of the strategy-only reward: the larger $\alpha_i$ is, the more the strategy-only reward influences the strategy-combined reward, so we apply stronger regularization to keep it from overwhelming the task reward. Comparing Equations \ref{eqn:reparameterized_distillation} and \ref{eqn:airl_loss} suggests another view of MSRD: optimizing $\theta_i$ via the IRL objective alone yields a combination of task and strategy reward, and regularizing the strategy reward encourages $\theta_i$ to encode only the necessary strategy-specific information while pushing shared knowledge into $\theta_0$.
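
Below is a minimal PyTorch sketch (ours, not the authors' implementation) of this reparameterized objective: the discriminator logit for strategy $i$ is built from $R_{\theta_0}^{(0)}(s,a)+\alpha_i\widetilde{R}_{\theta_i}^{(i)}(s,a)$, and the strategy-only output is L2-regularized. Minimizing this loss corresponds to maximizing Equation \ref{eqn:reparameterized_distillation}; the network sizes, $N$, and $\alpha_i$ values are assumptions.

```python
import torch
import torch.nn as nn

state_dim, action_dim, N = 4, 2, 3

def make_net():
    return nn.Sequential(nn.Linear(state_dim + action_dim, 64),
                         nn.Tanh(), nn.Linear(64, 1))

task_reward = make_net()                                        # shared R^(0)
strategy_rewards = nn.ModuleList(make_net() for _ in range(N))  # R~^(i) per strategy
alphas = [0.01] * N                                             # weights alpha_i

def msrd_loss(i, expert_sa, policy_sa, expert_logp, policy_logp):
    """Discriminator loss for strategy i (experts labeled 1, rollouts 0)."""
    def f(sa):   # strategy-combined reward: R^(i) = R^(0) + alpha_i * R~^(i)
        return task_reward(sa) + alphas[i] * strategy_rewards[i](sa)
    bce = nn.BCEWithLogitsLoss()
    logit_e = f(expert_sa) - expert_logp                        # AIRL-style logit of D
    logit_p = f(policy_sa) - policy_logp
    adv = (bce(logit_e, torch.ones_like(logit_e)) +
           bce(logit_p, torch.zeros_like(logit_p)))
    # L2 regularization on the strategy-only reward's output (last term above)
    reg = alphas[i] * strategy_rewards[i](policy_sa).norm(dim=1).mean()
    return adv + reg
```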

Simulation Results

We tested MSRD on two simulated environments: Inverted Pendulum and Hopper. The goal of the Inverted Pendulum task is to balance a pendulum on a cart by moving the cart left/right; its reward is the negative absolute value of the pendulum's angle from the upright position. The objective of Hopper is to control its 3-DoF joints to move forward; its reward is its forward speed. We removed termination conditions to allow greater flexibility in behaviors.

For our virtual experiments, we first generated a variety of demonstrations to emulate the heterogeneous strategies that humans apply when solving problems, via “DIAYN + Extrinsic Reward” and a novel method we call “KL-Encouraged + Extrinsic Reward”. Both methods encourage the trained policies to be different from each other while still accomplishing the task goal. We generated 20 different strategies for Inverted Pendulum and the 2 most distinct strategies for Hopper, shown in Figure 3 and Figure 4.

Figure 3: These GIFs show different strategies for Inverted Pendulum.

Figure 4: These GIFs show different strategies for Hopper.

Hypotheses

We explore two hypotheses to elucidate MSRD's advantage in recovering both the latent task reward (the essential goal of the demonstrators) and the means by which the task is accomplished (i.e., the strategy):

H1: The task reward learned by MSRD has a higher correlation with the true task reward than AIRL.
H2: Strategy-only reward learned by MSRD has a higher correlation with true strategic preferences than AIRL.

We assessed both hypotheses quantitatively and qualitatively for the simulation environments as the ground-truth reward functions are available.

To test H1, we constructed a dataset of trajectories with various task performances. We then evaluated the reward functions learned by AIRL and MSRD on these trajectories, comparing estimated vs. ground-truth rewards. We show the correlation of estimated rewards with ground-truth task rewards in Figure 5. The task reward function learned by MSRD has a higher correlation with the ground-truth reward function (0.99 and 0.94) than AIRL (0.51 and 0.89) for each domain, respectively. AIRL's reward function overfits to some strategies and mixes the task reward with those strategies' strategy-only rewards, making its estimates unreliable for other strategies' trajectories.

Figure 5: Correlation between ground-truth and estimated task reward, normalized to $[0,1]$, for Inverted Pendulum (left) and Hopper (right). $r$ is the correlation coefficient.

To test H2, we calculated the correlations of MSRD’s strategy-only rewards with the true strategic preferences and compared that with the correlation of AIRL’s rewards when AIRL is trained on each individual strategy. In simulated domains, true strategic preferences are available. Correlations of both methods for all strategy rewards in inverted pendulum are shown in Figure 6. A paired t-test shows that MSRD achieves a statistically significantly higher correlation (M = 0.779, SD = 0.239) for the strategy rewards versus AIRL (M = 0.635, SD = 0.324) trained separately for each strategy, $t(19) = 1.813$, $p = 0.0428$ (one-tailed). A Shapiro-Wilk test showed the residuals were normally distributed ($p = 0.877$). For the hopper domain, MSRD achieved $0.85$ and $0.93$ correlation coefficient for the hop and crawl strategy, compared with AIRL’s $0.80$ and $0.82$ respectively.

Figure 6: Correlation between ground-truth and estimated strategy reward on Inverted Pendulum.

We further show qualitative results for the learned strategy rewards. For Inverted Pendulum, we fix three state dimensions (cart position, cart velocity, and pendulum angular velocity) to zero and investigate how the reward changes along the remaining dimension (pendulum angle). The relationship between reward and pendulum angle for the task and strategy reward functions is illustrated in Figure 7, in which the task reward function reaches its peak when the angle of the pendulum is near zero. This precisely recovers the task reward. As for the strategy-only reward functions, strategy 13 encourages the pendulum to lean left, while strategy 4 encourages the policy to tilt the pendulum right. Thus, the strategy-only rewards learned by MSRD capture the specific preferences within the demonstrations. Figure 7 also shows that the magnitude of the task reward is larger than that of the strategy reward, affirming our expectation that the emphasis is on accomplishing the task.

Figure 7: This figure illustrates the learned task and strategy rewards on Inverted Pendulum.

In the Hopper environment, it is harder to visualize the reward landscape due to the high-dimensional observation space and the lack of interpretability of states. Therefore, instead of visualizing a reward curve, we evaluate each estimated strategy-only reward on trajectories from both strategies. Figure 8 shows that, given a hopping trajectory, the hop strategy-only reward function assigns that behavior a higher reward than the crawl strategy-only reward function does. Similarly, for a crawl trajectory, the crawl strategy-only reward gives a higher value than the hop strategy-only reward. Thus, each strategy-only reward function recovered by MSRD rewards its corresponding behavior more than the other strategy's does, thereby encouraging the policy toward the intended behavior.

Figure 8: Strategy rewards evaluated on demonstrations of each strategy in Hopper.

These results across our simulated environments show our algorithm's success in both task reward recovery and strategy reward decomposition. This capability is a novel contribution to the field of LfD: we are able to tease out strategies from the underlying task and effectively learn policies that reproduce both the strategy and a well-performing policy for the underlying task.

Robot Table Tennis Results

We tested MSRD on its capability to learn various table tennis strokes from expert human demonstrations. Participants were asked to kinesthetically teach a robot arm to hit an incoming ping pong ball using three different stroke strategies: push, slice, and topspin. The change in angle and the upward/downward motion of the paddle throughout the trajectory are the key factors distinguishing the strategies (as these induce spin). A push involves only a small change in angle, as it does not attempt to add any spin to the ball. A slice places backspin on the ball, so the paddle quickly tilts up. Conversely, to achieve topspin, the trajectory has a quick upward motion.

After just 30 minutes of training on each strategy, the robot was able to learn to strike $83\%$ of the fed balls, as shown in Figure 9. The robot learned to perform all three strategies, and with its best strategy, topspin, the robot successfully returned $90\%$ of balls.

Figure 9: These GIFs show the three strategies of table tennis strikes (from top to bottom: push, slice, topspin).

Conclusion

We explored the important case of IRL in which demonstrators provide heterogeneous, strategy-based demonstrations while seeking to maximize a latent task reward. By explicitly modeling the task and strategy rewards separately and jointly learning both, our algorithm recovers more robust task rewards, discovers unique strategy rewards, and imitates strategy-specific behaviors.


Bootcamp Summer 2020 Week 1 – Gradient Descent

June 17, 2020 by Zac Chen and Matthew Gombolay

Gradient Descent (GD) is an intuitive optimization method. GD has become one of the most popular optimization methods in recent years due to its nice theoretical properties, effectiveness in real-world applications, and uncanny ability to train large-scale deep learning models. GD is commonly used in deep learning to find neural network parameters that best capture the patterns in the data. In this blog, we introduce three different flavors of GD: regular GD, Stochastic GD (SGD), and Minibatch GD (MbGD). We analyze these variants theoretically as well as empirically and discuss which is most appropriate for different scenarios. Finally, we discuss how GD can be improved or “accelerated” by adding “momentum”.

This blog is divided into four parts:

  1. Introduction of GD, SGD, and MbGD
  2. Analytical results
  3. Empirical results
  4. Accelerated GD

What are GD, SGD, and MbGD?

Let's first look at a simple machine learning scenario. We have a dataset $\mathcal{D}=\{(x_i,y_i),\ i\in [n]\}$, where $(x_i,y_i)$ is “example” number $i$: $x_i$ is the input to the model we want to train, and $y_i$ is the “label” we want our model to learn to predict from $x_i$. For instance, $x_i$ might be the date of the year, and $y_i$ might be the expected rainfall on that day. In this simplified example, we assume $x_i$ is a scalar instead of a vector, but the concepts and the math carry over to the vector case (where $x_i$ contains several features) too.

To allow us to visualize the model, let’s assume we are using a linear model parameterized by $\theta$ shown in Equation \ref{f}.
\begin{align}
f_\theta(x_i)=\theta x_i
\label{f}
\end{align}
The metric we use to judge our model (i.e., a “loss function”) is the Mean Squared Error (MSE) loss (Equation \ref{L}), which calculates the squared error of each prediction and averages over all data points (the extra factor of $\frac{1}{2}$ simply makes the gradient cleaner).
\begin{align}
L(\theta)=\frac{1}{2n}\sum_{i=1}^n(y_i-f_\theta(x_i))^2
\label{L}
\end{align}
We want to find the best model, $\theta^*$, according to our loss function (Equation \ref{loss}).
\begin{align}
\theta^*=\arg\min_\theta L(\theta)
\label{loss}
\end{align}

$L$ is a quadratic function of the model parameter $\theta$, illustrated in Figure 1. The problem then is to find the critical point, $\theta^*$, located at the minimum of this curve. To find the critical point, we set the first derivative of $L$ equal to $0$, i.e., $\frac{\partial L(\theta)}{\partial \theta}=0$. In the case of a simple linear model, we can solve directly for the model parameters $\theta^*$ (see the derivation below). However, in deep learning, such direct methods do not exist. Instead, we need something else. In this tutorial, that “something else” is GD.
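
For completeness, here is that direct solution for our linear model, obtained by setting the derivative of Equation \ref{L} to zero:
\begin{align}
\frac{\partial L(\theta)}{\partial \theta}=-\frac{1}{n}\sum_{i=1}^n(y_i-\theta x_i)x_i=0
\quad\Longrightarrow\quad
\theta^*=\frac{\sum_{i=1}^n x_iy_i}{\sum_{i=1}^n x_i^2}
\end{align}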

Figure 1: This figure illustrates the Gradient Descent (GD) process.

GD

In GD, we perform three steps:

  1. Calculate the gradient of the loss with respect to our model parameters.
  2. Use the gradient to update model parameters.
  3. Repeat steps 1 and 2 until a pre-set tolerance is achieved.

We use $\nabla$ as the gradient operator, defined in Equation \ref{gradient}.
\begin{align}
\nabla_\theta=\left[\begin{matrix}
\frac{\partial}{\partial \theta_1}\\
\frac{\partial}{\partial \theta_2}\\
\vdots\\
\frac{\partial}{\partial \theta_m}\\
\end{matrix}\right] \label{gradient}
\end{align}
First, we calculate the gradient of the MSE loss $L$ (Equation \ref{L}) w.r.t. $\theta$, shown in Equation \ref{gradient_of_L}.
\begin{align}
\nabla_\theta L=-\frac{1}{n}\sum_{i=1}^n(y_i-f_\theta(x_i))\nabla_\theta f_\theta(x_i)\bigg\rvert_{x_i}
\label{gradient_of_L}
\end{align}
For our linear model (Equation \ref{f}), we can write the gradient as in Equation \ref{L_linear}.
\begin{align}
\begin{split}
\nabla_\theta L&=-\frac{1}{n}\sum_{i=1}^n(y_i-\theta x_i)x_i\\
&=\frac{1}{n}\sum_{i=1}^nx_i^2\theta-\frac{1}{n}\sum_{i=1}^nx_iy_i
\end{split}
\label{L_linear}
\end{align}
The gradient $\nabla_\theta L$ is a linear function of $\theta$, as shown in Figure 2.

Figure 2: This figure depicts an example uni-dimensional gradient for Equation \ref{L_linear}.

Second, we would like to use the calculated gradient to update our model parameters. As we want to minimize the loss function, GD follows the negative gradient direction. We update our parameters by this gradient multiplied by a small step size $\alpha$ (a hyperparameter), as shown in Equation \ref{negative_gradient_step}.
\begin{align}
\theta^{(i)}\leftarrow \theta^{(i-1)}-\alpha \nabla_\theta L(\theta^{(i-1)})
\label{negative_gradient_step}
\end{align}
Empirically, we choose $\alpha$ to be between ${10}^{-5}$ and ${10}^{-3}$. An $\alpha$ that is too small results in slower learning and requires more iterations of steps (1) and (2). An $\alpha$ that is too large may be even more detrimental: it may cause the model parameters $\theta$ to oscillate or even diverge.

Finally, GD continues to iterate until a pre-set tolerance is achieved (e.g., the norm of the gradient becomes small, the loss value falls below a threshold, or a maximum number of iterations is reached).
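
Putting the three steps together for our linear model, here is a minimal Python sketch (ours); the data and hyperparameter values are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, size=100)            # made-up inputs
y = 4.5 * x + rng.normal(0.0, 1.0, size=100)    # labels: true theta = 4.5, plus noise

def grad(theta):
    # gradient of the MSE loss: -(1/n) * sum_i (y_i - theta*x_i) * x_i
    return -np.mean((y - theta * x) * x)

theta, alpha = 0.0, 1e-3                        # initial guess and step size
for i in range(10_000):
    g = grad(theta)                             # step 1: calculate the gradient
    if abs(g) < 1e-6:                           # step 3: stop at a pre-set tolerance
        break
    theta -= alpha * g                          # step 2: negative gradient update
print(theta)                                    # approaches 4.5
```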

If the function to be optimized (in our case, the loss function $L$) is convex w.r.t. the parameters $\theta$, GD is guaranteed to converge to a global minimum, given that $\alpha$ is small enough. However, if this assumption does not hold, the model parameters can become stuck in a local minimum of the loss function $L$. For non-convex problems, we often perform GD several times with randomly initialized values of $\theta$, hoping that one run finds the global minimum.

Why take the negative gradient direction?

In order to minimize the loss in Figure 1, we look at two cases: where $\theta >4.5$ and $\theta < 4.5$. If $\theta$ is greater than $4.5$, we need to move left along the loss curve to decrease our loss value. If $\theta$ is less than $4.5$, we should move right.

From Figure 2, we can see that $\nabla_\theta L>0$ when $\theta>4.5$ and $\nabla_\theta L <0$ when $\theta<4.5$. Using the negative of these gradients in our update rule therefore improves $\theta$, i.e., decreases our loss function.

This idea of “taking a step” comes from using a first order Taylor Expansion of $L(\theta)$, mathematically shown in Equation \ref{taylor_expansion}.
\begin{align}
L(\theta^\prime)\approx L(\theta)+(\theta^\prime-\theta)\nabla_\theta L(\theta)
\label{taylor_expansion}
\end{align}
Therefore, if our goal is to make $L(\theta^\prime)$ smaller, we can take $\theta^\prime=\theta-\alpha\nabla_\theta L(\theta)$, as
\begin{align}
\begin{split}
L(\theta-\alpha\nabla_\theta L(\theta))&\approx L(\theta)-\alpha\nabla_\theta L(\theta)\nabla_\theta L(\theta) \\
&= L(\theta)-\alpha(\nabla_\theta L(\theta))^2\\
&< L(\theta).\quad ((\nabla_\theta L(\theta))^2\geq 0, \alpha>0)
\end{split}
\end{align}

Accordingly, GD is guaranteed to reduce the loss function if $\alpha$ is sufficiently small. As a result, GD will go down the hill, as shown in Figure 1.

SGD

Stochastic Gradient Descent (SGD) has one fundamental difference compared to standard GD: instead of using the entire dataset to calculate the gradient $\nabla_\theta L$ at each step, we sample a single data point from the dataset and use it to calculate $\nabla_\theta L$. Specifically, in our example, the gradient of the loss function computed on one data point is shown in Equation \ref{gradient_SGD}.
\begin{align}
\begin{split}
\nabla_\theta L(\theta^{(i-1)})&=-(y_j-\theta^{(i-1)} x_j) x_j\\
&=x_j^2\theta^{(i-1)}-x_jy_j\\
& j\sim \mathcal{U}\{1,n\}
\end{split}
\label{gradient_SGD}
\end{align}
Here, $\mathcal{U}\{1,n\}$ represents the discrete uniform distribution over $\{1,\cdots,n\}$.

Because we compute our gradient update using only one data point, SGD requires fewer floating-point operations (FLOPs) than GD in each iteration. However, since we calculate the gradient from just one sample, and the data may be noisy (typical of real-world datasets), SGD may take steps in the wrong direction.

MbGD

Minibatch Gradient Descent represents a middle ground between GD and SGD, sampling a small subset of examples from our dataset to calculate $\nabla_\theta L$:
\begin{align}
\begin{split}
\nabla_\theta L(\theta^{(i-1)})&=-\frac{1}{n^\prime}\sum_{j\in\mathcal{B}^{(i)}}(y_j-\theta^{(i-1)} x_j) x_j\\
&=\frac{1}{n^\prime}\sum_{j\in\mathcal{B}^{(i)}}\left(x_j^2\theta^{(i-1)}-x_jy_j\right)
\end{split}
\end{align}
In this equation, $\mathcal{B}^{(i)}=\{i_1,\cdots,i_{n^\prime}\}$ with $i_j\sim \mathcal{U}\{1,n\}$, and $n^\prime$, the size of the minibatch, is a hyperparameter you choose.
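
Here is a minimal Python sketch (ours) of this sampled update on the earlier linear-model example. The batch_size parameter is our own addition: batch_size=1 recovers SGD, while 1 < batch_size < n gives MbGD.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, size=100)            # same made-up data as before
y = 4.5 * x + rng.normal(0.0, 1.0, size=100)

def sgd_step(theta, alpha=1e-3, batch_size=1):
    batch = rng.integers(0, len(x), size=batch_size)  # j ~ U{1, n}
    xb, yb = x[batch], y[batch]
    g = -np.mean((yb - theta * xb) * xb)        # gradient on the sampled batch only
    return theta - alpha * g

theta = 0.0
for i in range(50_000):
    theta = sgd_step(theta, batch_size=8)       # a minibatch of size n' = 8
print(theta)                                    # noisy, but lands near 4.5
```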

Visualization of GD, MbGD and SGD

The difference in optimization processes among the three variants of GD is illustrated in Figure 3. Moving from GD to MbGD and SGD, the updates become noisier and take more iterations to converge, but the computation time for each iteration is lower.

Figure 3: GD vs. MbGD vs. SGD.

Comparison of Theoretical Properties

Here, we investigate the pros and cons for each gradient descent mechanism.

This table summarizes the pros and cons of each GD variant.

                     GD     MbGD   SGD
Noise in Updates     Low    Mid    High
FLOPs/iter           High   Mid    Low
RAM usage            High   Mid    Low

For large datasets such as ImageNet, it is impossible for GD to run on most computers, as the memory requirement is extremely large (i.e., you cannot load the entire dataset, $\mathcal{D}$, into RAM).

We might wonder: since SGD uses few FLOPs per iteration but has high gradient noise, will the noise slow down SGD's overall convergence?

Researchers have provided the answer (reference) ($k$ represents the algorithm iteration count):

This table displays the convergence rate of each GD variant under different assumptions.

                                         GD                                  SGD
Convex Function                          $\mathcal{O}(\frac{1}{\sqrt{k}})$   $\mathcal{O}(\frac{1}{\sqrt{k}})$
Lipschitz Gradient + Convex Function     $\mathcal{O}(\frac{1}{k})$          $\mathcal{O}(\frac{1}{\sqrt{k}})$
Strongly Convex Function                 $\mathcal{O}(\gamma^k)$             $\mathcal{O}(\frac{1}{k})$

Here, we measure convergence by $\mathbb{E}[L(\theta^{(k)})-L(\theta^*)]$ in which $\theta^*$ represents the best parameters.

Note that for a loss function $L$ to be convex, the second derivative with respect to $\theta$ must be non-negative for ALL $\theta$ and $x_i\in \mathcal{D}$ (for differentiable functions), as shown in Equation \ref{second_direvative_greater_than_0}.
\begin{align}
\frac{\partial^2 L(\theta)}{\partial \theta^2}\geq 0,\quad\forall\theta, x_i \in\mathcal{D}
\label{second_direvative_greater_than_0}
\end{align}
Performing GD on convex functions is guaranteed to find the global optimum, $\theta^*$; this is generally not guaranteed for non-convex functions. For example, our MSE loss for the linear model is convex, since $\frac{\partial^2 L(\theta)}{\partial \theta^2}=\frac{1}{n}\sum_{i=1}^n x_i^2\geq 0$.

A Lipschitz gradient means the gradient of the loss $L$ w.r.t. $\theta$ cannot change dramatically locally, as described in Equation \ref{lipschitz_gradient}.
\begin{align}
||\nabla L(\theta)-\nabla L(\theta^\prime)||\leq \Omega ||\theta-\theta^\prime||,\quad\exists\Omega,\forall \theta,\theta^\prime
\label{lipschitz_gradient}
\end{align}

Strongly convex means the function is not only convex but also has at least a certain curvature, like a quadratic function (reference), as shown in Equation \ref{strongly_convex}.
\begin{align}
\langle\nabla L(\theta)-\nabla L(\theta^\prime),\theta-\theta^\prime\rangle\geq \gamma ||\theta-\theta^\prime||^2,\quad\exists\gamma>0,\forall \theta,\theta^\prime
\label{strongly_convex}
\end{align}

Although these assumptions do not always hold for general loss functions and model choices, we can conclude that SGD's convergence rate is not as good as GD's in the latter two cases, but their rates are actually the same under the plain convexity assumption! In practice, we often cannot verify these theoretical properties for GD vs. SGD, but we can at least go forward knowing that SGD performs comparably on convex functions and is often more feasible when working with big data.

Take Away

Although SGD has noisier updates, its convergence rate on convex functions is the same as GD's. Further, SGD requires fewer FLOPs per iteration. Therefore, SGD requires fewer total FLOPs to find our best model, $\theta^*$. SGD for the win!

Empirical Comparison

Here, we show an example optimization using the three methods. In line with our theoretical results, they converge similarly in terms of iterations, and GD > MbGD > SGD in terms of total FLOPs.

Note that SGD does have higher noise during the process, while MbGD's updates are nearly as smooth as GD's, i.e., the use of a minibatch does help stabilize the updates.

Figure 4: This figure displays an empirical comparison between GD, MbGD, and SGD.

Take Away

When the dataset is small, GD is preferred due to its low-noise updates. When the dataset gets large enough, GD becomes infeasible. In that case, SGD is preferred if the gradient noise is not significant enough to ruin the optimization process; otherwise, MbGD is preferred. We provide our code on Github.

Accelerating GD and SGD

Several variants of GD (such as Momentum GD, RMSProp, and Adam) all utilize the core idea of momentum. What is momentum, then?

We introduce the initial momentum as
\begin{align}
m^{(0)}=0.
\end{align}

For each iteration, instead of using the gradient to directly update $\theta$, we use the gradient to update the momentum, as shown in Equation \ref{momentum}.
\begin{align}
m^{(i)}=\beta m^{(i-1)}+\nabla_\theta L(\theta^{(i-1)})
\label{momentum}
\end{align}
$\beta\in[0,1)$ is a hyperparameter that you tune, and $\beta=0$ recovers basic GD.
We then use the momentum to update $\theta$ according to Equation \ref{theta_update_momentum}.
\begin{align}
\theta^{(i)}=\theta^{(i-1)}-\alpha m^{(i)}
\label{theta_update_momentum}
\end{align}
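
A minimal Python sketch (ours) of these two updates on the earlier linear-model example; the data and the values of $\alpha$ and $\beta$ are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, size=100)    # same made-up data as before
y = 4.5 * x + rng.normal(0.0, 1.0, size=100)

theta, m = 0.0, 0.0                     # initial parameter and momentum m^(0) = 0
alpha, beta = 1e-3, 0.9
for i in range(2_000):
    g = -np.mean((y - theta * x) * x)   # gradient of the MSE loss
    m = beta * m + g                    # momentum update: m <- beta * m + grad
    theta -= alpha * m                  # parameter update uses m, not g
print(theta)                            # converges to ~4.5 faster than plain GD
```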

What is momentum essentially doing, and when is it useful?

Consider the snowboarding-in-a-half-pipe-like optimization problem depicted in Figure 5.

Figure 5: This figure shows a snowboard-like optimization problem.

Often in these situations, standard GD will oscillate back and forth because the gradient is larger in the vertical direction than in the horizontal direction. However, each gradient step still moves slightly to the right, slowly pushing us toward the optimum.

Momentum techniques exploit this phenomenon: the oscillating vertical components of successive gradients cancel out while the consistent horizontal components accumulate, producing the result shown in Figure 6.

Figure 6: This figure displays momentum acceleration on the snowboard-like optimization problem.

Another case where momentum helps is when the loss function has a flat region (plateau), as shown in Figure 7.

Figure 7: This figure shows an example of a plateau function.

In this case, gradient descent updates to $\theta$ slow down in the plateau area, even reaching $0$ in extreme cases. However, as long as we accumulate enough momentum before the plateau, we can coast over it, avoiding the small-gradient problem. Momentum can also help push past local minima, so we may find good parameters even when our function is non-convex.

Theoretically, it can be proven that a close relative of Momentum GD, Nesterov's accelerated gradient descent, achieves $O(\frac{1}{k^2})$ convergence instead of $O(\frac{1}{k})$ (reference).

Take Away

Momentum is useful for solving many optimization problems. However, adding momentum requires extra hyperparameter tuning for $\beta$.

Conclusion

  1. GD, MbGD, and SGD converge at the same rate in terms of iterations.
  2. SGD requires fewer FLOPs but is noisy.
  3. Parallelization can help MbGD and GD do better in terms of wall-clock time.
  4. Momentum can help but needs careful hyperparameter tuning.
  5. Many famous optimizers, such as Adam, use momentum and fancy tricks, but at their core they are essentially SGD!

