Blogs · Deep Learning · Optimization

Gradient Descent Algorithm and Its Variants!

The most important technique in neural networks (as of now)

2020.11.13 · 9 min read · by Zhenlin Wang · updated 2022-08-19

Overview of Gradient Descent

Optimization refers to the task of minimizing/maximizing an objective function \(f(x)\) parameterized by $x$. In machine/deep learning terminology, it’s the task of minimizing the cost/loss function $J(w)$ parameterized by the model’s parameters \(w \in \mathbb{R}^d\). Optimization algorithms (in case of minimization) have one of the following goals:

There are three kinds of optimization algorithms:

Gradient Descent is the most common optimization algorithm in machine learning and deep learning. It is a first-order optimization algorithm. This means it only takes into account the first derivative when performing the updates on the parameters. On each iteration, we update the parameters in the opposite direction of the gradient of the objective function $J(w)$ w.r.t to the parameters where the gradient gives the direction of the steepest ascent. The size of the step we take on each iteration to reach the local minimum is determined by the learning rate $\alpha$. Therefore, we follow the direction of the slope downhill until we reach a local minimum.

In this notebook, we’ll cover gradient descent algorithm and its variants: Batch Gradient Descent, Mini-batch Gradient Descent, and Stochastic Gradient Descent.

Let’s first see how gradient descent and its associated steps works on logistic regression before going into the details of its variants. For the sake of simplicity, let’s assume that the logistic regression model has only two parameters: weight $w$ and bias $b$.

  1. Initialize weight $w$ and bias $b$ to any random numbers.

  2. Pick a value for the learning rate $\alpha$. The learning rate determines how big the step would be on each iteration.

    • If $\alpha$ is very small, it would take long time to converge and become computationally expensive.

    • IF $\alpha$ is large, it may fail to converge and overshoot the minimum.

      Therefore, plot the cost function against different values of $\alpha$ and pick the value of $\alpha$ that is right before the first value that didn’t converge so that we would have a very fast learning algorithm that converges (Figure 1).

      Figure 2
    • The most commonly used rates are : 0.001, 0.003, 0.01, 0.03, 0.1, 0.3.

  3. Make sure to scale the data if it’s on very different scales. If we don’t scale the data, the level curves (contours) would be narrower and taller which means it would take longer time to converge (Figure 2).

    Figure 2
    • Scale the data to have $\mu = 0$ and $\sigma = 1$. Below is the formula for scaling each example:

      $$\frac{x_i - \mu}{\sigma}\tag{1} $$

  4. On each iteration, take the partial derivative of the cost function $J(w)$ w.r.t each parameter (gradient):

    $$\frac{\partial}{\partial w}J(w) = \nabla_w J\tag{2}$$

    $$\frac{\partial}{\partial b}J(w) = \nabla_b J\tag{3}$$

    The update equations are:

    $$w = w - \alpha \nabla_w J\tag{4}$$

    $$b = b - \alpha \nabla_b J\tag{5}$$

    • For the sake of illustration, assume we don’t have bias. If the slope of the current values of $w > 0$, this means that we are to the right of optimal $w^{\star}$. Therefore, the update will be negative, and will start getting close to the optimal values of $w^*$. However, if it’s negative, the update will be positive and will increase the current values of $w$ to converge to the optimal values of $w^{\star}$ (Figure 3):
    Figure 3
    • Continue the process until the cost function converges. That is, until the error curve becomes flat and doesn’t change.
    • In addition, on each iteration, the step would be in the direction that gives the maximum change since it’s perpendicular to level curves at each step.

Now let’s discuss the three variants of gradient descent algorithm. The main difference between them is the amount of data we use when computing the gradients for each learning step. The trade-off between them is the accuracy of the gradient versus the time complexity to perform each parameter’s update (learning step).

Batch Gradient Descent

Batch Gradient Descent is when we sum up over all examples on each iteration when performing the updates to the parameters. Therefore, for each update, we have to sum over all examples:

$$w = w - \alpha \nabla_w J(w)\tag{6}$$

{% codeblock manual gradient update lang:python%} for i in range(num_epochs): grad = compute_gradient(data, params) params = params - learning_rate * grad {% endcodeblock %}

The main advantages:

The main disadvantages:

Mini-Batch Gradient Descent

Instead of going over all examples, Mini-batch Gradient Descent sums up over lower number of examples based on batch size. Therefore, learning happens on each mini-batch of $b$ examples:

$$w = w - \alpha \nabla_w J(x^{\{i:i + b\}}, y^{\{i: i + b\}}; w)\tag{7}$$

{% codeblock manual gradient update minibatch lang:python%} for i in range(num_epochs): np.random.shuffle(data) for batch in radom_minibatches(data, batch_size=32): grad = compute_gradient(batch, params) params = params - learning_rate * grad {% endcodeblock %}

The batch size is something we can tune. It is usually chosen as power of 2 such as 32, 64, 128, 256, 512, etc. The reason behind it is because some hardware such as GPUs achieve better runtime with common batch sizes such as power of 2.

The main advantages:

The main disadvantages:

With large training datasets, we don’t usually need more than 2-10 passes over all training examples (epochs). Note: with batch size $b = m$, we get the Batch Gradient Descent.

Stochastic Gradient Descent

Instead of going through all examples, Stochastic Gradient Descent (SGD) performs the parameters update on each example $(x^i, y^i)$. Therefore, learning happens on every example:

$$w = w - \alpha \nabla_w J(x^i, y^i; w)\tag{7}$$

{% codeblock manual gradient update stochastic lang:python%} for i in range(num_epochs): np.random.shuffle(data) for example in data: grad = compute_gradient(example, params) params = params - learning_rate * grad {% endcodeblock %}

It shares most of the advantages and the disadvantages with mini-batch version. Below are the ones that are specific to SGD:

Below is a graph that shows the gradient descent’s variants and their direction towards the minimum:

Figure 5

As the figure above shows, SGD direction is very noisy compared to mini-batch.

Areas for advancement

Below are some challenges regarding gradient descent algorithm in general as well as its variants - mainly batch and mini-batch:

- The norm of the gradient $g^Tg$ is supposed to decrease slowly with each learning step because the curve is getting flatter and steepness of the curve will decrease. However, we see that the norm of the gradient is increasing, because of the curvature of the curve. Nonetheless, even though the gradients' norm is increasing, we're able to achieve a very low error rates (see figure 8).
- In small dimensions, local minimum is common; however, in large dimensions, saddle points are more common. Saddle point is when the function curves up in some directions and curves down in other directions. In other words, saddle point looks a minimum from one direction and a maximum from other direction (see figure 9). This happens when at least one eigenvalue of the hessian matrix is negative and the rest of eigenvalues are positive.
- As discussed previously, choosing a proper learning rate is hard. Also, for mini-batch gradient descent, we have to adjust the learning rate during the training process to make sure it converges to the local minimum and not wander around it. Figuring out the decay rate of the learning rate is also hard and changes with different datasets. - All parameter updates have the same learning rate; however, we may want to perform larger updates to some parameters that have their directional derivatives more inline with the trajectory towards the minimum than other parameters.