Optimization is of paramount importance in every machine learning model, optimization is the ultimate goal when you're trying to train any machine learning model. Amongst the vast number of algorithms, Gradient Descent is a very popular one, because of its simplicity. Variations of this algorithm are popularly used across mathematics and machine learning associated fields and is the foundation of machine learning as we know today.

Defining Gradient Descent

Gradient descent is a first-order iterative optimization algorithm for finding the minimum of an objective function.

At a theoretical level, gradient descent is an algorithm that minimizes functions. Given a function defined by a set of parameters, gradient descent starts with an initial set of parameter values and iteratively moves toward a set of parameter values that minimize the function. This iterative minimization is achieved using calculus, taking steps in the negative direction of the function gradient.

If it's too much to comprehend, do not worry, we've got it covered. Fundamentally gradient descent is what allows a machine learning model to learn from data. The general idea behind gradient descent is 'instead of the immediately guessing the best solution to given objective function, we try to guess an initial solution an then iteratively do this until we reach a point of minima.

In this post we will try to understand gradient descent by building from the basics, bottom up.

Understanding Linear Regression

What is Linear Regression

Suppose you have a set of points that represent certain values in a two continuous variable relationship, and you wish to draw a line that fits these points, this line can then help us predict the value of one of these variables (usually called dependent variable) based on the value of the other variable (called explanatory variable). Let us consider the following data which is a csv file with 100 entries of 2 variable.

A scatterplot can be a helpful tool in determining the strength of the relationship between two variables. A scatter plot of this data looks something like this

Scatter Plot

A linear regression line has an equation of the form \(Y = w_1X + w_0\), where \(X\) is the explanatory variable and \(Y\) is the dependent variable. The slope of the line is \(w_1\), and \(w_0\) is the intercept (the value of \(y\) when \(x = 0\). To find the best line for our data, we need to find the best set of \(w_0\) and \(w_1\) values. We use the letter w because we think of the coefficients as weights; the value of \(y\) is changed by changing the relative
weight of one term or another. We’ll define w to be the vector \([w_0, w_1]\), and define

\(h_w(x) = w_1x + w_0 \tag{1}\)

The task of finding the \(h_w\) that best fits these data is called linear regression. To fit a line to the data, all
we have to do is find the values of the weights \([w_0, w_1]\), that minimize the empirical loss. So what we now need is a metric of how good the fit is. This metric is called a loss or error, and we use a loss function which describes the loss (or cost) associated with all possible decisions. It is traditional to use the squared loss function, \(L_2\), summed over all the training examples, which can be defined as:

Loss(h_w) & = \sum_{j=0}^N L ( y_j, h_w(x_j) )
& = \sum_{j=0}^N L ( y_j - h_w(x_j) )^2 \
& = \sum_{j=0}^N L ( y_j - (w_1x_j + w_0) )^2

Gauss showed that if the \(y_j\) values have normally distributed noise, then the most likely values of \(w_1\) and \(w_0\) are obtained by minimizing the sum of the squares of the errors.

So we defined our squared loss function as

$$Loss(h_w) = \sum_{j=0}^N L ( y_j - (w_1x_j + w_0) )^2\tag{2}$$

Let's break this down, in essence this equation takes the difference between expected value of the variable and the predicted value, and squares this difference. This squared value becomes our loss, and our target when fitting a line is to minimize this loss.

Minimizing The Loss Function

The fundamental idea which makes calculus useful in understanding problems of maximizing and minimizing things is that at a peak of the graph of a function, or at the bottom of a trough, the tangent is horizontal. That is, the derivative \(f′(x_o)\) is \(0\) at points \(x_o\) at which \(f(x_o)\) is a maximum or a minimum.

A minimum of a function \(f\) is a point that is the lowest in its immediate neighbourhood. If \(f\) is a function of one variable, then \(df/dx\) is zero at a minimum; if \(f\) is a function of more than one variable, then all its partial derivatives are zero at a minimum.

A minimum of a function of two variables corresponds, in a surface plot of f, to the bottom of a depression.

So essentially the sum $$\sum_{j=0}^N L ( y_j - (w_1x_j + w_0) )^2$$ is minimized when its partial derivatives with respect to \(w_o\) and \(w_1\) are zero:

$$\frac {\partial}{ \partial w_0} \sum_{j=0}^N L ( y_j - (w_1x_j + w_0) )^2 = 0\tag{3}$$

$$\frac {\partial}{ \partial w_1} \sum_{j=0}^N L ( y_j - (w_1x_j + w_0) )^2 = 0\tag{4}$$

These equations have a unique solution

$$w_1 = \frac {N(\sum x_jy_j) - (\sum x_j)(\sum y_j)}{N(\sum x_j^2) - (\sum x_j)^2}$$

$$w_0 = \bigl(\sum y_j - w_1(\sum x_j)\bigr)/N$$

In some sense that’s the end of the story for linear models; if we need to fit lines to data, we apply the above equation.

Example of a Weight Space

When plotted as a weight space we see that the loss function is convex, this is true for every linear regression problem with an \(L_2\) loss function, and implies that there are no local minima.

The Hill Climb Search

Blindfold Climb

Imagine a hiker, blind folded trying to reach the top of the hill, for the hiker to reach the target, the only percept he has is the direction of his movement i.e. is he climbing up or down. So the hiker keeps walking till he climbs, if at any step the hiker gets a step that goes down the slope, he can guess that he has reached the top.

But there is a problem with this assumption, hill climb search is simply a loop that continually moves in the direction of increasing value—that is, uphill. It terminates when it reaches a “peak” where no neighbor has a higher value, it does not look ahead beyond the immediate neighbors of the current state. Hill climb is a greedy algorithm, although greedy algorithms can perform really well simply because it can easily eliminate the bad options, but most greedy algorithm can get stuck at a local optimal, a simple analogy from our previous example can be a local maximum i.e. a peak that is higher than each of its neighboring states but lower than the global maximum i.e the actual peak.

Blindfold Climb with Local Maximum

A Random-Restart Hill-Climb can somewhat solve problems associated to the traditional hill-climb. The success of hill climbing depends very much on the shape of the state-space landscape, or in our case the nature of the weight space. The take away from this part is that a hill-climbing algorithm that never makes “downhill” moves toward a neighbouring state with lower value (or higher cost) is guaranteed to be incomplete, because it can get stuck on a local maximum.

Descending The Gradient

Often in general optimization problems, the equation defining minimum loss will have no closed-form solution like we saw when we tried to minimize the loss function. Instead, we will face a general optimization search problem in a continuous weight space, such problems can be tackled using hill climb algorithm that uses the gradient as the guiding compass. We choose any starting point in weight space—here, a point in the (\(w_0\), \(w_1\)) plane—and then move to a neighboring point that is downhill, repeating until we converge on the minimum possible loss. The Pseudocode is as follows

\(w\) ← any point in the parameter space
loop until convergence do
  for each \(w_i\) in \(w\) do
    \(w_i ← w_i − \alpha \frac {\partial}{ \partial w_i} Loss(w)\)

The parameter \(\alpha\), which we called the step size, is usually called the learning rate when we are trying to minimize loss in a learning problem. It can be a fixed constant, or it can decay over time as the learning process proceeds.
The partial derivative of our loss function

$$\frac {\partial}{ \partial w_j} Loss(w) = 2(y − hw(x)) ×\frac {\partial}{ \partial w_j} (y − h_w(x))\tag{5}$$

Working out the partial derivatives of our loss functions (Eqn. 3 and 4) we get:

$$\frac {\partial}{ \partial w_0} Loss(w) = −2(y − h_w(x)) \tag{6}$$

$$\frac {\partial}{ \partial w_1} Loss(w) = −2(y − h_w(x)) × x\tag{7}$$

Plugging eqn. 6 and 7 in our pseudocode above, we get:

$$w_0 ← w_0 + \alpha (y − h_w(x))\tag{8}$$

$$w_1 ← w_1 + \alpha (y − h_w(x)) × x\tag{9}$$

Here the updates make intuitive sense: if \(h_w(x) > y\), i.e., the output of the hypothesis is too large, reduce \(w_0\) a bit, and reduce \(w_1\) if \(x\) was a positive input but increase \(w_1\) if \(x\) was a negative input.

This covers how we update our weight for one data point or training example, for \(N\) data points we can define our update function by minimizing the sum of individual losses for each data point or training example. Since the derivative of the sum of two functions is the sum of the derivatives of the two functions we have the liberty to simply add up the losses which are basically partial derivatives. Therefore for \(N\) data points our update function looks like:

$$w_0 ← w_0 + \alpha \sum_{j}^N(y_j − h_w(x_j))\tag{10}$$

$$w_1 ← w_1 + \alpha \sum_{j}^N(y_j − h_w(x_j)) × x_j\tag{11}$$

These updates constitute the batch gradient descent learning rule for univariate linear regression. Convergence to the unique global minimum is guaranteed (as long as we pick \(\alpha\) small enough) but may be very slow: we have to cycle through all the training data for every step, and there may be many steps. The following python code represents one batch or one iteration of gradient descent.

def stepGradient(b_current, m_current, points, learningRate):
    b_gradient = 0
    m_gradient = 0
    N = float(len(points))
    for i in range(0, len(points)):
        b_gradient += -(2/N) * (points[i].y - ((m_current*points[i].x) + b_current))
        m_gradient += -(2/N) * points[i].x * (points[i].y - ((m_current * points[i].x) + b_current))
    new_b = b_current - (learningRate * b_gradient)
    new_m = m_current - (learningRate * m_gradient)
    return [new_b, new_m]

This following is a visualization of running batch gradient descent on our sample data.

Linear Regression in Action

A sample code for gradient descent can be found here. The output we get when running the example code is

Starting gradient descent at b = 0, m = 0, error = 5565.10783448
After 1000 iterations b = 0.0889365199374, m = 1.47774408519, error = 112.614810116

Well, if you're expecting the end of the article here, you might as well, read it again. We still haven't solved the problem of local minima yet.

Gradient descent is an iterative optimization algorithm for finding a local minimum of differentiable functions. At each iteration, gradient descent operates by moving the current solution in the direction of the negative gradient of the function (the direction of "steepest descent").

In essence gradient descent is kind of a greedy approach. The gradient is nothing but the direction along which the loss function decreases the fastest, and descending this means finding the locally optimal choice at each iteration. For a general loss function there is no guarantee that the optimization algorithm will find a global minimum, unless the loss function is of a specific form, a convex function is an ideal choice because for a convex function there is only one minima, i.e. the global one. So if we try to see the broader picture here, it's the loss function and not actually the algorithm that brings us the problem of local optimum. However this does not mean we cannot tweak our algorithms to increase the chances off getting a global minimum in this case). This is where simulated annealing comes into picture.

Simulated Annealing

Simulated annealing (SA) is a probabilistic technique for approximating the global optimum of a given function, to understand simulated annealing imagine the task of getting a ping-pong ball into the deepest crevice in a bumpy surface. If we just let the ball roll, it will come to rest at a local minimum. If we shake the surface, we can bounce the ball out of the local minimum. The trick is to shake just hard enough to bounce the ball out of local minima but not hard enough to dislodge it from the global minimum.

The simulated-annealing solution is to start by shaking hard (i.e., at a high temperature) and then gradually reduce the intensity of the shaking (i.e., lower the temperature). The underlying basic step of a simulated-annealing algorithm is quite similar to hill climbing, i.e. picking the best move that moves toward the peak. Instead of picking the best move, however, it picks a random move. If the move improves the situation, it is always accepted. Otherwise, the algorithm accepts the move with some probability less than 1. The probability decreases exponentially with the “badness” of the move The probability also decreases as the “temperature” T goes down: “bad” moves are more likely to be allowed at the start when T is high, and they become more unlikely as T decreases. If the schedule lowers T slowly enough, the algorithm will find a global optimum with probability approaching 1.

In essence the noise introduced by the random selection in some sense gives us the freedom to take a wrong step that might lead us the the local optimal instead of global and thereby increasing the chances of getting the global optimum.

The benefit we get from introducing randomness is exploited by a version of gradient descent called as stochastic gradient descent . The stochastic gradient algorithm behaves like a simulated annealing algorithm, where the learning rate of the stochastic gradient is related to the temperature of simulated annealing. The randomness or noise introduced by stochastic gradient allows to escape from local minima to reach a better minimum

Stochastic Gradient Descent

In stochastic gradient descent, we consider only a single training point at a time, taking a step after each one using Equation 5. So essentially stochastic gradient descent computes the gradient using a single sample. Most applications of SGD actually use a minibatch of several samples as performance of stochastic gradient descent is better. The somewhat noisier gradient calculated using the reduced number of samples tends to jerk the model out of local minima into a region that hopefully is more optimal. Single samples are really noisy, while minibatches tend to average a little of the noise out. Thus, the amount of jerk is reduced when using minibatches. A good balance is struck when the minibatch size is small enough to avoid some of the poor local minima, but large enough that it doesn't avoid the global minima or better-performing local minima.

In stochastic gradient descent the parameters are estimated for every observation, as opposed the whole sample in regular gradient descent (batch gradient descent). This is what gives it a lot of randomness. The path of stochastic gradient descent wanders over more places, and thus is more likely to "jump out" of a local minimum, and find a global minimum.

With a fixed learning rate \(\alpha\), however, it does not guarantee convergence; it can oscillate around the minimum without settling down. In some cases, as we see later, a schedule of decreasing learning rates (as in simulated annealing) does guarantee convergence. Even though stochastic gradient descent is more of a solution to the not-so-good performance of gradient descent, it can help us solve the issue of local minima too.


Gradient Descent Example for Linear Regression
Artificial Intelligence: A Modern Approach, Stuart Russell and Peter Norvig
Stochastic Gradient Learning in Neural Networks, L´eon Bottou

More Reading

Convex Optimization, Stephen Boyd and Lieven Vandenberghe
Global descent replaces gradient descent to avoid local minima problem in learning with artificial neural networks