We all know Neural Networks are amazing, pretty good at doing 'human-like' stuff, so let's skip the usual intro and jump to the real stuff.

*Note: In the course of this post, I've made some assumptions and oversimplifications to give a relatively clearer picture of the concepts; all of them are cleared up later in the article as we delve into the details of the subject.*

## Introduction

Artificial Neural Networks are a beautiful, biologically-inspired programming paradigm that enables a computer to learn from observational data. A network consists of computational units that "fire" when a linear combination of their inputs exceeds some threshold. A neural network is actually a huge function of these computational units that takes in some input and gives some output. This huge function can be really complex and can easily have more than a thousand parameters, which are its weights and biases, and it is really good at picking up patterns in the data we feed it. These patterns help the neural network give us wonderful results. Neural networks are loosely based on the structure of the human brain, so let's start from there.

## A Biological Neuron

A neuron is a nerve cell, the basic functional building block of the nervous system.

A neuron has a cell body called the soma, which encloses the nucleus of the cell. Various processes (appendages or protrusions) extend from the cell body. These include many short, branching processes known as dendrites, and a separate process that is typically longer than the dendrites, known as the axon.

The neuron receives information through its dendrites, and can receive thousands of inputs. Depending on whether the input is excitatory or inhibitory, it passes an impulse down the axon. So essentially, a neuron takes in multiple inputs and, based on the nature or 'value' of those inputs, outputs a certain impulse or 'value'. This is the most fundamental kind of processing that takes place in a neuron; a complex network of this action is what makes learning happen.

How we humans learn is something that is not fully known; however, we can take some clues from Hebb's Rule. Donald Hebb was a Canadian psychologist; in his book *The Organization of Behavior* he stated, “When an axon of cell A is near enough cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A’s efficiency, as one of the cells firing B, is increased.” It’s often paraphrased as “Neurons that fire together wire together.”

Hebb's rule speaks about how activity between neurons affects the connection between them, more specifically how to update the weight of a neuronal connection within a neural network. The weight of the connection between neurons is a function of the neuronal activity. It does have some drawbacks as a model, but those are not in the purview of this post.

## An Artificial Neuron

The first formal model of a neuron was proposed by Warren McCulloch and Walter Pitts in 1943. It looked a lot like the logic gates computers are made of. This McCulloch-Pitts neuron switches on when the number of its active inputs passes some threshold. It could simulate both AND and OR gates: if the threshold is one, the neuron acts as an OR gate; if the threshold is equal to the number of inputs, as an AND gate. So essentially, what a computer does can be achieved by a network of these neurons. But what it doesn't do is learn. Learning became possible by introducing variable weights between the neurons, which is exactly what Frank Rosenblatt, a psychologist at Cornell University, did. This led to the invention of the perceptron in 1957.
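To make the idea concrete, here is a minimal sketch of a McCulloch-Pitts unit in Python (the function name and example thresholds are my own, for illustration):

```python
def mcculloch_pitts(inputs, threshold):
    """A McCulloch-Pitts unit: switches on (returns 1) when the
    number of active (1-valued) inputs reaches the threshold."""
    return 1 if sum(inputs) >= threshold else 0

# Threshold 1 makes the unit behave like an OR gate:
print(mcculloch_pitts([0, 1], threshold=1))  # 1
# Threshold equal to the number of inputs makes it an AND gate:
print(mcculloch_pitts([1, 0], threshold=2))  # 0
```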

> The Navy revealed the embryo of an electronic computer today that it expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence.
>
> — The New York Times, 1958

A perceptron receives multiple inputs; with each input there is a certain weight associated, and this weight decides the nature of the connection. A positive weight represents an excitatory connection, and a negative weight an inhibitory one. The perceptron outputs 1 if the weighted sum of its inputs is above a threshold, and 0 if it’s below. By varying the weights and threshold, we can change the function that the perceptron computes. In a diagrammatic representation of a perceptron we can see how each part of this model corresponds to a part of the neuron:

A perceptron is a simple linear classifier; mathematically, it is a function that maps its input \(x\) (a real-valued vector) to an output value \(f(x)\) (a single binary value). This can be represented as:

$$f(x) =
\begin{cases}
1, & \text{if $w\cdot x + b > 0$} \\
0, & \text{otherwise}
\end{cases}$$

where \(w\) represents a vector of real-valued weights, \(w\cdot x\) is the dot product \(\sum_{i=1}^m w_i x_i\) (with \(m\) the number of inputs to the perceptron), and \(b\) is the bias. Seems good so far? Here is where it gets dirty: this perceptron models the learning of only one neuron, while a neural network is a highly complex system. In a multilayered system of neurons, all you can control is the input and all you see is the output. With possibly hundreds of connections, and correspondingly hundreds of weights to deal with, how do you know which connection is responsible for a wrong output, or any output for that matter? In the perceptron model, there’s no clear way to change the weights of the neurons in the “hidden” layers to reduce the errors made by the ones in the output layer, because every hidden neuron influences the output via multiple paths. A system of perceptrons couldn't work, simply because we didn't know how to make it learn. Some inspiration from physics (see: the Hopfield model) and later statistics (see: Boltzmann machines) made things more awesome. (A lot more amazing things happened between then and today, which are out of the scope of this post, but you can read about them here.)
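The decision rule \(f(x)\) above translates almost directly into code. Here is a minimal sketch; the weights and bias are hypothetical, chosen so that this particular perceptron computes an AND gate:

```python
import numpy as np

def perceptron(x, w, b):
    """Fire (return 1) when the weighted sum w.x + b is above zero."""
    return 1 if np.dot(w, x) + b > 0 else 0

# Hypothetical weights and bias that make the perceptron an AND gate
w, b = np.array([1.0, 1.0]), -1.5
print(perceptron([1, 1], w, b))  # 1
print(perceptron([1, 0], w, b))  # 0
```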

It turns out that we can devise learning algorithms which can automatically tune the weights and biases of a network of artificial neurons. This tuning happens in response to external stimuli, without direct intervention by a programmer. We shall explore more about it in the next section.

## Neural Networks!

To understand neural networks, let's assume we have a magic algorithm that can update relations between neurons. So neural networks are models that can learn by analysing a lot of examples. A typical neural network can consist of thousands or even millions of neurons organised in the form of layers of nodes, a neuron from one layer receives input from every neuron from the previous layer. We can label these layers as input layer, hidden layers and output layer.

The reason I say hidden 'layers' and not 'layer' is that there can be more than one of them, and in most complex problems there is. Each layer is, in a way, a non-linear function that recognises certain features: first there is a linear transformation, followed by some squashing which is usually non-linear. This is roughly what happens within a layer. We will go into the specifics of that soon.

To help visualize what a neural network does, let's break it down into its structural functionality and its learning process.

### Structural Functionality

Let's actually build a neural network that can recognise handwritten digits. We are going to use the MNIST database, a popular dataset of handwritten digits with a training set of 60,000 images and a test set of 10,000 images. Each image is 28px by 28px and contains grey levels.

So the network we are going to build will take in the pixel values, which implies our input layer will have 28×28 = 784 neurons, each corresponding to a pixel in the image. Now let's redefine our neuron for the moment: assume a neuron is nothing but a node in the network that holds a number. So of the 784 input neurons, each holds the grey value of its pixel. Jumping to the last layer, it should have 10 neurons, each corresponding to a digit from 0 to 9. (We could use fewer neurons, but for simplicity let's stick to one for each digit.)

So we now have an input and an output layer ready; let's talk about the hidden layers. We shall make one hidden layer with 16 neurons. This is an arbitrary choice, and we could easily work with a different number of neurons; in practice, the design of hidden layers is too large a topic to sum up within this post.

In this particular network the output from one layer is used as the input for the next layer. Such a neural network is called a feedforward neural network. This means that information never loops back into the network, i.e. information is always fed forward, never fed back. Having a feedback mechanism makes things a little more complicated, because the output is then a function not only of the weights and biases but also of previous outputs. It does add some amazing abilities to a network, and we shall cover that in a future blog post.

Now we have a network that looks something like 784 ➔ 16 ➔ 10.

Let's try to understand how our network learns to recognize; we shall take the example of the digit 8. As mentioned previously, each neuron in the input layer corresponds to a pixel in the image. It is simple to think of each neuron as holding a number: the brightness value of its pixel (0 for black pixels and 1 for white pixels). This number is its activation; a value of 1 means the neuron is "activated". So for the time being it's safe to think of a neuron as a node holding a number between 0 and 1.

The way a neural network operates is that activation in one layer affects the activation of the subsequent layer. So when we feed the image of 8 to the input layer, lighting up all its neurons with certain activations, the pattern of activation in this layer causes some specific pattern of activation in the next layer, which then causes some pattern in the layer after that, and so on until the output layer, usually with one particular neuron lit with extremely high brightness. This brightest neuron in the output layer is the network's decision as to what number we fed in. The brighter this neuron, the higher the confidence in that decision.

So let's take recognition to a human level: how do we know a certain digit is 8 or 9 or 2? We decompose the image into 'features' or 'components'. An eight is two loops stacked on top of each other; similarly, a four is three lines, two vertical and one horizontal. Our best guess at what the hidden layers do is that they try to map these features to specific sets of neurons. Just like each neuron in the input layer corresponds to a pixel, there will be specific neurons in the hidden layer that correspond to these components or features. So whenever we feed a 9 to the network, some specific set of neurons in the hidden layer will have activations close to one. These can be neurons that get set off by any loopy pattern, and neurons that are activated by vertical lines, hence generalizing a particular shape in the input image to a pattern the network finds in it. The output can then be known just by knowing which features or components were found in the image.

Maybe for more complex problems like recognising alphabets, we may have more layers, what these additional layers will do is break the components to sub-components. So the input will be used to look for various sub-components, these sub-components will be collated to detect components and finally depending on what components make up a particular alphabet, we can get a good enough recognition.

So these hidden layers essentially work on finding a generalized model that activates certain neurons based on features; in other words, feature extraction.

For this to happen, every neuron must know the importance of every input it gets, i.e. the neuron should have some metric that tells it how relevant a certain input is. This metric is called a weight: every connection has a weight associated with it, and every neuron has its own set of weights for all the connections it receives as input. Suppose we have a neuron that detects a horizontal line; let's try to devise a weight space for this feature. Every pixel that is part of the white dash can have a positive weight (represented in green), and to go a step further, the pixels above and below it can be assigned negative weights.

We have previously discussed that every neuron receives an input from every neuron in the previous layer. Every neuron receives a weighted sum of all the activations from the previous layer.

Consider this example neural network with configuration 5 ➔ 8 ➔ 2.

The neuron \(b_1\) receives input from \(a_1\), \(a_2\), \(a_3\), \(a_4\) and \(a_5\). Each connection has a weight associated with it, say \(w_1\), \(w_2\), \(w_3\), \(w_4\), \(w_5\). The neuron \(b_1\) thus receives a weighted sum of all activations from the first layer. So \(b_1\) for now looks like

$$b_1 = w_1a_1+w_2a_2+w_3a_3+w_4a_4+w_5a_5$$

This becomes the activation of \(b_1\). But remember the assumption we made a while ago, that a neuron is a node holding a number between 0 and 1? The weighted sum of all the activations that \(b_1\) receives is definitely not guaranteed to be between 0 and 1, so we need a function that essentially compresses our sum, or any weighted sum of activations, into that range. This function is called the *activation function* of the neuron. For years we have used an activation function called the *sigmoid*.

It is essentially an 'S'-shaped curve that drops off towards 0 for large negative values and ascends gradually towards 1 for large positive values. Mathematically the sigmoid is expressed as:

$$\sigma(x) = \frac{1}{1+e^{-x}}$$

Sigmoid is simple and straightforward but has a few problems, like the vanishing gradient problem and its output not being zero-centred. Many other activation functions like tanh, ReLU and ELU have replaced sigmoid in newer models, but let's stick with sigmoid for now. After applying this activation function we have a new value for \(b_1\):

$$b_1 = \sigma(w_1a_1+w_2a_2+w_3a_3+w_4a_4+w_5a_5)$$

Now suppose for some reason we wish to shift our activation function left or right. Why would anyone want to do that? The answer is: to fit our data better. Consider the example of a line represented as \(y=mx\); this line passes through the origin. It may not always be the case that this line fits our data correctly, hence we add a \(y\)-intercept to it and the equation becomes \(y=mx+b\). For similar reasons we may want to shift our activation function, say when we want our neuron to activate only when the weighted sum is more than 20 rather than zero. What we can do is add \(-20\) to the weighted sum. This number is our *bias* (read this forum post to understand why we need biases in the context of activation functions). So the weights tell us which pattern we should care about, and the bias tells us how high the weighted sum should be for the neuron to get meaningfully active.

Adding this bias into our function, we get

$$b_1 = \sigma(w_1a_1+w_2a_2+w_3a_3+w_4a_4+w_5a_5 - 20)$$
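Putting the weighted sum, the sigmoid and the bias together, the activation of \(b_1\) can be sketched in a few lines of Python; the activations and weights below are made up for illustration:

```python
import numpy as np

def sigmoid(x):
    # Squashes any real number into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical activations of the five input neurons and their weights
a = np.array([0.9, 0.1, 0.8, 0.4, 0.7])
w = np.array([12.0, -3.0, 9.0, 2.0, 5.0])
bias = -20.0  # only a weighted sum above 20 activates the neuron meaningfully

b1 = sigmoid(np.dot(w, a) + bias)  # weighted sum is 22.0, so b1 = sigmoid(2.0) ≈ 0.88
```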

Now, for a layer of neurons, this can be represented simply as a matrix multiplication: we can organize all the activations from one layer as a column vector, and the weights as a matrix where each row corresponds to the connections between the previous layer and one particular neuron in the next layer. The biases can also be written as a column vector, with each row corresponding to the bias of a particular neuron. For the 5 ➔ 8 transition of our example network, this matrix structure can be represented as

$$
\sigma\Biggl(\begin{bmatrix}
w_{1,1} & w_{1,2} & w_{1,3} & w_{1,4} & w_{1,5} \\
w_{2,1} & w_{2,2} & w_{2,3} & w_{2,4} & w_{2,5} \\
w_{3,1} & w_{3,2} & w_{3,3} & w_{3,4} & w_{3,5} \\
w_{4,1} & w_{4,2} & w_{4,3} & w_{4,4} & w_{4,5} \\
w_{5,1} & w_{5,2} & w_{5,3} & w_{5,4} & w_{5,5} \\
w_{6,1} & w_{6,2} & w_{6,3} & w_{6,4} & w_{6,5} \\
w_{7,1} & w_{7,2} & w_{7,3} & w_{7,4} & w_{7,5} \\
w_{8,1} & w_{8,2} & w_{8,3} & w_{8,4} & w_{8,5} \\
\end{bmatrix}
\begin{bmatrix}
a_1\\a_2\\a_3\\a_4\\a_5
\end{bmatrix} +
\begin{bmatrix}
B_1\\B_2\\B_3\\B_4\\B_5\\B_6\\B_7\\B_8
\end{bmatrix}\Biggr) =
\begin{bmatrix}
b_1\\b_2\\b_3\\b_4\\b_5\\b_6\\b_7\\b_8
\end{bmatrix}
$$

For a general scenario, we can write this as

$$
\sigma\Biggl(\begin{bmatrix}
w_{1,1} & w_{1,2} & \cdots & w_{1,n} \\
w_{2,1} & w_{2,2} & \cdots & w_{2,n} \\
\vdots & \vdots & \ddots & \vdots \\
w_{k,1} & w_{k,2} & \cdots & w_{k,n} \\
\end{bmatrix}
\begin{bmatrix}
a^0_1\\a^0_2\\ \vdots \\a^0_n
\end{bmatrix} +
\begin{bmatrix}
b^1_1\\b^1_2\\ \vdots \\b^1_k
\end{bmatrix}\Biggr) =
\begin{bmatrix}
a^1_1\\a^1_2\\ \vdots \\a^1_k
\end{bmatrix}
$$

Here \(a^0\) represents the neurons of the first layer and \(a^1\) the neurons of the next layer, with \(n\) and \(k\) being the number of neurons in these layers respectively. \(b^1\) represents the biases associated with the neurons of the second layer. The forward transition from a layer \(N-1\) to the next layer \(N\) can then be represented as one neat expression:

$$a^N=\sigma(Wa^{N-1} + b^N)$$

With this, it becomes incredibly easy to write code for neural networks; all we need is a good enough mathematical library that can perform matrix multiplication. But what about the learning? There should be some mechanism for us to adjust the weights. Our example has 784 neurons in the input layer, 16 in the hidden layer and 10 in the output layer; counting all the weights and biases, that is 12,730 parameters to play with. The next section covers how we tune these 12,730 parameters, or rather how the network tunes them.
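For illustration, the whole forward pass \(a^N=\sigma(Wa^{N-1} + b^N)\) fits in a few lines of NumPy; the random initialisation below is just a stand-in for a trained network:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def feedforward(a, weights, biases):
    """Apply a = sigma(W a + b) once per layer."""
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)
    return a

# Randomly initialised parameters for the 784 -> 16 -> 10 network
rng = np.random.default_rng(0)
weights = [rng.standard_normal((16, 784)), rng.standard_normal((10, 16))]
biases = [rng.standard_normal(16), rng.standard_normal(10)]

output = feedforward(rng.random(784), weights, biases)  # 10 activations in (0, 1)
```

Counting every entry of the weight matrices and bias vectors recovers the 12,730 parameters mentioned above.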

### Learning Process

So far we have understood what activations, weights and biases are and why they matter. Now comes the part where we see how exactly those parameters are tuned to get the correct output.

The output layer has 10 neurons, hence the output we receive will be a vector of 10 numbers with values ranging from 0 to 1. For the digit 4, the output we expect is:

$$[0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]$$

Now let's assume we have started with random weights and biases. When we feed in an image for our network to identify, we will get a very poor output, which will obviously look nothing like what we expected. Maybe something like:

$$[0.53, 0.72, 0.52, 0.73, 0.56, 0.23, 0.14, 0.01, 0.43, 0.35]$$

There is a certain degree of error, or loss, or cost associated with this output. Since these are all numbers, it is pretty simple to quantify it. We calculate the loss with a class of functions called loss functions: given a prediction and a label, a loss function measures the discrepancy between the algorithm's prediction and the desired output. A very popular loss function is the \(L_2\) loss, or squared loss, given as the sum of the squared differences between the predicted values and the expected values:

$$Loss(P,L) = \sum_{j=0}^{N-1} (P_j - L_j)^2$$
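As a quick sanity check, the squared loss for the two vectors above can be computed directly; a minimal sketch:

```python
import numpy as np

def squared_loss(prediction, label):
    """Sum of squared differences between prediction and expected output."""
    return float(np.sum((np.asarray(prediction) - np.asarray(label)) ** 2))

label = [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0]   # expected output for a 4
prediction = [0.53, 0.72, 0.52, 0.73, 0.56, 0.23, 0.14, 0.01, 0.43, 0.35]

print(squared_loss(prediction, label))  # ≈ 2.18, a bad guess
print(squared_loss(label, label))       # 0.0, a perfect guess
```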

This sum will be large for bad guesses and small for good ones. Now all we have to do is figure out a way to make this sum smaller; in other words, we try to *minimise* this function. Does this ring a bell? If you are thinking calculus, you are right!

These buzzwords and all the hype make neural networks sound complicated, but once you try to understand them better, you'll realise it all just boils down to calculus.

In high-school calculus, if you want to minimise a function, you equate its first-order derivative to zero, solve for the variables of the function, and you are good to go; those values are the parameter values with minimum error or loss. The only issue here is that it's not a simple linear function we are dealing with; it is a function with over 12,000 parameters, so the approach of equating the first-order derivative to zero isn't going to work. Instead, we do something called *gradient descent* (I would strongly recommend the blog post Understanding Gradient Descent for a comprehensive explanation). Essentially, we start at a point in the weight space (a surface plot of weights against the loss), find the derivative of the loss with respect to each of the network's weights, and then adjust the weights in the direction of the negative slope. In other words, we figure out in which direction the loss function steeps downward the most (with respect to changing the parameters) and step slightly in that direction. The gradient of a function gives us the direction of steepest increase, i.e. which way the function increases; naturally, going in the opposite direction gives us the direction in which the function decreases fastest. So we take the step which brings the largest decrease in the loss function. Changing the weights and biases by repeatedly stepping towards the minimum, i.e. setting them to the parameter values at the end point of each step, starts fitting our output to the examples. For millions of examples, computing this step over every example at once is too expensive computationally, so in practice we do it over a batch of training examples at a time; this variant is known as *mini-batch gradient descent*.

It's extremely difficult to comprehend this for a 12,730-dimensional surface, so a non-spatial way to see it is to view the weights as a vector \(\vec W\), and the negative gradient of the cost with respect to those weights as \(-\nabla C(\vec W)\). Each value of this vector tells us how much to nudge the corresponding weight: a negative value implies decreasing it and a positive value implies increasing it. Also, the relative magnitudes of these components tell us which changes will make a larger difference to the cost function.
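The update rule itself is tiny. Here is a sketch of gradient descent on a toy loss with a known gradient (the learning rate and step count are arbitrary choices):

```python
import numpy as np

def gradient_descent(grad, w, lr=0.1, steps=100):
    """Repeatedly nudge the parameters against the gradient."""
    for _ in range(steps):
        w = w - lr * grad(w)
    return w

# Toy loss L(w) = sum((w - 3)^2); its gradient is 2 * (w - 3), minimum at w = 3
w = gradient_descent(lambda w: 2 * (w - 3), np.array([0.0, 10.0]))
print(w)  # both components converge to ~3.0
```

Real networks differ only in how the gradient is obtained, which is what backpropagation (covered below) is for.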

### Little Confessions

I have to confess something: when I said every neuron picks up a certain pattern, I actually lied. If you actually look at the weighted input a neuron receives, what you see will be pretty much random. And not just for one neuron, but for every neuron you look at.

You may find some very loose patterns, but most of the time you won't be able to make any sense of them. And it shouldn't be much of a surprise: every neuron has found some local minimum for itself, and all these neurons together factor into the output our network gives. This is why we need to look into every weight associated with every neuron when correcting for the loss; all the neurons are part of the decision-making process.

I also have another dirty little secret: this is not technically *the* neural network, rather one kind of neural network. It is a multi-layer perceptron, which is quite a primitive, rudimentary sort of architecture, yet a fundamental part of the domain. 'Neural network' is more of an umbrella term for putting all the magic tricks under one name.

Why did I do this? Well, for the most part I want to keep things as simple as possible. We will definitely dig deeper in future blog posts, but for now it is better to keep things simple so that the concepts are easier to understand clearly.

Let's get back to the learning.

### Learning Process - Continued

So we have some idea about how to calculate the cost or loss of a prediction and how to minimise it; the way we actually do that in neural networks is by backpropagation. The backpropagation algorithm is used to learn the weights of a multilayer neural network: it performs gradient descent to minimise the cost function between the network's output values and the given target values, and it allows us to use the chain rule of differentiation to calculate the loss gradient for any parameter in the network. For the most part you need not know exactly how backpropagation works; knowing what it does is more important. But treating it as a black box without knowing the underlying math is not a good approach if you are serious about the subject.

To understand backpropagation, let's consider a toy network of three neurons.

The output neuron is \(a^{(L)}\), the neuron before it is \(a^{(L-1)}\), and so on. Here the desired output (say \(y\)) of the network is 1; however, we get the output 0.66. The cost can therefore be given as follows:

$$ C_0 = (a^{(L)} - y)^2$$

Also, \(a^{(L)} = \sigma(w^{(L)}a^{(L-1)}+b^{(L)})\). To simplify, we can represent the argument of the sigmoid function as \(z^{(L)}\); we therefore get $$a^{(L)} = \sigma(z^{(L)})$$

A tiny change in the weight \(w^{(L)}\) can be written as \(\partial w^{(L)}\). Now we need to figure out how much this change affects the cost \(C_0\), i.e. we need to find the derivative of \(C_0\) with respect to \(w^{(L)}\). For this, let's see how these functions affect each other.

If we try to map out the parameters as to which parameters have an influence on which for this case we can make a tree of some sort.

We see that a small change in \(W^{(L)}\) causes a small change in \(z^{(L)}\), which in turn causes a small change in \(a^{(L)}\) and finally on the Cost \(C_0\). So we break things up as follows:

$$\frac{\partial C_0}{\partial w^{(L)}} =\frac{\partial z^{(L)}}{\partial w^{(L)}} \frac{\partial a^{(L)}}{\partial z^{(L)}} \frac{\partial C_0}{\partial a^{(L)}}$$

Applying the chain rule and multiplying these ratios gives us the sensitivity of \(C_0\) to small changes in \(w^{(L)}\).

Now let's compute each derivative:

$$\frac{\partial C_0}{\partial a^{(L)}} = 2(a^{(L)} - y)$$ $$\frac{\partial a^{(L)}}{\partial z^{(L)}} = \sigma ′(z^{(L)}) = \sigma(z^{(L)})\bigl(1-\sigma(z^{(L)})\bigr)$$ $$\frac{\partial z^{(L)}}{\partial w^{(L)}} = a^{(L-1)}$$

So our final equation turns out to be

$$\frac{\partial C_0}{\partial w^{(L)}} = a^{(L-1)}\sigma ′(z^{(L)})2(a^{(L)} - y)$$

Using the same technique, we can find the sensitivity of the cost with respect to the bias as well as the activation of the neurons it receives input from.

$$\frac{\partial C_0}{\partial b^{(L)}} = 1\times\sigma ′(z^{(L)})2(a^{(L)} - y)$$ $$\frac{\partial C_0}{\partial a^{(L-1)}} = w^{(L)}\sigma ′(z^{(L)})2(a^{(L)} - y)$$
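These three derivatives can be checked with a few lines of code. Below is a sketch for the toy chain, with made-up values for \(a^{(L-1)}\), \(w^{(L)}\) and \(b^{(L)}\):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical values for the last link of the chain; desired output y = 1
a_prev, w, b, y = 0.8, 1.2, -0.3, 1.0

z = w * a_prev + b            # z^(L)
a = sigmoid(z)                # a^(L), the network's output
cost = (a - y) ** 2           # C_0

# The chain rule, term by term, exactly as derived above
dC_da = 2 * (a - y)
da_dz = sigmoid(z) * (1 - sigmoid(z))
dC_dw = a_prev * da_dz * dC_da        # sensitivity of the cost to the weight
dC_db = 1 * da_dz * dC_da             # ... to the bias
dC_da_prev = w * da_dz * dC_da        # ... to the previous activation
```

Nudging `w` by a tiny amount and re-computing `cost` numerically gives the same value as `dC_dw`, which is a handy way to convince yourself the algebra is right.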

This is what backpropagation is in its most granular form. For multiple training examples, it just involves averaging the cost gradients over all the examples:

$$\frac{\partial C}{\partial w^{(L)}} = \frac{1}{n}\sum_{k=0}^{n-1}\frac{\partial C_k}{\partial w^{(L)}}$$

And this is one component of the gradient vector \(\nabla C\). Now we can keep iterating the same idea backwards to the other neurons in the network and see how much the cost changes with respect to the parameters in previous layers. This all seems pretty simple for our toy network, and perhaps surprisingly, it's not much more complex for real networks. So let's try to find a generalized form of backpropagation for a wider network.

Here the cost function will be a sum of the errors of both the output neurons.

$$ C_0=\sum_{j=0}^{n_L-1} (a_j^{(L)} - y_j)^2$$

The weight between the neurons \(a_j^{(L)}\) and \(a_k^{(L-1)}\) can be given as \(w_{jk}^{(L)}\). With this we can find the value of \(z_j^{(L)}\) to be

$$z_j^{(L)} = w_{j0}^{(L)}a_0^{(L-1)} + w_{j1}^{(L)}a_1^{(L-1)} + w_{j2}^{(L)}a_2^{(L-1)} + b_{j}^{(L)}$$

Generalizing this we get

$$z_j^{(L)} = \sum_{k} w_{jk}^{(L)}a_k^{(L-1)} + b_{j}^{(L)}$$

and

$$\frac{\partial C_0}{\partial w_{jk}^{(L)}} =\frac{\partial z_{j}^{(L)}}{\partial w_{jk}^{(L)}} \frac{\partial a_{j}^{(L)}}{\partial z_{j}^{(L)}} \frac{\partial C_0}{\partial a_{j}^{(L)}}$$

For the activations of the previous layer, however, we need a sum:

$$\frac{\partial C_0}{\partial a_{k}^{(L-1)}} =\sum_{j=0}^{n_L-1}\frac{\partial z_{j}^{(L)}}{\partial a_{k}^{(L-1)}} \frac{\partial a_{j}^{(L)}}{\partial z_{j}^{(L)}} \frac{\partial C_0}{\partial a_{j}^{(L)}}$$

This sum runs over all the neurons of layer \(L\), because a neuron in layer \(L-1\) affects every neuron in layer \(L\), so we add those contributions up. Once we know how sensitive the cost function is to the activations in the second-to-last layer, we can just repeat the process for all the weights and biases feeding into that layer. Each of these derivatives is a component of \(\nabla C\), and together they let us descend the gradient and find a minimum by repeatedly stepping down the hill.

The backpropagation equations provide us with a way of computing the gradient of the cost function. Let's explicitly write this out in the form of an algorithm:

- Input \(x\): Set the corresponding activations \(a^1\) for the input layer.
- Feedforward: For each layer \(l=2,3,\ldots,L\) compute \(z^l=w^l a^{l-1}+b^l\) and \(a^l=\sigma(z^l)\)
- Output error: Compute the cost \(C\) and the error of the output layer, \(\frac{\partial C_0}{\partial a^{(L)}} = 2(a^{(L)}-y)\)
- Backpropagate the error: For each \(l=L,L-1,\ldots,2\) compute the sensitivity of the cost to that layer's weights, e.g. for the last layer $$\frac{\partial C_0}{\partial w^{(L)}} = a^{(L-1)}\sigma ′(z^{(L)})2(a^{(L)} - y)$$
- Output: The gradient of the cost function, with components \(\frac{\partial C_0}{\partial w_{jk}^{(l)}} =\frac{\partial z_{j}^{(l)}}{\partial w_{jk}^{(l)}} \frac{\partial a_{j}^{(l)}}{\partial z_{j}^{(l)}} \frac{\partial C_0}{\partial a_{j}^{(l)}}\) for every layer
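The whole algorithm can be sketched as one function. This is a simplified, single-example version (the network shapes and variable names are my own, and real implementations batch and vectorise this):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop(x, y, weights, biases):
    """Gradients of the squared loss w.r.t. every weight matrix and bias vector."""
    # Feedforward, remembering every z and activation along the way
    a, activations, zs = x, [x], []
    for W, b in zip(weights, biases):
        z = W @ a + b
        zs.append(z)
        a = sigmoid(z)
        activations.append(a)
    # Backward pass: delta holds dC/dz for the current layer
    grads_W, grads_b = [], []
    delta = 2 * (activations[-1] - y) * sigmoid(zs[-1]) * (1 - sigmoid(zs[-1]))
    for l in range(len(weights) - 1, -1, -1):
        grads_W.insert(0, np.outer(delta, activations[l]))
        grads_b.insert(0, delta)
        if l > 0:  # push the error one layer back via the chain rule
            delta = (weights[l].T @ delta) * sigmoid(zs[l - 1]) * (1 - sigmoid(zs[l - 1]))
    return grads_W, grads_b
```

Subtracting a small multiple of these gradients from the weights and biases is exactly the gradient descent step from the previous section.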

And this my friends is a neural network.

This article was inspired by the Deep Learning Series of 3Blue1Brown.

#### References

- Neural Networks and Deep Learning
- A Step by Step Backpropagation Example
- CS 224D: Deep Learning for NLP, Course Instructor: Richard Socher
- Deep Learning Glossary
- Book: The Master Algorithm
- cs.stanford.edu

#### More Reading

- How to choose the number of hidden layers
- How Neural Networks are trained
- Derivation of Backpropagation
- Understanding Gradient Descent