## Series Introduction

This post is part of "A Guide To TensorFlow", a series in which we explore the TensorFlow API and use it to build machine learning models for real-life examples. Up until now we've learnt how to build simple machine learning models using TensorFlow. This guide takes it to the next level: here we will code and run our own neural network, a multilayer perceptron.
Check out the other parts of the series: Part 1, Part 2, Part 3, Part 4, Part 5 and Part 6.

## Pretext

Up till now we have built a linear regression model and a logistic regression model. Both of these are actually a single neuron doing the following two things:

• Calculating a weighted sum of the input features and the bias, essentially performing a linear combination.
• Then applying an activation (or transfer) function to calculate the output: the identity function for linear regression and the sigmoid for logistic regression.
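
The two steps above can be sketched in plain Python. This is only an illustration: the function names and the example values are assumptions made for this sketch, not code from the earlier parts of the series.

```python
import math

def neuron(features, weights, bias, activation):
    """Weighted sum of the inputs plus the bias, passed through an activation."""
    weighted_sum = sum(w * x for w, x in zip(weights, features)) + bias
    return activation(weighted_sum)

def identity(z):
    # Activation used by linear regression: output the weighted sum unchanged
    return z

def sigmoid(z):
    # Activation used by logistic regression: squash the sum into (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical example values, chosen only for illustration
features = [2.0, 3.0]
weights = [0.5, -1.0]
bias = 2.0

linear_output = neuron(features, weights, bias, identity)    # 0.5*2 - 1*3 + 2 = 0.0
logistic_output = neuron(features, weights, bias, sigmoid)   # sigmoid(0.0) = 0.5
print(linear_output, logistic_output)
```

The only difference between the two models in this sketch is the activation function passed to the same neuron.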

In this guide we will build an artificial neural network using TensorFlow. Before we begin, I strongly recommend reading my article, Understanding Neural Networks, where you will find a comprehensive explanation of what neural networks are and a very intuitive explanation of how they work.

## Motivation: Linear Separability Problem

Let's try to build a neural network that learns to model an XOR (eXclusive OR) operation. An XOR operation returns 1 when exactly one of its inputs equals 1, and 0 otherwise. The truth table for XOR looks like the following:

| Input 1 | Input 2 | XOR Output |
|---------|---------|------------|
| 0       | 0       | 0          |
| 0       | 1       | 1          |
| 1       | 0       | 1          |
| 1       | 1       | 0          |

With this we face a problem, called the linear separability problem, that makes a single sigmoid neuron unfit for modeling this operation.

To understand the linear separability problem, consider two-input patterns $$(X_1,X_2)$$ being classified into two classes. In the figure below, each point, drawn with the symbol $$x$$ or $$o$$, represents a pattern with a set of values $$(X_1,X_2)$$, and the symbol indicates which of the two classes the pattern belongs to. Notice that these classes can be separated with a single line $$L$$; such patterns are known as linearly separable. More generally, linear separability means that classes of patterns described by an $$n$$-dimensional vector $${\bf x} = (x_1, x_2, ... , x_n)$$ can be separated with a single decision surface. In the case above, the line $$L$$ represents the decision surface.
Our linear/logistic regression model, essentially a single-layer perceptron network, is able to categorize a set of patterns into two classes because the linear threshold function defines their linear separability. Conversely, the two classes must be linearly separable in order for the perceptron network to function correctly.

XOR is a classic example of a linearly inseparable pattern. The figure below illustrates the XOR function: the two classes, 0 for the black dots and 1 for the white dots, cannot be separated with a single line. The patterns of $$(X_1,X_2)$$ can, however, be classified with two lines $$L_1$$ and $$L_2$$. This problem actually resulted in neural network research losing importance for about a decade around the 1970s. It was fixed by intercalating more neurons between the input and the output of the network, i.e. by adding a hidden layer of neurons between the input and output layers. One can think of it as allowing our network to ask multiple questions of the input data, one question per neuron in the hidden layer, and finally deciding the output based on the answers to those questions, thus allowing the network to draw more than one separation line.
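
To make the "two lines" idea concrete, here is a minimal sketch of a network with one hidden layer of two threshold neurons that computes XOR. The weights and thresholds are hand-picked for illustration (an assumption of this sketch); a trained network would learn its own values, and would not necessarily arrive at these.

```python
def step(z):
    """Threshold activation: fires (1) when the weighted sum is positive."""
    return 1 if z > 0 else 0

def xor_network(x1, x2):
    # Hidden neuron 1 draws the first line: fires when at least one input is 1 (OR)
    h1 = step(x1 + x2 - 0.5)
    # Hidden neuron 2 draws the second line: fires only when both inputs are 1 (AND)
    h2 = step(x1 + x2 - 1.5)
    # Output neuron combines the two answers: fires when h1 is on and h2 is off
    return step(h1 - h2 - 0.5)

truth_table = [(x1, x2, xor_network(x1, x2)) for x1 in (0, 1) for x2 in (0, 1)]
for row in truth_table:
    print(row)
```

Each hidden neuron answers one "question" about the input, and the output neuron decides based on both answers, which is exactly what a single neuron cannot do on its own.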

## The Dataset

We are going to use the MNIST dataset of handwritten digits. It has a training set of 60,000 images and a test set of 10,000 images. Each image is 28px by 28px and contains gray levels. It is a good database for people who want to try learning techniques and pattern recognition methods on real-world data while spending minimal effort on preprocessing and formatting, and it comes along with the tensorflow package for testing purposes.

## The Code

We are going to use name scopes to group operations so that the code is easier to understand; this is going to be especially useful when visualizing in TensorBoard.

# Importing Libraries and Dataset

# Training Parameters

# Network Parameters

# Defining Input and Target Placeholders

# Defining Ops in Hidden Layer 1
with tf.name_scope("Hidden_Layer_1") as scope:
    ...

# Defining Ops in Hidden Layer 2
with tf.name_scope("Hidden_Layer_2") as scope:
    ...

# Defining Ops in Output Layer
with tf.name_scope("Output_Layer") as scope:
    ...

# Defining the Optimizer Fn
with tf.name_scope("Optimization_Block") as scope:
    ...

# Variable Initialization
init = tf.global_variables_initializer()

# Defining Session to run training and logging values for TensorBoard visualization
with tf.Session() as sess:
    ...


### Libraries, Dataset and Parameters

The MNIST dataset is available in tensorflow.examples and can be easily accessed as follows:

import tensorflow as tf

# Importing MNIST data
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("/tmp/data/", one_hot=True)


Next we are going to declare the training and network parameters. We are building a multilayer perceptron with 2 hidden layers, making 4 layers in total when the input and output layers are counted. With a batch size of 100, we are going to run 15 epochs. The input layer has 784 neurons, one for each pixel, each taking in the gray level value of that pixel; the two hidden layers have 256 neurons each; and the output layer has 10 neurons, each corresponding to a digit from 0 to 9.

# Training Parameters
learning_rate = 0.001
training_epochs = 15
batch_size = 100
display_step = 1

# Network Parameters
n_hidden_1 = 256  # 1st hidden layer number of neurons
n_hidden_2 = 256  # 2nd hidden layer number of neurons
n_input = 784  # MNIST data input (img shape: 28*28)
n_classes = 10  # MNIST total classes (0-9 digits)
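
As a quick sanity check on these sizes, the number of trainable values in such a network can be worked out with a few lines of arithmetic. This is a sketch that assumes every layer is fully connected with one bias per neuron, which matches the layers defined later in this guide; the helper name `dense_parameters` is made up for this example.

```python
n_input, n_hidden_1, n_hidden_2, n_classes = 784, 256, 256, 10

def dense_parameters(n_in, n_out):
    """One weight per connection plus one bias per neuron."""
    return n_in * n_out + n_out

layer_sizes = [n_input, n_hidden_1, n_hidden_2, n_classes]
per_layer = [dense_parameters(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)

print(per_layer)  # [200960, 65792, 2570]
print(total)      # 269322
```

Most of the parameters sit in the first hidden layer, simply because it is connected to all 784 input pixels.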


### Input and Target Placeholders

We use tf.name_scope to separate each logical block in the code; this convention will be followed throughout this guide.
We define the input and target placeholders (explained in Part 3) with a float datatype and a dynamic shape, as shown below:

with tf.name_scope("Input") as scope:
    inputs = tf.placeholder("float", [None, n_input])
with tf.name_scope("Target") as scope:
    targets = tf.placeholder("float", [None, n_classes])
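
The `None` in each shape stands for the batch dimension, which is only known when data is fed in. The following plain-Python sketch shows what a batch matching these two shapes looks like. The helpers `flatten` and `one_hot`, the blank images and the labels are all hypothetical, made up for illustration.

```python
n_input, n_classes = 784, 10

def flatten(image):
    """Turn a 28x28 grid of gray levels into a flat list of 784 values."""
    return [pixel for row in image for pixel in row]

def one_hot(digit, n_classes=10):
    """Encode a digit label as a vector with a single 1 at index `digit`."""
    return [1.0 if i == digit else 0.0 for i in range(n_classes)]

# Hypothetical batch of 3 blank images with made-up labels
images = [[[0.0] * 28 for _ in range(28)] for _ in range(3)]
labels = [3, 0, 9]

batch_x = [flatten(img) for img in images]   # 3 rows of 784 values, like `inputs`
batch_y = [one_hot(d) for d in labels]       # 3 rows of 10 values, like `targets`
print(len(batch_x), len(batch_x[0]), len(batch_y), len(batch_y[0]))
```

A batch of 3 or of 100 examples fits the same placeholders, since only the second dimension is fixed.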


### Defining Hidden Layers

The first hidden layer has 256 neurons, each receiving connections from the 784 input neurons, therefore the shape of the weight matrix will be 784×256. Each neuron in this layer has one bias, giving a bias vector of length 256.
Each tensor is initialized with normally distributed random values.
We use tf.summary.histogram to log the weights and biases at each step. We use the relu activation function for each hidden layer.

with tf.name_scope("Hidden_Layer_1") as scope:
    # Weight matrix (784x256) and bias vector (256), randomly initialized
    weight1 = tf.Variable(tf.random_normal(
        [n_input, n_hidden_1]), name="Weights1")
    biases1 = tf.Variable(tf.random_normal([n_hidden_1]), name="Biases")

    tf.summary.histogram("weight_1", weight1)
    tf.summary.histogram("biases_1", biases1)

    # Weighted sum of the inputs plus bias, followed by the relu activation
    h1Layer = tf.add(tf.matmul(inputs, weight1), biases1)
    h1Layer = tf.nn.relu(h1Layer, name='h1Activation')

    tf.summary.histogram("relu_1", h1Layer)


Similarly, we can define the Ops for the weights and biases of the second hidden layer:

with tf.name_scope("Hidden_Layer_2") as scope:
    # Weight matrix (256x256) and bias vector (256), randomly initialized
    weight2 = tf.Variable(tf.random_normal(
        [n_hidden_1, n_hidden_2]), name="Weights2")
    biases2 = tf.Variable(tf.random_normal([n_hidden_2]), name="Biases")

    tf.summary.histogram("weight_2", weight2)
    tf.summary.histogram("biases_2", biases2)

    # Takes the activations of the first hidden layer as its input
    h2Layer = tf.add(tf.matmul(h1Layer, weight2), biases2)
    h2Layer = tf.nn.relu(h2Layer, name='h2Activation')

    tf.summary.histogram("relu_2", h2Layer)
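
To see what the matmul, add and relu steps compute, here is a small plain-Python sketch of one fully connected layer applied to a single example. It is an illustration only: the tiny 3-input, 2-neuron layer and its values are made up, and the helper names `dense` and `relu` are not part of TensorFlow.

```python
def relu(values):
    """Keep positive values, replace negative ones with 0."""
    return [max(0.0, v) for v in values]

def dense(inputs, weights, biases):
    """Compute inputs @ weights + biases for one example.

    `weights` has one row per input and one column per neuron.
    """
    return [
        sum(x * row[j] for x, row in zip(inputs, weights)) + biases[j]
        for j in range(len(biases))
    ]

# Hypothetical tiny layer: 3 inputs feeding 2 neurons
inputs = [1.0, 2.0, 3.0]
weights = [[1.0, -1.0],
           [0.0, 1.0],
           [1.0, -2.0]]
biases = [0.5, 1.0]

pre_activation = dense(inputs, weights, biases)   # [4.5, -4.0]
activation = relu(pre_activation)                 # [4.5, 0.0]
print(pre_activation, activation)
```

The hidden layers in this guide do the same thing on a larger scale, with 784 or 256 inputs feeding 256 neurons, for a whole batch at once.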


### Output Layer

Similar to the hidden layers, we will define the weight and bias tensors and the output function:

with tf.name_scope("Output_Layer") as scope:
    # Weight matrix (256x10) and bias vector (10), randomly initialized
    weight3 = tf.Variable(tf.random_normal(
        [n_hidden_2, n_classes]), name="Weights3")
    biases3 = tf.Variable(tf.random_normal([n_classes]))

    tf.summary.histogram("weight_3", weight3)
    tf.summary.histogram("biases_3", biases3)

    # Unscaled outputs (logits); no activation is applied here
    output = tf.add(tf.matmul(h2Layer, weight3), biases3)


Notice that we are not using any activation function for this layer; this is because we are using softmax cross entropy with logits. "Logits" simply means that the function operates on the unscaled output of earlier layers. tf.nn.softmax_cross_entropy_with_logits applies the softmax function to the logits and then computes the cross entropy between the result and the targets. The softmax function "squishes" its inputs so that the outputs are all positive and sum to 1; it's a way of normalizing. The output of softmax can hence be interpreted as probabilities.
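
The following plain-Python sketch shows the two steps for a single example. It is a simplified illustration of the math rather than TensorFlow's actual implementation, and the logits and target are made-up values.

```python
import math

def softmax(logits):
    """Exponentiate and normalize so the outputs are positive and sum to 1."""
    largest = max(logits)
    exps = [math.exp(z - largest) for z in logits]   # subtracting the max avoids overflow
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probabilities, one_hot_target):
    """Negative log-probability assigned to the correct class."""
    return -sum(t * math.log(p) for t, p in zip(one_hot_target, probabilities))

logits = [2.0, 1.0, 0.1]   # hypothetical unscaled outputs for 3 classes
target = [1.0, 0.0, 0.0]   # the first class is the correct one

probabilities = softmax(logits)
loss = cross_entropy(probabilities, target)
print(probabilities, sum(probabilities), loss)
```

The larger the probability assigned to the correct class, the closer the loss gets to 0, which is what training tries to achieve.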

### Optimization Block

Here we define our cost function, which is softmax cross entropy with logits as explained above. We then use tf.reduce_mean to calculate the mean value of the loss over all the examples in the batch. The next thing we are doing is defining our optimizer. Here we will be using a very popular algorithm called the Adam optimizer, with the learning rate declared earlier. How it actually works is not in the scope of this guide, however Jason from machinelearningmastery.com has a great article on it if you're interested in reading about it.

with tf.name_scope("Optimization_Block") as scope:
    cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(
        logits=output, labels=targets))
    tf.summary.scalar("cost", cost)

    # Adam optimizer minimizing the cost; this is the `optimizer` op run during training
    optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(cost)



### Training Block

The first step is to start a session. We then initialize all the variables by passing tf.global_variables_initializer() to sess.run():

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())


Now we merge all the summaries we have defined using tf.summary.merge_all() and assign the result to a variable; we will run this op to log the weights and biases at each training step. The next step is to create a FileWriter instance.

    summaryMerged = tf.summary.merge_all()
    writer = tf.summary.FileWriter("/tmp/mnist_mlp", graph=tf.get_default_graph())


Now for each epoch we define an avg_cost variable (initially 0) and total_batch, which holds the number of batches per epoch.

    for epoch in range(training_epochs):
        avg_cost = 0.
        total_batch = int(mnist.train.num_examples/batch_size)


Now we will loop over each batch, using the built-in train.next_batch function to access the next batch:

        for i in range(total_batch):
            batch_x, batch_y = mnist.train.next_batch(batch_size)


Now we run the optimization op for back-propagation, the cost op to get the loss value, and the summaryMerged op to capture the weights and biases for this training step in a variable called summary.
We can then use the writer.add_summary function to log the summary, indexed by the overall training step:

            _, c, summary = sess.run([optimizer, cost, summaryMerged], feed_dict={inputs: batch_x,
                                                                                  targets: batch_y})
            writer.add_summary(summary, epoch * total_batch + i)


The average cost is accumulated over the batches as follows, and then we print it at the end of each epoch:

            avg_cost += c / total_batch
        # Display logs per epoch step
        if epoch % display_step == 0:
            print("Epoch:", '%04d' % (epoch+1), "cost={:.9f}".format(avg_cost))
    print("Optimization Finished!")
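
The bookkeeping in this loop can be checked in isolation. The sketch below uses made-up per-batch losses (an assumption for illustration) to show that adding c / total_batch for every batch yields the mean cost of the epoch, and that epoch * total_batch + i gives one consecutive step number per batch.

```python
# Hypothetical per-batch losses for one epoch of 4 batches
batch_costs = [2.0, 1.0, 0.5, 0.5]
total_batch = len(batch_costs)
epoch = 2   # the third epoch, since counting starts at 0

avg_cost = 0.0
steps = []
for i, c in enumerate(batch_costs):
    avg_cost += c / total_batch            # same accumulation as in the training loop
    steps.append(epoch * total_batch + i)  # step index passed to writer.add_summary

print(avg_cost)   # 1.0, the mean of the batch costs
print(steps)      # [8, 9, 10, 11]
```

Because the step index keeps increasing across epochs, the summaries form one continuous timeline over the whole training run.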


Now once the training is done, we test our network for accuracy.

    pred = output
    correct_prediction = tf.equal(tf.argmax(pred, 1), tf.argmax(targets, 1))

    # Calculate accuracy
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, "float"))

    print("Accuracy:", accuracy.eval(
        {inputs: mnist.test.images, targets: mnist.test.labels}))


We assign the output values of the last layer to a variable pred. We then define an accuracy op and evaluate it with accuracy.eval(), which runs the op in the current session with the test images and labels as inputs.
accuracy is the reduced mean of correct_prediction cast to float, i.e. the fraction of test examples that were predicted correctly. We use the tf.equal function to compare the predicted value and the target value. We do this by comparing the index of the highest value in the prediction with the index of the highest value in the target output. Suppose that the target output (T) and predicted output (P) are as follows:
$$T = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]\\ P = [0.03, 0.01, 0.08, 0.63, 0.07, 0.05, 0.03, 0.06, 0.02, 0.02]$$
Then tf.argmax(T) will return 3 and tf.argmax(P) will also return 3. This indicates that our prediction is indeed correct.
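
The same comparison can be reproduced in plain Python. The sketch below uses T and P from above plus a second pair, T2 and P2, which is a made-up example of a wrong prediction added for illustration; the helper `argmax` is a stand-in written for this sketch, not the TensorFlow function.

```python
def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])

T = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
P = [0.03, 0.01, 0.08, 0.63, 0.07, 0.05, 0.03, 0.06, 0.02, 0.02]

# A second, hypothetical example where the prediction is wrong
T2 = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0]
P2 = [0.10, 0.20, 0.05, 0.05, 0.05, 0.05, 0.05, 0.35, 0.05, 0.05]

# Compare predicted and target indices, then average the matches
correct = [argmax(p) == argmax(t) for p, t in [(P, T), (P2, T2)]]
accuracy = sum(1.0 if c else 0.0 for c in correct) / len(correct)

print(argmax(T), argmax(P))   # 3 3
print(correct, accuracy)      # [True, False] 0.5
```

With one correct and one incorrect prediction, the mean of the matches is 0.5, which is how the accuracy value in this guide should be read.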

## Running The Code

Once you run this program, you'll get output similar to this:

Epoch:0001 cost=147.860720253
Epoch:0002 cost=39.784715133
Epoch:0003 cost=25.110946123
Epoch:0004 cost=17.588648866
Epoch:0005 cost=12.923417049
Epoch:0006 cost=9.639559306
Epoch:0007 cost=7.066375390
Epoch:0008 cost=5.424123222
Epoch:0009 cost=3.995865268
Epoch:0010 cost=3.116271071
Epoch:0011 cost=2.207934179
Epoch:0012 cost=1.686281606
Epoch:0013 cost=1.275378693
Epoch:0014 cost=1.032349929
Epoch:0015 cost=0.836090075
Optimization Finished!
Accuracy: 0.9456


We can also view a TensorBoard visualization of our graph and of how our weights and biases changed over time during training.
To do so, we can enter the following in the command terminal:
tensorboard --logdir=/tmp/mnist_mlp
We can then navigate to 127.0.0.1:6006 in a browser to see the TensorBoard output:
Under the graph tab we can see the network we have made, organized using the name scopes we defined earlier. Under the histogram tab we can see the weights and biases. When it comes to tuning the network or its architecture, TensorBoard can prove to be a very useful utility.
In the histogram tab, we don't see much difference between the graphs of hidden layers 1 and 2. This suggests that the second hidden layer is not learning much beyond the first, and that this network might perform just as well if reduced to 3 layers instead of 4. This is one of the simplest, and probably most naive, ways to optimize your network architecture.

## Wrapping it up!

This is the last article in this series for now. TensorFlow is one of the most versatile computational frameworks, and a lot of libraries are built on top of it, abstracting various common algorithms and code to help prototype faster; Keras is one such great library worth checking out. I hope this guide helps people write and share good models built using TensorFlow. Thanks for reading!