Backpropagation (2/2) — Working Backwards To Push Forward Innovation — The Calculus Approach

Rohan Jagtap
12 min read · Apr 3, 2021

Before Reading:
Make sure to pay special attention to the different types of notation being used throughout this article. By far the most intimidating part of Backpropagation is the variety of symbols and indices; a sound understanding of the notation will clear most of the confusion up.

Introduction

From the previous article on Backpropagation, we learned that after the input data goes through our Neural Network via Forward Propagation, we get a certain output. We then use a loss function to compare that output with the actual correct values, telling us how "off" our values are. This is where Gradient Descent comes into play: it uses the derivatives of the cost function with respect to each of the weights in the Neural Network to update the weights over numerous iterations toward the optimal values, the ones resulting in the lowest cost function output.

The tool which enables Gradient Descent to calculate these derivatives is called Backpropagation. This process is the focus of this article: I will be going through all of the calculus involved while trying to make it as beginner-friendly as possible.

Definitions and Notations

Notation Clarifications

In a given neural network:

- L represents the number of layers contained within the network; when attached to a symbol, it denotes the current layer, and L-1 denotes the previous layer.
- j is the index of a node in layer L, running from 0 to one less than the total number of nodes in that layer.
- k is the index of a node in the previous layer L-1, running from 0 to one less than the total number of nodes in that layer.
- yⱼ represents the correct value for output node j in the last layer (the output layer) of the neural network, labeled L.
- C₀ is the loss for a single training sample, computed with the Mean Squared Error (MSE) function.
- wⱼₖ with a specified layer L is the weight of the connection from node k in the previous layer (denoted L-1) to node j in the current layer (denoted L).
- wⱼ with a specified layer L is a vector containing every weight connecting node j in the current layer back to the nodes k of the previous layer (denoted L-1).
- zⱼ with a specified layer L is the input for node j in layer L, in the form (the sum of [weight of each connection from the previous layer to the current node × the activation output of that previous-layer node] + the respective bias).
- g⁽ᴸ⁾ denotes the activation function used for each of the nodes in layer L.
- aⱼ with a specified layer L is the activation output of node j in layer L.

Expressing Conceptual Understandings In a Mathematical Form

This section will consist of concepts that we already understand about neural networks at a high level, simply expressed mathematically using the notation introduced above.

1. Mathematical Expression of the Loss Function

In the case of this article, we will be using the MSE cost function as described previously. Essentially, this means that we will be squaring the difference between each single output value from our model and the actual correct output value from our data, then summing these squared differences over all of the output values from our neural network model.

Using the notation described above:
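C_0 = \left( a_j^{(L)} - y_j \right)^2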

The above equation tells us that, for a single output node, C₀ equals the squared difference between the activation output of node j in layer L and the actual value for node j. Now, making this applicable to every output value from our neural network gives us the following equation:
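C_0 = \sum_{j=0}^{n-1} \left( a_j^{(L)} - y_j \right)^2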

The big symbol that looks like a backwards "3" is the Greek capital sigma (Σ), and all it means is a sum running from the bottom index value to the top value. In this case, our node index runs from 0 to one less than the number of nodes. The rest of the cost function is the same.

The above equation now sums the squared differences for all the nodes, starting at the top of the output layer (where the node is indexed as 0) and ending at the bottom of the output layer (where the last node is indexed as n-1, with n being the number of nodes in the output layer).

2. Mathematical Expression of the Input zⱼ with layer L

From my article on Forward Propagation, I discussed how input values propagate through the neural network. For a node j in a given layer L, the input is a weighted sum of all the activation function outputs from the previous layer L-1. This means that a single term of this weighted sum for a node j in a given layer L can be mathematically expressed as:
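w_{jk}^{(L)} \, a_k^{(L-1)}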

If we expand this expression so that the input of a particular node j in a given layer L includes the weighted contributions of all the nodes in layer L-1 (plus the bias), it looks like:
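z_j^{(L)} = \sum_{k=0}^{n-1} w_{jk}^{(L)} \, a_k^{(L-1)} + b_j^{(L)}

Here n is the number of nodes in layer L-1 and bⱼ is the bias for node j.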

3. Mathematical Expression of the Activation Output aⱼ with layer L

When a node receives an input (zⱼ), the input is first put through an activation function (g for layer L) to get an activation output (aⱼ for layer L). This maps the input to a particular range of outputs, and that range will depend on what your project requires. We can express this mathematically:
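a_j^{(L)} = g^{(L)}\left( z_j^{(L)} \right)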

4. Mathematical Expression of the Loss Function as a Composition of Functions

From the earlier mathematical expressions, we denoted C₀ as the sum of the squared differences between the outputs of the model and the actual outputs according to the dataset:
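C_0 = \sum_{j=0}^{n-1} \left( a_j^{(L)} - y_j \right)^2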

We also covered what C₀ would look like for a single output node:
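C_0 = \left( a_j^{(L)} - y_j \right)^2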

We know exactly what happens to the activation output of an output node inside the Loss Function, so when we actually call the Loss Function to receive a loss value, we can express it as a function of that activation output:
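C_0\left( a_j^{(L)} \right) = \left( a_j^{(L)} - y_j \right)^2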

Notice how we do not have yⱼ as a parameter: it is a constant taken from the vector y of actual output values in the dataset. Since yⱼ is fixed, it should not be included as a parameter; rather, it helps define what the loss value should be.

From past mathematical expressions, we also know that the activation output is the result of the input (zⱼ) being passed into the activation function g⁽ᴸ⁾. We can take this a step further, because we know that the input (zⱼ) comes from the weighted sum over the connections (wⱼₖ) from node j in layer L back to all the nodes in the previous layer L-1:
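C_0 = \left( g^{(L)}\left( z_j^{(L)} \right) - y_j \right)^2, \quad \text{where } z_j^{(L)} = \sum_{k=0}^{n-1} w_{jk}^{(L)} \, a_k^{(L-1)} + b_j^{(L)}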

Yes, we originally defined the input zⱼ as the sum of the previous layer's weights multiplied by the previous layer's activation outputs. However, writing that out in full would require us to break each activation output of the previous layer into its own composition of functions, and that process would continue on and on until we had a really ugly-looking function. To save time on notation, we will treat the activation outputs of the previous layer as given values.

We now have the Loss Function expressed as a composite of different functions that we have defined earlier. Now to express this function as the sum for each output node, we get:
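C_0 = \sum_{j=0}^{n-1} \left( g^{(L)}\left( z_j^{(L)} \right) - y_j \right)^2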

The reason we went through the work of expressing these familiar concepts mathematically is that we want to differentiate this loss function with respect to each of the weights in the neural network. Because the loss is a composition of functions, we can do this using the chain rule.

The Backpropagation Function

To get a sense of how the Backpropagation function works, let's take a concrete example: the single weight connecting node 2 in layer L-1 to node 1 in layer L, written w₁₂ with layer L in our notation. The derivative of the cost function with respect to this weight would be:
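\frac{\partial C_0}{\partial w_{12}^{(L)}}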

From the composition of C₀, we can use the chain rule to help us figure out this value:
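\frac{\partial C_0}{\partial w_{12}^{(L)}} = \frac{\partial C_0}{\partial a_1^{(L)}} \cdot \frac{\partial a_1^{(L)}}{\partial z_1^{(L)}} \cdot \frac{\partial z_1^{(L)}}{\partial w_{12}^{(L)}}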

This equation is what we use to figure out the gradient of the Cost Function with respect to the single weight connecting node 2 in layer L-1 to node 1 in layer L. We can now work through each of the three terms to find the derivative of the cost function with respect to our weight.

Term 1: Derivative of the Cost Function

Differentiating the squared-error term, the derivative of the cost function with respect to the activation output of node 1 in layer L is double the difference between the model output and the actual output:
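\frac{\partial C_0}{\partial a_1^{(L)}} = 2\left( a_1^{(L)} - y_1 \right)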

Term 2: Derivative of the Activation Function With Respect to the Input

We now move on to the second term: the derivative of the activation output of node 1 in layer L with respect to the input of that node.

Recall that the activation output for a single node j in the output layer L is the result of plugging the node input into the activation function:
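a_j^{(L)} = g^{(L)}\left( z_j^{(L)} \right)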

Since we’re working with node 1 in the layer L of our network, we get:

When we plug this into our second term, we compute this expression:
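\frac{\partial a_1^{(L)}}{\partial z_1^{(L)}} = g'^{(L)}\left( z_1^{(L)} \right)

Here g'⁽ᴸ⁾ denotes the derivative of the activation function, evaluated at the node's input.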

Term 3: Derivative of the Input With Respect to the Weight of the Connection

The third term in our application of the chain rule works out to the activation output of node 2 in the previous layer, denoted L-1. In other words, the input for node 1 in layer L responds to a change in the weight w₁₂ at a rate equal in magnitude to that activation output:
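\frac{\partial z_1^{(L)}}{\partial w_{12}^{(L)}} = a_2^{(L-1)}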

Putting the Three Terms Together

You might remember the chain rule from earlier:
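\frac{\partial C_0}{\partial w_{12}^{(L)}} = \frac{\partial C_0}{\partial a_1^{(L)}} \cdot \frac{\partial a_1^{(L)}}{\partial z_1^{(L)}} \cdot \frac{\partial z_1^{(L)}}{\partial w_{12}^{(L)}}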

After working with each term individually from the right side, we now get the expression:
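\frac{\partial C_0}{\partial w_{12}^{(L)}} = 2\left( a_1^{(L)} - y_1 \right) \cdot g'^{(L)}\left( z_1^{(L)} \right) \cdot a_2^{(L-1)}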

This expression tells us how the cost changes when we change the weight of the connection from node 2 in the previous layer to node 1 in the output layer: the rate of change is the product of the three terms above. However, that value only accounts for one training sample. To account for all the training samples, we average the per-sample gradients (here n is the number of training samples and Cᵢ is the loss of the i-th sample):
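\frac{\partial C}{\partial w_{12}^{(L)}} = \frac{1}{n} \sum_{i=0}^{n-1} \frac{\partial C_i}{\partial w_{12}^{(L)}}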

We would then do this exact process for each weight in the neural network to calculate the derivative of the loss function with respect to the weight for each connection.

Finding the Derivative of the Loss With Respect to the Weight of Any Node in the Neural Network

From the last section, we know that the derivative of the loss with respect to a weight feeding into the output layer can be calculated using:
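\frac{\partial C_0}{\partial w_{12}^{(L)}} = \frac{\partial C_0}{\partial a_1^{(L)}} \cdot \frac{\partial a_1^{(L)}}{\partial z_1^{(L)}} \cdot \frac{\partial z_1^{(L)}}{\partial w_{12}^{(L)}} = 2\left( a_1^{(L)} - y_1 \right) \cdot g'^{(L)}\left( z_1^{(L)} \right) \cdot a_2^{(L-1)}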

For this section, let's assume that we're working with the weight that connects node 2 in layer L-2 to node 2 in layer L-1, written w₂₂ with layer L-1. Using the chain rule for the derivative of the loss function with respect to this new weight gives us:
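\frac{\partial C_0}{\partial w_{22}^{(L-1)}} = \frac{\partial C_0}{\partial a_2^{(L-1)}} \cdot \frac{\partial a_2^{(L-1)}}{\partial z_2^{(L-1)}} \cdot \frac{\partial z_2^{(L-1)}}{\partial w_{22}^{(L-1)}}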

This new chain rule equation is very similar to the previous equation, except for the change of layer indexes due to working with weights in a different part of the neural network. Although the latter two parts of this new equation will use the same method of computation, the first part (calculating the derivative of the loss function with respect to the activation function output from node 2 of layer L-1) will be approached differently.

Well, why is this true? In the last example, the weight we chose fed directly into an output activation, and the loss function is defined as the sum of squared errors over the values from the output layer. This is not the case in the new equation: the weight is in layer L-1, a full layer before its effect is fed into the loss function. Learning how to calculate this derivative will be the focus of the rest of this article.

Calculating the First Section of the New Equation

From our neural network, the loss function values depend on the activation function values from the nodes of the output layer. The activation function values from the nodes of the output layer depend on the input values. The input values depend on the weights of the connections feeding each node in the output layer and on the activation values from the nodes in layer L-1.

This means that a change in the node that we have chosen to work with (node 2 in layer L-1) impacts the outcome of our neural network through every node in the output layer. Additionally, note that these dependencies of the loss function are really just a series of composite functions. From this we can infer that to calculate the derivative of the loss function with respect to the activation output of node 2 in layer L-1, we will again use the chain rule. This tells us that the first section of the new equation is a sum of products of the derivatives of the functions that the loss is composed of.

This relationship is mathematically expressed through:
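\frac{\partial C_0}{\partial a_2^{(L-1)}} = \sum_{j=0}^{n-1} \frac{\partial C_0}{\partial a_j^{(L)}} \cdot \frac{\partial a_j^{(L)}}{\partial z_j^{(L)}} \cdot \frac{\partial z_j^{(L)}}{\partial a_2^{(L-1)}}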

The reason we have the summation symbol is that a change in node 2 in layer L-1 will affect all the nodes in layer L. Since more than one node is affected, the changes have to be summed up. The first two sections of this equation should look natural to you by now, but the third section deserves a closer look in terms of computation.

Essentially, the third term works out to the weight itself: the input for any node j in layer L responds to a change in the activation output of node 2 in layer L-1 at a rate equal to the weight connecting node 2 in layer L-1 to node j in layer L. Our new equation, with elements from the previous example and the current example plugged in, looks like:
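\frac{\partial C_0}{\partial a_2^{(L-1)}} = \sum_{j=0}^{n-1} 2\left( a_j^{(L)} - y_j \right) \cdot g'^{(L)}\left( z_j^{(L)} \right) \cdot w_{j2}^{(L)}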

With this mathematical masterpiece, we can now calculate the gradient of the loss with respect to the activation output of any node in our neural network, such as node 2 in layer L-1. This value is then plugged back into the chain rule equation for our weight:
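\frac{\partial C_0}{\partial w_{22}^{(L-1)}} = \frac{\partial C_0}{\partial a_2^{(L-1)}} \cdot \frac{\partial a_2^{(L-1)}}{\partial z_2^{(L-1)}} \cdot \frac{\partial z_2^{(L-1)}}{\partial w_{22}^{(L-1)}}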

Conclusion

By using the chain rule, we can calculate gradients for values earlier in the network using values computed later in the network. For instance, to calculate the gradient of the loss function with respect to a weight feeding into node j in layer L, we need derivatives involving that node's activation output, which depends on the node's input, which in turn depends on the weight. To calculate the gradient of the loss function with respect to a weight one layer earlier, we need derivatives that depend on all of the activation outputs and all of the inputs of the nodes in layer L. In essence, we calculate derivatives that depend on components later in the network and then reuse them in the calculations for the gradients of the loss with respect to weights that come earlier in the network. We repeatedly apply the chain rule in a backward fashion; hence, backpropagation.
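To make this backward flow concrete, here is a minimal NumPy sketch of the gradients derived above for a tiny network with one hidden layer. The sigmoid activation, the layer sizes, and the variable names are my own assumptions for illustration, not code from this series:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

rng = np.random.default_rng(0)

# A tiny network: 3 inputs -> 4 hidden nodes (layer L-1) -> 2 output nodes (layer L)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)    # weights/biases into layer L-1
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)    # weights/biases into layer L
x = rng.normal(size=3)                           # one training sample
y = np.array([0.0, 1.0])                         # its correct output values

# Forward propagation: z_j = sum_k(w_jk * a_k) + b_j, then a_j = g(z_j)
z1 = W1 @ x + b1;  a1 = sigmoid(z1)              # layer L-1
z2 = W2 @ a1 + b2; a2 = sigmoid(z2)              # layer L (output layer)
C0 = np.sum((a2 - y) ** 2)                       # loss for this single sample

# The three terms for the output-layer weights, as derived above
dC_da2 = 2.0 * (a2 - y)                          # dC0/da_j(L)   = 2(a_j - y_j)
da2_dz2 = sigmoid_prime(z2)                      # da_j(L)/dz_j(L) = g'(z_j)
dC_dz2 = dC_da2 * da2_dz2
dC_dW2 = np.outer(dC_dz2, a1)                    # dz_j(L)/dw_jk(L) = a_k(L-1)

# One layer back: dC0/da_k(L-1) = sum_j dC0/dz_j(L) * w_jk(L)
dC_da1 = W2.T @ dC_dz2
dC_dz1 = dC_da1 * sigmoid_prime(z1)
dC_dW1 = np.outer(dC_dz1, x)                     # gradients for the weights into layer L-1

print(C0, dC_dW2.shape, dC_dW1.shape)            # loss value, (2, 4), (4, 3)
```

Notice how the quantities computed for layer L (dC_dz2) are reused to get the gradients for the earlier layer, mirroring the backward application of the chain rule described above.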

More about me — my name is Rohan, I am a 16-year-old high school student learning about disruptive technologies and I’ve chosen to start with A.I. To reach me, contact me through my email or my LinkedIn. I’d be more than glad to provide any insight or learn the insights you may have. Additionally, I would appreciate it if you could join my monthly newsletter. Until the next article 👋!
