Activation Functions

Rohan Jagtap
8 min read · Apr 1, 2021

Preface: For the rest of this article, it is recommended that you have a high-level understanding of what a Neural Network is. To appreciate how activation functions help us, you first need to understand the main motivation behind using Neural Networks in the first place. If that sounds like you, keep reading!

Introduction

Activation functions are what many would call one of the main building blocks of Artificial Neural Networks (ANNs). Since most real-world problems cannot be solved by approximating straight lines, activation functions become essential to ANNs: although they increase the complexity of our model, they are what introduce non-linearity.

When our models find connections in the data we feed through them, not all of that data is going to be relevant. This is where the handy activation function comes into play. The activation function takes the weighted sum of the inputs plus a bias and tells us when a neuron should be activated and when it should not.
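To make this concrete, here is a minimal sketch (my own illustration, not code from any particular library) of a single neuron: compute the weighted sum of the inputs, add the bias, and pass the result through an activation function.

import numpy as np

def neuron_output(inputs, weights, bias, activation):
    z = np.dot(weights, inputs) + bias  # weighted sum plus bias
    return activation(z)                # the activation decides how strongly the neuron "fires"

# Example usage with a placeholder step activation:
# neuron_output(np.array([0.5, -1.2]), np.array([0.8, 0.3]), 0.1, lambda z: 1 if z >= 0 else 0)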

In this article, I will be going through a non-comprehensive list of some of the most widely used activation functions. Additionally, I will also be discussing when a developer should take advantage of specific activation functions.

1. The Binary Step Function

When determining which neurons in a neural network should be activated and which should not, setting a value threshold immediately comes to mind. The Binary Step Function captures exactly this logic: all values below zero become zero, and all values greater than or equal to zero become one.

A mathematical look at the Binary Stepper Function.
def binary_stepper(x):  # Python implementation of the Binary Step Function
    if x < 0:
        return 0
    else:
        return 1

During the backpropagation process, gradients are calculated to update the weights and biases. However, since the gradient of this function is zero everywhere it is defined, the weights and biases never update. A common use case for this type of activation function is a classification model with only two classes; it becomes quite useless once more classes are added.
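As a quick sketch of why that happens, the derivative of the step function is zero everywhere it is defined, so nothing useful flows back through it:

def binary_stepper_gradient(x):
    # The step function is flat on both sides of zero (its derivative is undefined
    # exactly at zero), so every gradient passed back through it is zero and the
    # weights and biases feeding this neuron never change.
    return 0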

2. Linear Function

With the Binary Step Function, we saw that the problem was that the gradient was zero because the output did not depend on "x" at all. Instead of a Binary Step Function, we can use the Linear Function as our activation, which outputs a different "y" for every "x" value.

A mathematical look at the Linear function.
def linear_function(x):  # y = a*x with a = 4
    return 4 * x
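The catch, sketched below with the same a = 4, is that the gradient of a linear function is a constant that no longer depends on x, and stacking linear activations still gives you nothing more than a straight line.

def linear_gradient_function(x):
    # d(4x)/dx = 4 for every input, so the gradient carries no information about x.
    return 4

# Composing two linear activations is still linear:
# linear_function(linear_function(x)) == 16 * x, i.e. just another straight line.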

3. Sigmoid Function

The next activation function is probably one of the most widely used activation functions. It takes any value as input and maps it to a value between 0 and 1.

A mathematical look at the Sigmoid function.
import numpy as np

def sigmoid_function(x):
    z = 1 / (1 + np.exp(-x))
    return z

Unlike the previous two activation functions, the sigmoid is non-linear. Consequently, the output from the sigmoid function is also non-linear.

The gradient is largest roughly between -3 ≤ x ≤ 3. Outside of that bump in the domain, the curve flattens out, which means the gradient values become very small. The different gradient values come from the following function.

def sigmoid_gradient_function(x):
    return sigmoid_function(x) * (1 - sigmoid_function(x))
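Plugging a few values into that gradient (the numbers below are my own quick checks) shows how fast it shrinks outside of roughly -3 ≤ x ≤ 3:

print(sigmoid_gradient_function(0))  # 0.25, the largest value the sigmoid gradient ever takes
print(sigmoid_gradient_function(3))  # ~0.045
print(sigmoid_gradient_function(6))  # ~0.0025, effectively vanished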

The Sigmoid function is normally used in binary classification projects. However, due to the vanishing gradient problem, it is sometimes avoided in deeper neural network projects.

4. Tanh Function

The Tanh activation function shares some similarities with the sigmoid function, such as its general shape, but differs in its range of output values: it takes an input and maps it to a value between -1 and 1. An important distinction is that, because the output can be negative, the values passed from one layer to the next do not always have the same sign.

A mathematical look at the Tanh function.
def tanh_function(x):
    # Equivalent to np.tanh(x), expressed via the sigmoid defined above.
    return (2 * sigmoid_function(2 * x)) - 1

Similar to the Sigmoid function, the Tanh function is also a non-linear activation function. This means that we can take a look at the gradients of this function.

We can see that the Tanh gradient bump is a little steeper than the Sigmoid gradient bump; this is because there is a larger rate of change when the output spans -1 → 1 rather than 0 → 1.
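For reference, here is a sketch of that gradient, using the standard identity that the derivative of tanh(x) is 1 - tanh(x)²:

def tanh_gradient_function(x):
    t = tanh_function(x)
    # d/dx tanh(x) = 1 - tanh(x)^2; at x = 0 this gives 1,
    # compared to the sigmoid's maximum gradient of 0.25.
    return 1 - t ** 2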

5. ReLU Function

The Rectified Linear Unit (ReLU) activation function is extremely popular in the Deep Learning community. This is due to the fact that by using this activation function, not all neurons are activated at the same time. The ReLU function should only be used in hidden layers.

Conceptually, this is incredibly easy to understand. Any value less than 0 is made to be 0, and any value equal to or greater than 0 is kept the same.

A mathematical look at the ReLU function.
def relu_function(x):
    if x < 0:
        return 0
    else:
        return x

Taking a look at the possible gradient values, we notice something peculiar.

For all of the input values below zero, the gradient is equal to zero. During the backpropagation process, this can cause some weights and biases not to update, leading to dead neurons in the neural network model.
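A short sketch of that gradient makes the dead-neuron issue explicit:

def relu_gradient_function(x):
    # 0 for negative inputs, 1 otherwise (undefined exactly at zero; implementations
    # simply pick 0 or 1 there). Neurons stuck on the negative side stop updating.
    if x < 0:
        return 0
    else:
        return 1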

6. Leaky ReLU Function

The Leaky ReLU function was made to fix the problem the ReLU function can cause: dead neurons in the neural network model. The fix is a very slight slope, typically 0.01, applied to the input values below zero.

A mathematical look at the Leaky ReLU function.
def leaky_relu_function(x):
    if x < 0:
        return 0.01 * x
    else:
        return x

We can now take a look at the gradient values of this function.

Although it may look like the slope on the left side is on the x-axis, if you were to zoom in a little further, you would find that the slope is actually just above zero.
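A sketch of the gradient shows the fix in action: the negative side now passes back a small but non-zero value instead of cutting the update off entirely.

def leaky_relu_gradient_function(x):
    # A constant 0.01 on the negative side keeps a trickle of gradient flowing;
    # the positive side behaves exactly like plain ReLU.
    if x < 0:
        return 0.01
    else:
        return 1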

7. Parameterised ReLU Function

The Parameterised ReLU function is yet another variant of the ReLU function: instead of the fixed 0.01 slope in the Leaky ReLU function, you can now pass whichever slope suits your needs best.

A mathematical look at the Parameterised ReLU function.
def parameterised_relu_function(x, a):
    if x < 0:
        return a * x
    else:
        return x

In the case of a Parameterised ReLU function, 'a' is actually a trainable parameter. When the network learns the value of 'a' itself, the result is faster and more optimal convergence.
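Because 'a' is trained like any other weight, the network also needs a gradient with respect to 'a' itself. Here is a rough sketch of that math (my own illustration, not a particular framework's implementation):

def parameterised_relu_gradient_wrt_a(x, a):
    # For x < 0 the output is a*x, so the derivative with respect to a is x.
    # For x >= 0 the output does not depend on a at all, so the derivative is 0.
    if x < 0:
        return x
    else:
        return 0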

8. Exponential Linear Unit Function

Another variant of the ReLU activation function, the Exponential Linear Unit (ELU) function changes the shape of the curve for negative input values, replacing the flat zero region with a smooth exponential curve.

A mathematical look at the Exponential Linear Unit function.
import numpy as np

def exponential_linear_unit_function(x, a):  # example: a = 1.5
    if x < 0:
        return a * (np.exp(x) - 1)
    else:
        return x

We can now take a look at the gradient values of this function, assuming that a = 1.5.
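Here is a sketch of that gradient, again with a = 1.5:

import numpy as np

def exponential_linear_unit_gradient_function(x, a=1.5):
    # For x < 0 the output is a*(e^x - 1), so the gradient is a*e^x, which fades
    # smoothly towards zero instead of being cut off abruptly like ReLU.
    if x < 0:
        return a * np.exp(x)
    else:
        return 1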

9. Swish Function

Swish is a lesser-known activation function that was discovered by researchers at Google. It is computationally efficient and actually outperforms the popular ReLU function on deeper models. The Swish function is unbounded above and, unlike ReLU, dips slightly below zero for negative inputs (to a minimum of roughly -0.28).

def swish_function(x):
    return x * sigmoid_function(x)

Since the curve of this function is smooth, the function is differentiable at every point. This helps during the model optimization process, which is one of the reasons why the Swish function outperforms the ReLU function.
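Since the curve is smooth, we can also write its gradient in closed form by applying the product rule to x * sigmoid(x) (a quick sketch reusing the sigmoid defined earlier):

def swish_gradient_function(x):
    s = sigmoid_function(x)
    # Product rule: d/dx [x * sigmoid(x)] = sigmoid(x) + x * sigmoid(x) * (1 - sigmoid(x)).
    return s + x * s * (1 - s)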

10. Softmax Function

The Softmax function returns the probability that a particular datapoint belongs to a certain class. This function is a little different from the others I have listed, as it uses the sum over all of the inputs to determine the probability for each individual input.

import numpy as np

def softmax_function(x):
    z = np.exp(x)
    z1 = z / z.sum()
    return z1

This may look a bit confusing, but all it is doing is dividing e^x for an individual element by the sum of e^x over all of the input values, giving each element a probability between 0 and 1.
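For example, running the function on a small vector of raw scores (the numbers below are my own quick check):

scores = np.array([2.0, 1.0, 0.1])
print(softmax_function(scores))        # ~[0.659, 0.242, 0.099]
print(softmax_function(scores).sum())  # 1.0, so the outputs form a probability distribution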

Conclusion

This was a non-comprehensive list of activation functions as there are simply too many functions to cover in one article. However, this list did describe some of the most popular activation functions in the Machine Learning community.

An additional note worth mentioning is that these activation functions and their described uses are just a rough idea of the situations that may arise. Many neural network projects involve a lot of trial and error. Thus, experimenting with different activation functions for your particular use case is the path to training the best model.

More about me — my name is Rohan, I am a 16-year-old high school student learning about disruptive technologies and I’ve chosen to start with A.I. To reach me, contact me through my email or my LinkedIn. I’d be more than glad to provide any insight or to learn the insights that you may have. Additionally, I would appreciate it if you could join my monthly newsletter. Until the next article 👋!
