Convolutional Neural Networks — Don’t Memorize, Learn Instead

Rohan Jagtap
7 min read · May 19, 2021


Introduction

Allowing computers to see sounds pretty exciting and scary at the same time. However, as I have said before, computers “see” differently than we do. Where we perceive the shapes, colors, and movements of different objects, machines receive millions of numerical values. This means that where one person can identify an orange color in a picture, the computer instead receives a grid of numbers representing the RGB values of each pixel.

One might ask why Multi-layer Perceptrons (MLPs) are not enough for deep learning image analysis. Well, in Machine Learning there exists an architecture much better suited to images, called the Convolutional Neural Network (CNN/ConvNet). A CNN relies on two main kinds of layers, convolutional layers and pooling layers, which are great at noticing certain patterns in images through “filters”. This may sound a little confusing, but I will go over each term in this article.

You might be wondering what specific patterns exist in an image. Take this image for example:

Credits to Rick Steinback, the creator of this retro sci-fi masterpiece! (Thanks Steven)

In this image, there are various “patterns” one can notice. For example, there are textures such as the one on the underbelly of the space dolphin, curves such as the shape of the space dolphin’s helmet, circles such as the one on the space dolphin’s back, straight lines such as those on the space dolphin’s arms, and much more. When this image is fed through a thoroughly trained CNN, a kernel (filter) might detect a certain “pattern” such as a particular texture, curve, or corner.

One thing to note:
The simpler geometric filters are what one would expect near the beginning of a Convolutional Neural Network. As we move deeper into the network, the filters gradually increase in sophistication. Early filters might detect lines, curves, or edges, while deeper layers can respond to whole objects such as dogs or lizards.

Kernels/Filters

In our sample neural network, let’s imagine we have an input layer that accepts images of ducks. The next layer is our convolutional layer. Whenever we define a convolutional layer, we also need to define how many filters (kernels) we want. Let’s say that for this convolutional layer, we want one filter represented by a 3x3 matrix. All values in the matrix are initialized randomly and then learned during training.

When the neural network receives an image, the kernel will slide over each 3x3 block of pixels from the input until every 3x3 block of pixels has been covered.

Convolution

This “sliding” is formally called convolving; we could say that the kernel is convolving over each 3x3 block of pixels from the input image. What happens at each position is a relatively simple operation: an elementwise multiplication followed by a sum, which produces a single value per position and therefore a smaller output matrix, as shown below.

Credits (in this case, the kernel values have been defined as [[0, 1, 2], [2, 2, 0], [0, 1, 2]])
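To make the arithmetic concrete, here is a minimal NumPy sketch of a single 3x3 kernel convolving over a small grayscale image. The kernel values match the figure above; the 5x5 image values are made up purely for illustration.

```python
import numpy as np

# A made-up 5x5 grayscale "image" (values are arbitrary, for illustration only)
image = np.array([
    [3, 3, 2, 1, 0],
    [0, 0, 1, 3, 1],
    [3, 1, 2, 2, 3],
    [2, 0, 0, 2, 2],
    [2, 0, 0, 0, 1],
])

# The 3x3 kernel from the figure above
kernel = np.array([
    [0, 1, 2],
    [2, 2, 0],
    [0, 1, 2],
])

# Slide the kernel over every 3x3 block (stride of 1, no padding).
# At each position: elementwise multiply, then sum into a single value.
out_size = image.shape[0] - kernel.shape[0] + 1   # 5 - 3 + 1 = 3
output = np.zeros((out_size, out_size))
for i in range(out_size):
    for j in range(out_size):
        block = image[i:i + 3, j:j + 3]
        output[i, j] = np.sum(block * kernel)

print(output)  # a smaller 3x3 feature map
```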

Convolution can result in blurry-looking outputs with a wide variety of pixel values, such as the pixel values of the “7” shown on the right.

Using the second type of operation in a CNN, we can sharpen this image.

Max-Pooling

With Max-Pooling, we keep only the most prominent features from a convolved output. We first define some n-by-n region; max-pooling then takes the highest pixel value in each region and places it into a smaller output matrix, as sketched below.
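Here is a rough sketch of what 2x2 max-pooling does to a convolved output; the feature-map values below are invented for illustration.

```python
import numpy as np

# A made-up 4x4 feature map (illustrative values only)
feature_map = np.array([
    [1, 3, 2, 1],
    [4, 6, 5, 2],
    [3, 1, 9, 8],
    [0, 2, 4, 7],
])

# 2x2 max-pooling over non-overlapping regions:
# keep only the largest value from each region.
pooled = np.zeros((2, 2))
for i in range(2):
    for j in range(2):
        region = feature_map[2 * i:2 * i + 2, 2 * j:2 * j + 2]
        pooled[i, j] = region.max()

print(pooled)
# [[6. 5.]
#  [3. 9.]]
```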

The stride you choose also shapes the output. Up to this point, our kernel has moved over one pixel at a time, which means our stride is set to 1. However, it is also possible to set your stride to a different number: the higher your stride, the smaller the resulting 2D matrix. This is because increasing your stride means skipping pixels when the kernel slides over to its next set of inputs, further reducing the size of the output matrix.
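As a quick sanity check on that size reduction, the output width along one dimension (ignoring padding) works out to (input - kernel) / stride + 1. The 28-pixel input below is just an example to show the effect of a larger stride.

```python
def output_size(input_size, kernel_size, stride):
    # Number of positions a kernel can take along one dimension (no padding)
    return (input_size - kernel_size) // stride + 1

print(output_size(28, 3, 1))  # 26 -> the kernel visits every pixel
print(output_size(28, 3, 2))  # 13 -> skipping pixels roughly halves the output
```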

Convolutional Layer

Now we can see why we would use a CNN on our input images. An important concept to note is that different kernels extract different information. At the same time, applying kernels removes a massive amount of information from the image, leaving only the bare essentials needed for the rest of the network. The example above shows a feature map that is devoid of all colors, backgrounds, and textures, yet we are still left with what one could make out as eyes, a nose, and ears. Notice that all the information that was cut out was not necessary for the network to determine what type of image is being fed in; different kernel values simply cause different features of the image to be kept or lost. Some filters, such as classic edge detectors, have well-known values and well-known effects.
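To tie this together, here is a minimal sketch of how a small CNN with convolutional and max-pooling layers could be defined in Keras. The choice of library is an assumption on my part, and the filter counts, kernel sizes, and input shape are illustrative rather than the exact architecture from my project.

```python
from tensorflow.keras import layers, models

model = models.Sequential([
    # 32 filters, each a randomly initialized 3x3 kernel learned during training
    layers.Conv2D(32, (3, 3), activation="relu", input_shape=(128, 128, 3)),
    layers.MaxPooling2D((2, 2)),                   # keep only the strongest responses
    layers.Conv2D(64, (3, 3), activation="relu"),  # deeper filters detect more complex patterns
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(2, activation="softmax"),         # e.g. two classes such as cats vs. dogs
])
```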

Overfitting

When you train a model, a strange phenomenon may occur where your model becomes hyper-focused on a particular set of data, so much so that it memorizes the data instead of learning general patterns from it. This results in the parameters being tuned to that one specific dataset. Although your training accuracy may be insanely high, testing on data the model has never seen will reveal that it has simply memorized its inputs.

If this happens to models during training, how do we prevent this from happening?

Early Stopping

If you’ve ever coded a model from a tutorial, you might be wondering why we have a training dataset and a test dataset. If we already have a dataset to train the model and update its weights and biases, what would be the point of having a test dataset?

Well, your model may have an extremely high training accuracy, but most often that is because it has overfitted the data, as mentioned above. Early stopping lets the developer check how accurate the model truly is by setting aside a validation dataset: brand-new data that the model has never seen before. This data is sent through the CNN after each round of training to measure its accuracy, and training is stopped early once that validation accuracy stops improving.

By regularly checking whether your model is truly as accurate as it appears on the training dataset, you can avoid overfitting as much as possible.
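Assuming a Keras-style workflow, this regular check is usually automated with the EarlyStopping callback, which watches a validation metric and halts training once it stops improving. The model, train_ds, and val_ds names below are placeholders.

```python
from tensorflow.keras.callbacks import EarlyStopping

# Stop training once validation loss has not improved for 3 epochs in a row,
# and roll the model back to the weights from its best epoch.
early_stop = EarlyStopping(monitor="val_loss", patience=3, restore_best_weights=True)

# model, train_ds, and val_ds are placeholder names for your network and datasets
model.fit(train_ds, validation_data=val_ds, epochs=50, callbacks=[early_stop])
```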

Image Augmentation

Typically, you would need an enormous amount of data and a large amount of time to train an accurate model. But what if you do not have as much data as you would have liked? Is it still possible to train a decent CNN? Would it be possible to generate new data from pre-existing data?

Enter Image Augmentation: a clever way to generate more data than you currently possess. The idea is that if you modify pre-existing images, for example with a rotation or a zoom, you can generate new data points that you can then use to train your CNN.

Image Augmentation also helps mitigate overfitting: because the images are modified in different ways, your model never gets used to seeing the exact same image and is instead fed different versions of the same pictures, as in the sketch below.
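One way to set this up in Keras (again, an assumption about tooling, and the specific ranges below are illustrative) is with ImageDataGenerator, which produces rotated, shifted, zoomed, and flipped variants of the training images on the fly.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Each original picture can appear rotated, shifted, zoomed, or flipped,
# effectively giving the model many "new" images from the same data.
train_gen = ImageDataGenerator(
    rescale=1.0 / 255,        # scale pixel values to [0, 1]
    rotation_range=30,        # rotate by up to 30 degrees
    width_shift_range=0.1,    # shift horizontally by up to 10%
    height_shift_range=0.1,   # shift vertically by up to 10%
    zoom_range=0.2,           # zoom in or out by up to 20%
    horizontal_flip=True,
)

# "data/train" is a placeholder path to a folder of labeled images
train_data = train_gen.flow_from_directory("data/train", target_size=(128, 128), batch_size=32)
```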

Dropout

When a model is being trained, some neurons may end up with much larger weights than others. When this occurs, changes to the heavily weighted neurons significantly impact the model’s output, while the neurons with small weights barely contribute and their weights are not tuned as often. Reducing this over-reliance on a few neurons helps us overcome overfitting, but how would we do that?

Using dropout, you randomly “turn off” some neurons during each training step, meaning that they are not used or updated and therefore do not impact the model’s output for that step. This forces the neurons with smaller weights to pick up the slack and become active, training the parts of the model that were previously dormant and allowing the model to figure out the output without memorizing the dataset.
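In Keras, dropout is a single extra layer; the 0.5 rate below is a common but arbitrary choice, and the surrounding layers are only there to show where it typically sits.

```python
from tensorflow.keras import layers, models

classifier_head = models.Sequential([
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.5),   # randomly turn off half of these neurons on each training step
    layers.Dense(2, activation="softmax"),
])
```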

Coded Example

Follow this link to look at my project. In my coded project, I trained a CNN to classify images of cats and dogs according to their respective labels. I have also included some lines of code that let the reader view the different augmentations of their data.

Overall, I achieved ~80% accuracy with the model architecture I used and the methods discussed in this article. Though that is not an incredibly high accuracy, I was fascinated by the drastic change in accuracy brought about by Early Stopping, Image Augmentation, and Dropout. By experimenting with different model architectures and different values to initialize the layers, I’ve been able to get up to ~96% accuracy. After all, a large part of Machine Learning is experimentation!

More about me: my name is Rohan, I am a 16-year-old high school student learning about disruptive technologies, and I’ve chosen to start with A.I. To reach me, contact me through my email or my LinkedIn. I’d be more than glad to provide any insight or hear the insights you may have. Additionally, I would appreciate it if you could join my monthly newsletter. Until the next article 👋!

