Multivariate Linear Regression…THERE’S MORE?

Rohan Jagtap
Published in DataDrivenInvestor
10 min read · Jan 3, 2021


Hey Readers! This is a continuation of my Medium series on understanding the theory and mathematics behind machine learning. Since I have started focusing on this topic, I will be writing articles describing my journey, so stay tuned for more content. Now, back to the article!

In a previous article, I covered the mathematics behind univariate linear regression and its implementation in Python. Now you may be looking at multivariate linear regression and thinking, “Well, my brain is about to be fried”. While we will build on previous concepts, I assure you this topic is much easier to comprehend than you might think. That claim is backed by my money-back guarantee! (Get it? This article is free 😉)

Before we jump into the madness, I’ll quickly run through some background knowledge. Enter: Vectors and Matrices! (If you’re here just for the explanation, start at Updating Equations)

Arrays

If you’ve ever coded before, you might be familiar with a data type known as an array. Essentially, it is a data type that holds elements of the same type which are freely accessible using an element’s index. Here’s a quick diagram:

Here’s the cool thing about arrays: they can be nested within each other. You can have one-dimensional, two-dimensional, three-dimensional, all the way up to an n-dimensional array (provided that your computer can handle it). Again, any element within these n-dimensional arrays can be accessed with the proper series of indices. Here are the first three:
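
Here is a quick sketch in Python (using NumPy, with made-up values) of one-, two-, and three-dimensional arrays and how a series of indices addresses an element in each:

```python
import numpy as np

# 1-D array: one index picks an element
a1 = np.array([10, 20, 30, 40])
print(a1[2])         # 30

# 2-D array: a (row, column) pair picks an element
a2 = np.array([[1, 2, 3],
               [4, 5, 6]])
print(a2[1, 0])      # 4

# 3-D array: three indices (depth, row, column)
a3 = np.arange(24).reshape(2, 3, 4)
print(a3[1, 2, 3])   # 23
```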

Vectors and Matrices

In terms of machine learning, vectors and matrices are both arrays, but they have a very important difference. A vector is a single column or row, while a matrix is any rectangular array of numbers. Let me give you a concrete example:

The entire 4x6 rectangle is one matrix, while a single row (1, 2, 3, 4, 5, 6) or column (1, 7, 13, 19) is a vector. Here’s the way I remember it: an array with a “1” in its dimensions (such as 4x1 or 1x6) is a vector, while an array whose two dimensions are both greater than 1 is a matrix.

Vectors and Matrices Notation

In arrays, we have a straightforward method of addressing each element. But how does this work in vectors/matrices? Each value held in a vector/matrix is called an “entry”. Referencing an entry can be done by following the notation: (Name of Matrix)(row, column). Let’s name the matrix we just saw as A. To access value “14”, we would use A₃₂.
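
Here is the same 4x6 matrix as a NumPy array. One caveat: the math notation is 1-indexed, while NumPy is 0-indexed, so A₃₂ becomes A[2, 1] in code:

```python
import numpy as np

# The 4x6 matrix from the example, filled with 1..24
A = np.arange(1, 25).reshape(4, 6)

row_vector = A[0, :]   # (1, 2, 3, 4, 5, 6), the first row
col_vector = A[:, 0]   # (1, 7, 13, 19), the first column

# A32 in math notation (row 3, column 2) is A[2, 1] in 0-indexed NumPy
print(A[2, 1])         # 14
```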

Matrix Operations

Now that we’re familiar with what vectors and matrices actually are and the notation they use, we can start manipulating their values. I will cover the basics of these operations, but since they are not the main topic of this article, I will link additional resources that better explain exactly how these operations work.

Each entry in the same position is added together
Similarly, entries in the same positions are subtracted

To add two matrices together, they need to have the same dimensions. Notice how we are adding two matrices of size 2x2: each pair of entries in the same position is added together to form a new matrix of values. Subtraction works the same way, entry by entry. Vectors are added and subtracted in exactly the same fashion, and they too must have matching dimensions.
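
A quick check of element-wise addition and subtraction on two 2x2 matrices (the values here are illustrative, not the ones from the figure):

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
B = np.array([[5, 6],
              [7, 8]])

print(A + B)   # [[ 6  8] [10 12]]: entries in the same position are added
print(A - B)   # [[-4 -4] [-4 -4]]: entries in the same position are subtracted
```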

With matrices, we can also perform operations called scalar multiplication and division. All this means is that we multiply or divide every entry by a single number:

Hm…seems pretty straightforward to me!
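
In code, scalar multiplication and division look just as simple (again with made-up values):

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])

print(3 * A)   # every entry multiplied by 3
print(A / 2)   # every entry divided by 2
```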

Unlike addition and subtraction, multiplication can also be performed between a matrix and a vector, even though their dimensions differ. Here is an example:

Before we can perform this operation, we must first see if it is possible. The rule goes as follows: the number of columns in the matrix must equal the number of entries in the vector. In a more mathematical definition, in an mxn matrix with m rows and n columns, the vector must be an nx1 matrix with n rows. If our matrix and vector have met these conditions, we can proceed.

Visual Representation of Matrix and Vector multiplication, Andrew Ng

To make the operation simpler, section off a row of the matrix. From left to right, the first entry in the selected row of the matrix is multiplied by the first entry of the vector (a * x = ax). Then, the second entry in the same row of the matrix is multiplied by the second entry of the vector (b * y = by). This pattern continues until every entry in that row of the matrix has been paired with the corresponding entry of the vector. All the products from the same row are then added together (ax + by). This process is repeated for each row of the matrix.
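
Here is a sketch of that row-by-row procedure, with a NumPy check at the end (the numbers are made up for illustration):

```python
import numpy as np

A = np.array([[1, 3],
              [4, 0],
              [2, 1]])      # a 3x2 matrix
x = np.array([1, 5])        # a vector with 2 entries

# Row by row: multiply matching entries, then add the products
result = np.zeros(A.shape[0])
for i in range(A.shape[0]):
    result[i] = sum(A[i, j] * x[j] for j in range(A.shape[1]))

print(result)   # [16.  4.  7.]
print(A @ x)    # NumPy's built-in matrix-vector product gives the same answer
```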

Well, what about multiplication between two matrices? Before we answer that question, we need to see if it is even possible to perform this operation.

The rule goes as follows: the number of columns in the first matrix must equal the number of rows in the second matrix. More mathematically, an mxn matrix (m rows, n columns) can only be multiplied by an nxo matrix, and the result is an mxo matrix. If these conditions are met, we can proceed.

Visual Representation of multiplication between matrices, Andrew Ng

To multiply two matrices, we apply the same logic as before; however, we first have to split the second matrix into separate column vectors.

Once we have split it into columns, we carry out a matrix and vector multiplication for each column to find the values of the resulting matrix. After finding those values, we simply assemble them back into a matrix, column by column.
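
The same idea in code: split the second matrix into its columns, do one matrix-vector product per column, and stack the results back together (the values are illustrative):

```python
import numpy as np

A = np.array([[1, 3],
              [2, 5]])
B = np.array([[0, 1],
              [3, 2]])

# Multiply A by each column of B separately, then reassemble the columns
columns = [A @ B[:, j] for j in range(B.shape[1])]
C = np.column_stack(columns)

print(C)       # [[ 9  7] [15 12]]
print(A @ B)   # the direct matrix product agrees
```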

What about division? There is no direct equivalent of division for matrices; when we need one, we multiply by the inverse of a matrix instead, which is exactly what the normal equation at the end of this article does. Alright, now we’re done with the math recap that we need to continue with this lesson.

Updating Equations

In the article on univariate linear regression, we used the hypothesis hθ(x) = θ₀ + θ₁x. That equation assumed we are only interested in the relationship between two variables, but what if you have 3, 4, or even 1,000,000 input variables? Real-world outcomes rarely depend on just one variable, so to accommodate the wealth of data we have access to, we need to update our equations to take in as many parameters as required.

These inputs I’ve been referring to are called features. Our equations will need to take in every relevant feature, since each one contributes its own slope to the linear regression. Here’s the updated hypothesis equation:

hθ(x) = θ₀ + θ₁x₁ + θ₂x₂ + … + θₙxₙ

Each x represents the data for a separate feature. Just like univariate linear regression, we are trying to figure out the different thetas so that we can accurately fit a line to the data. Each parameter value decides how much a particular feature affects the outcome. Let me provide a concrete example of what this looks like:

Additionally, for notation’s sake, let’s set x₀ = 1. This gives θ₀ a feature to pair with; otherwise, the θ and x vectors would have different dimensions and we could not carry out the matrix operations below. Notice how, when a parameter such as θ₃ is much larger than θ₂, changes in feature x₃ affect the prediction (and therefore the cost) much more than changes in feature x₂ do.
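
Here is a tiny, hypothetical example (the parameter and feature values are made up) showing that with x₀ = 1 the prediction is just the sum of each parameter times its feature, and that a larger parameter moves the prediction more:

```python
import numpy as np

theta = np.array([50.0, 0.1, 0.05, 20.0])   # made-up parameters theta_0 .. theta_3
x     = np.array([1.0, 30.0, 40.0, 2.0])    # x_0 = 1 by convention, then three features

prediction = np.dot(theta, x)   # theta_0*1 + theta_1*x_1 + theta_2*x_2 + theta_3*x_3
print(prediction)               # 95.0
# Increasing x_3 by 1 moves the prediction by 20, while increasing x_2 by 1 moves it by only 0.05
```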

We can actually reduce our hypothesis equation to a single, simple expression by treating all the theta values and all the x values as two vectors. But before I introduce this new equation, I’ll quickly go over the transpose of a matrix. The transpose simply interchanges the rows and columns of a matrix, which here makes the dimensions suitable for multiplication. An easier way to think about it: the entry A₂₃ becomes the entry A₃₂ after transposing A. With that, the hypothesis can be written compactly as hθ(x) = θᵀx.

Now let me introduce the updated cost function, or mean squared error function, that takes in all of the new possible parameters:

J(θ) = (1 / 2m) · Σᵢ₌₁ᵐ (hθ(x⁽ⁱ⁾) - y⁽ⁱ⁾)²

There’s not much to explain here, except that this function measures the average squared vertical distance between each data point and the corresponding point on our predicted line. The only differences between this function and the univariate version are the number of parameters and the updated hypothesis function that accommodates them. Also, remember that x⁽ⁱ⁾ is now a vector containing all the features of the i-th example, not just one value.
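
A minimal vectorized sketch of this cost function, assuming the examples are stored in a matrix X whose first column is all ones (the variable names and toy data are my own):

```python
import numpy as np

def compute_cost(X, y, theta):
    """J(theta) = (1 / 2m) * sum of squared differences between predictions and targets."""
    m = len(y)
    errors = X @ theta - y          # h_theta(x^(i)) - y^(i) for every example at once
    return (errors @ errors) / (2 * m)

# Toy data: 3 examples, a leading column of ones for theta_0, then 2 features
X = np.array([[1.0, 1.0, 2.0],
              [1.0, 2.0, 1.0],
              [1.0, 3.0, 3.0]])
y = np.array([6.0, 5.0, 10.0])

print(compute_cost(X, y, np.zeros(3)))   # about 26.8 with all parameters at zero
```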

Naturally, with the cost function also comes the gradient descent algorithm to tune all of our different parameters and minimize the error:

θⱼ := θⱼ - α · (1/m) · Σᵢ₌₁ᵐ (hθ(x⁽ⁱ⁾) - y⁽ⁱ⁾) · xⱼ⁽ⁱ⁾   (updating every θⱼ, for j = 0 to n, simultaneously)

As with the cost function, there aren’t any major modifications. The only differences are the larger number of parameters and the xⱼ⁽ⁱ⁾ term at the end, which is now one entry of the feature vector for the i-th example rather than a single x value. (If you have no clue what this update rule does, please refer to the article I wrote on univariate linear regression linked at the top; I’ve made sure to explain it simply!)
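
Here is a sketch of batch gradient descent using the same X / y layout as the cost example above (names and data are illustrative, not from the original article):

```python
import numpy as np

def gradient_descent(X, y, theta, alpha, num_iters):
    """Repeatedly update every theta_j at once: theta_j -= alpha * (1/m) * sum(error * x_j)."""
    m = len(y)
    for _ in range(num_iters):
        errors = X @ theta - y              # h_theta(x^(i)) - y^(i) for all examples
        gradient = (X.T @ errors) / m       # one partial derivative per parameter
        theta = theta - alpha * gradient    # simultaneous update of all parameters
    return theta

X = np.array([[1.0, 1.0, 2.0],
              [1.0, 2.0, 1.0],
              [1.0, 3.0, 3.0]])
y = np.array([6.0, 5.0, 10.0])

theta = gradient_descent(X, y, np.zeros(3), alpha=0.1, num_iters=2000)
print(theta)   # approaches [1. 1. 2.], which fits this toy data exactly
```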

Feature Scaling

When applying multivariate linear regression in practice, an interesting phenomenon occurs: different features tend to span wildly different ranges. These varying ranges make gradient descent much slower to converge. We can fix this with feature scaling, whose goal is to bring all the feature values into roughly the range of -1 to 1. Applying a few simple operations to each feature’s values immensely speeds up the gradient descent process.

Some might also apply mean normalization, which aims to satisfy roughly -0.5 ≤ x ≤ 0.5, where x is the scaled feature value. This is done by subtracting the feature’s average value from each entry, then dividing the result by the feature’s range (its largest value minus its smallest).
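
A sketch of mean normalization as just described, assuming each column of X is one feature (the data and the choice of dividing by the range are illustrative; the standard deviation is another common divisor):

```python
import numpy as np

def mean_normalize(X):
    """Shift each feature to zero mean and scale it by its range (max minus min)."""
    mu = X.mean(axis=0)
    feature_range = X.max(axis=0) - X.min(axis=0)
    return (X - mu) / feature_range

# Two made-up features with very different ranges (square footage, number of bedrooms)
X = np.array([[2104.0, 3.0],
              [1600.0, 3.0],
              [ 852.0, 2.0]])
print(mean_normalize(X))   # every value now sits roughly between -0.7 and 0.5
```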

Learning Rate

Various problems may arise when adjusting the learning rate. If you set the learning rate too large, you may overshoot the correct values and end up diverging from the minimum. However, if you set the learning rate too small, it will take much longer for your algorithm to find the minimum parameter values. In short, make sure that your learning rate results in a gradual decrease of the cost function as the iterations progress. (If the learning rate still seems confusing, please reference my article on univariate gradient descent; it covers this topic and others that I’ve referenced so far.)
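
One practical way to check this, sketched below with the same toy data as before (names and values are my own): record the cost at every iteration and make sure it keeps going down. If it grows, the learning rate is too large; if it barely moves, it is too small.

```python
import numpy as np

def gradient_descent_with_history(X, y, theta, alpha, num_iters):
    """Run gradient descent and record the cost at every iteration for inspection."""
    m = len(y)
    history = []
    for _ in range(num_iters):
        errors = X @ theta - y
        history.append((errors @ errors) / (2 * m))   # cost before this update
        theta = theta - alpha * (X.T @ errors) / m
    return theta, history

X = np.array([[1.0, 1.0, 2.0],
              [1.0, 2.0, 1.0],
              [1.0, 3.0, 3.0]])
y = np.array([6.0, 5.0, 10.0])

for alpha in (0.3, 0.1, 0.001):
    _, history = gradient_descent_with_history(X, y, np.zeros(3), alpha, 50)
    print(f"alpha={alpha}: cost went from {history[0]:.3g} to {history[-1]:.3g}")
    # 0.3 diverges on this data, 0.1 converges nicely, 0.001 decreases only slowly
```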

Normal Equation

The normal equation is another method you can use for linear regression; it explicitly solves for the parameters that give the minimum cost:

θ = (XᵀX)⁻¹Xᵀy

Let me provide a more concrete example. Say I have a 2x2 matrix called X filled with some numbers and a vector y with 2 numbers. I first compute the product of the transpose of X and X itself (XᵀX). I then take the inverse of that product and multiply it by Xᵀy, the transpose of X times the vector y. The resulting vector contains the parameter values that produce the lowest cost for the data in X and y. Whereas gradient descent iterates over and over again to find the parameter values, the normal equation finds them in one shot.
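
The same calculation in code, reusing the toy data from the gradient descent sketch (in practice np.linalg.solve or np.linalg.pinv is usually preferred over an explicit inverse for numerical stability):

```python
import numpy as np

# Same toy data as before: a leading column of ones, then the feature values
X = np.array([[1.0, 1.0, 2.0],
              [1.0, 2.0, 1.0],
              [1.0, 3.0, 3.0]])
y = np.array([6.0, 5.0, 10.0])

# theta = (X^T X)^-1 X^T y
theta = np.linalg.inv(X.T @ X) @ (X.T @ y)
print(theta)   # about [1. 1. 2.], the same parameters gradient descent approached iteratively
```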

Now you might be wondering why I even bothered explaining gradient descent when we could just use the normal equation. Well, there are pros and cons to both methods. When using the normal equation, you do not need to choose a learning rate (⍺), which removes the risk of divergence, and you do not need to iterate. However, if you have large datasets with tons and tons of features, gradient descent becomes the better option. Because you need to compute the (XᵀX)⁻¹ term, the cost of this computation is roughly O(n³) in the number of features n. (This notation is called Big O notation; check this video about it for clarification.) So if you had 10,000 features, that means on the order of 10¹² operations, a costly computation even by modern standards.

Well, there you have it — multivariate linear regression. This was certainly a lot of material to cover, so I strongly suggest you read over this article and explore the links I have suggested to gain a firmer grasp of the different concepts. If you have been following my series, I’m proud to say that you will sound much more knowledgeable when anyone brings up machine learning — trust me, it happens a lot more often than you might think 😁. I hope you enjoyed this article, and follow me on this account for future articles!

More about me — my name is Rohan, I am a 16-year-old high school student learning about disruptive technologies and I’ve chosen to start with A.I. To reach me, contact me through my email or my LinkedIn. I’d be more than glad to provide any insight or to learn the insights that you may have. Additionally, I would appreciate it if you could join my monthly newsletter. Until the next article 👋!
