Linear regression uses a hypothesis space of linear functions of continuous-valued inputs.

Univariate linear regression

The function with input $x$ and output $y$,

$$y = h_w(x) = w_1 x + w_0,$$

where $w_0, w_1$ are real-valued coefficients to be learned, is a univariate linear function.

This function aims to create the best-fit line, finding the weights $\mathbf{w} = [w_0, w_1]$ that fit the data well.

Figure: a best-fit line through the data points.

To measure the fit of this best-fit line for a set of $N$ examples $(x_j, y_j)$, we can compute the loss function as follows:

$$Loss(h_w) = \sum_{j=1}^{N} \left(y_j - h_w(x_j)\right)^2 = \sum_{j=1}^{N} \left(y_j - (w_1 x_j + w_0)\right)^2$$

Note that:

  • $\hat{y}_j = h_w(x_j)$ - the prediction of $y_j$ is the result of the linear function applied to $x_j$

In matrix form, let $\mathbf{X}$ be the matrix of inputs (one row per example, with a leading column of ones), $\mathbf{w}$ the weight vector, and $\mathbf{y}$ the output vector. Then, the error vector can be seen as:

$$\mathbf{e} = \mathbf{y} - \mathbf{X}\mathbf{w}$$

Thus, using the dot product, we can get the scalar value $\mathbf{e}^\top\mathbf{e}$, the sum of squared errors, by doing the dot product $\mathbf{e} \cdot \mathbf{e} = \mathbf{e}^\top\mathbf{e} = (\mathbf{y} - \mathbf{X}\mathbf{w})^\top(\mathbf{y} - \mathbf{X}\mathbf{w})$.
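As a concrete illustration of the dot-product form of the sum of squared errors, here is a minimal NumPy sketch; the data and weights are made up for the example.

```python
import numpy as np

# Made-up data: N examples of (x, y), and a candidate weight vector w = [w0, w1]
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])
w = np.array([0.0, 2.0])                     # w0 (intercept), w1 (slope)

X = np.column_stack([np.ones_like(x), x])    # inputs with a leading column of ones

e = y - X @ w                                # error vector e = y - Xw
sse = e @ e                                  # sum of squared errors as the dot product e·e
print(sse)                                   # same value as np.sum((y - X @ w) ** 2)
```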

To minimise the loss, we can

  • naively enumerate all possible lines and find the one with the lowest mean squared error
  • find the minimum mean squared error by considering the loss landscape

To find the minimum mean squared error, we can differentiate the loss function with respect to each weight $w_i$ and set the result to zero:

$$\frac{\partial}{\partial w_i}\,\frac{1}{2}\sum_{j=1}^{N}\left(y_j - (w_1 x_j + w_0)\right)^2 = 0$$

The factor $\frac{1}{2}$ is for mathematical convenience: it cancels the 2 produced when differentiating the square.

In matrix form, this can then be seen as

$$\nabla_{\mathbf{w}}\,\frac{1}{2}(\mathbf{y} - \mathbf{X}\mathbf{w})^\top(\mathbf{y} - \mathbf{X}\mathbf{w}) = \mathbf{X}^\top(\mathbf{X}\mathbf{w} - \mathbf{y}) = \mathbf{0}$$
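The matrix-form gradient can be sketched as a short NumPy function; this is an illustrative implementation under the notation above, and the function name, data, and learning rate are my own choices.

```python
import numpy as np

def mse_gradient(X: np.ndarray, y: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Gradient of the 1/2 * sum-of-squared-errors loss with respect to w."""
    return X.T @ (X @ w - y)

# Illustrative usage with made-up data
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])
X = np.column_stack([np.ones_like(x), x])
w = np.zeros(2)

grad = mse_gradient(X, y, w)
w = w - 0.01 * grad               # one gradient step with learning rate γ = 0.01
```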

Multivariate linear regression

The hypothesis becomes:

$$h_w(\mathbf{x}) = w_0 + w_1 x_1 + \dots + w_n x_n = \mathbf{w}^\top\mathbf{x}$$

where a dummy feature $x_0 = 1$ is included so the intercept $w_0$ is absorbed into the dot product.

Normal Equation

The cost function - the mean squared error - can be rewritten into matrix form, starting from the error vector:

$$\mathbf{e} = \mathbf{y} - \mathbf{X}\mathbf{w}$$

Note that

  • $\mathbf{X}\mathbf{w}$ refers to the predictions $\hat{\mathbf{y}}$
  • $\mathbf{y}$ refers to the actual values
  • $\mathbf{e} = \mathbf{y} - \mathbf{X}\mathbf{w}$ refers to the vector of errors

To get the sum of squared errors, we can take the dot product of $\mathbf{e}$ and its transpose $\mathbf{e}^\top$:

$$L(\mathbf{w}) = \mathbf{e}^\top\mathbf{e} = (\mathbf{y} - \mathbf{X}\mathbf{w})^\top(\mathbf{y} - \mathbf{X}\mathbf{w})$$

We can then expand this to get

$$L(\mathbf{w}) = \mathbf{y}^\top\mathbf{y} - \mathbf{y}^\top\mathbf{X}\mathbf{w} - \mathbf{w}^\top\mathbf{X}^\top\mathbf{y} + \mathbf{w}^\top\mathbf{X}^\top\mathbf{X}\mathbf{w}$$

Note that these terms are scalars (dot products), so $\mathbf{y}^\top\mathbf{X}\mathbf{w} = \mathbf{w}^\top\mathbf{X}^\top\mathbf{y}$, and similarly the two middle terms can be combined:

$$L(\mathbf{w}) = \mathbf{y}^\top\mathbf{y} - 2\,\mathbf{w}^\top\mathbf{X}^\top\mathbf{y} + \mathbf{w}^\top\mathbf{X}^\top\mathbf{X}\mathbf{w}$$

To minimise $L(\mathbf{w})$, compute the derivative of $L$ with respect to $\mathbf{w}$, and find the value of $\mathbf{w}$ when it is $\mathbf{0}$ (at the minimum):

$$\nabla_{\mathbf{w}} L(\mathbf{w}) = -2\,\mathbf{X}^\top\mathbf{y} + 2\,\mathbf{X}^\top\mathbf{X}\mathbf{w} = \mathbf{0} \quad\Rightarrow\quad \mathbf{X}^\top\mathbf{X}\mathbf{w} = \mathbf{X}^\top\mathbf{y}$$

Assuming that $\mathbf{X}^\top\mathbf{X}$ is invertible, we can get $\mathbf{w}$ via the normal equation:

$$\mathbf{w} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}$$
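A minimal sketch of solving the normal equation with NumPy, on made-up data; solving the linear system with `np.linalg.solve` instead of forming the inverse explicitly is a standard numerical choice rather than something stated in these notes.

```python
import numpy as np

# Made-up data: N = 4 examples with n = 2 features, plus a leading column of ones
X = np.array([
    [1.0, 2.0, 3.0],
    [1.0, 1.0, 5.0],
    [1.0, 4.0, 2.0],
    [1.0, 3.0, 4.0],
])
y = np.array([14.0, 18.0, 15.0, 19.0])   # targets generated from w = [1, 2, 3] for illustration

# Normal equation: w = (XᵀX)⁻¹ Xᵀ y, solved as a linear system for numerical stability
w = np.linalg.solve(X.T @ X, X.T @ y)
print(w)                                  # recovers approximately [1, 2, 3]
```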

Pros

  • Feature scaling is unnecessary
  • Solved in a single step (closed form) rather than iteratively
  • No need to find learning rate γ

Cons

  • Slower, as calculating the inverse of $\mathbf{X}^\top\mathbf{X}$ can be a problem of $O(n^3)$ complexity
  • $\mathbf{X}^\top\mathbf{X}$ needs to be an invertible matrix

Feature Transformations

Figure: three panels of the same dataset - Example Data ($x$ against $y$, with $x$ roughly in $[-100, 100]$), Feature Scaling ($x$ rescaled to $[-1, 1]$ against $y$), and Feature Engineering ($x^2$ against $y$).

This specialised case is called polynomial regression.

Feature engineering

Based on existing features, new features are created (see the sketch after the list below).

  • Polynomial features: creation of new features $x^2, x^3, \dots, x^k$, where $k$ is the polynomial degree.
  • Log features: creation of new features $\log(x)$
  • Exponential features: creation of new features $e^x$
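A small NumPy sketch of these feature transformations on a single illustrative feature; the chosen degree and variable names are assumptions for the example.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])   # illustrative single feature

# Polynomial features up to degree k
k = 3
poly = np.column_stack([x ** d for d in range(2, k + 1)])   # x^2, x^3

# Log and exponential features (log requires positive inputs)
log_feat = np.log(x)
exp_feat = np.exp(x)

# New input matrix: column of ones, the original feature, and the engineered features
X = np.column_stack([np.ones_like(x), x, poly, log_feat, exp_feat])
```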

Feature scaling

The features are rescaled onto comparable ranges while preserving the relative distances between the values of each feature.

  • mean normalisation (standardisation)
  • min-max scaling
  • robust scaling

An alternative to feature scaling would be to use a separate learning rate γ for each feature.
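The three scaling methods above can be sketched as follows, using common definitions (standardisation by mean and standard deviation, min-max scaling to [0, 1], robust scaling by median and interquartile range); treat these exact formulas as assumptions rather than definitions from the notes.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])   # illustrative feature with an outlier

# Mean normalisation (standardisation): subtract the mean, divide by the standard deviation
standardised = (x - x.mean()) / x.std()

# Min-max scaling: map the values into [0, 1]
min_max = (x - x.min()) / (x.max() - x.min())

# Robust scaling: subtract the median, divide by the interquartile range (less outlier-sensitive)
q1, q3 = np.percentile(x, [25, 75])
robust = (x - np.median(x)) / (q3 - q1)
```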

Gradient descent

Basic concepts

Vectors:

  • column vector $\mathbf{w} = [w_1, \dots, w_n]^\top$ and row vector $\mathbf{w}^\top = [w_1, \dots, w_n]$
  • dot product $\mathbf{w} \cdot \mathbf{x} = \mathbf{w}^\top\mathbf{x} = \sum_i w_i x_i$
  • $\frac{\partial L}{\partial w_i}$: the partial derivative of a loss function $L$ with respect to a weight $w_i$
  • the gradient: the vector of all the partial derivatives, $\nabla L = \left[\frac{\partial L}{\partial w_0}, \dots, \frac{\partial L}{\partial w_n}\right]$

MSE loss function is convex for linear regression

There is only one minimum, which is also the global minimum.

Idea

  • Start at an initial weight vector $\mathbf{w}$
  • Pick a nearby $\mathbf{w}$ that reduces the loss, taking a step in the direction of the negative gradient: $w_i \leftarrow w_i - \gamma \frac{\partial L}{\partial w_i}$
  • γ is the learning rate - the parameter that controls how much the model adjusts its parameters during training
  • Repeat until the minimum is found
Figure: a single gradient descent step on the loss function - the gradient determines the direction, and how much it moves in that direction is determined by the learning rate.

When calculating gradient descent for multiple variables, make sure not to update any variable before calculating the step size for all of the variables (simultaneous update), as in the sketch below.
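A sketch of batch gradient descent with a simultaneous update, reusing the matrix notation from earlier; the learning rate and number of steps are illustrative choices.

```python
import numpy as np

def gradient_descent(X: np.ndarray, y: np.ndarray, gamma: float = 0.01,
                     n_steps: int = 1000) -> np.ndarray:
    """Batch gradient descent on the 1/2 * sum-of-squared-errors loss."""
    w = np.zeros(X.shape[1])
    for _ in range(n_steps):
        grad = X.T @ (X @ w - y)   # compute the full gradient first...
        w = w - gamma * grad       # ...then update all weights at once (simultaneous update)
    return w

# Illustrative usage with made-up data
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])
X = np.column_stack([np.ones_like(x), x])
print(gradient_descent(X, y))
```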

Convexity

A convex function has a single global minimum

MSE loss function is convex for linear regression.

MSE loss function is convex for polynomial regression, as feature transformations do not modify the linear regression model, but just the features themselves. Hence, convexity is unaffected.

Variants

Stochastic gradient descent

Randomly selects one training example at each step and updates the weights according to the gradient descent equation.

Mini-batch gradient descent

Chooses a mini-batch of size $m$ from the $N$ examples.

As the standard error of the gradient estimate is proportional to $\frac{1}{\sqrt{m}}$, using $m$ examples instead of $N$ increases the standard error only by a factor of $\sqrt{N/m}$ while making each step cheaper, meaning even if more steps are needed, it can still be faster than full-batch gradient descent.

As the number of datapoints per step decreases, gradient descent becomes cheaper per iteration.

As the number of datapoints per step decreases, the randomness of the gradient estimate increases, creating the possibility of escaping local minima.
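A sketch of mini-batch stochastic gradient descent under the same assumptions as the batch version above; setting the batch size $m = 1$ gives plain stochastic gradient descent. The batch size, shuffling scheme, and epoch count are illustrative choices.

```python
import numpy as np

def minibatch_sgd(X: np.ndarray, y: np.ndarray, m: int = 2, gamma: float = 0.01,
                  n_epochs: int = 200) -> np.ndarray:
    """Mini-batch SGD on the 1/2 * sum-of-squared-errors loss; m = 1 gives plain SGD."""
    rng = np.random.default_rng(0)
    N = X.shape[0]
    w = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        order = rng.permutation(N)               # shuffle the examples each epoch
        for start in range(0, N, m):
            batch = order[start:start + m]       # indices of the current mini-batch
            Xb, yb = X[batch], y[batch]
            grad = Xb.T @ (Xb @ w - yb)          # gradient estimated from the mini-batch only
            w = w - gamma * grad
    return w

# Illustrative usage with made-up data
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])
X = np.column_stack([np.ones_like(x), x])
print(minibatch_sgd(X, y))
```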