Linear regression uses a hypothesis space of linear functions of continuous-valued inputs.

Univariate linear regression

The function with input $x$ and output $y$,

$$y = h_w(x) = w_1 x + w_0,$$

where $w_0, w_1$ are real-valued coefficients to be learned, is a univariate linear function.

This function aims to create the best-fit line, finding the weights $\mathbf{w} = [w_0, w_1]$ that fit the data well.

Figure: a best-fit line through the data points.

To measure the fit of this best-fit line for a set of $N$ examples $(x_j, y_j)$, we can compute the loss function as follows:

$$Loss(h_w) = \sum_{j=1}^{N} \left(y_j - h_w(x_j)\right)^2 = \sum_{j=1}^{N} \left(y_j - (w_1 x_j + w_0)\right)^2$$

Note that:

  • $\hat{y}_j = h_w(x_j)$ - the prediction of $y_j$ is the result of the linear function applied to $x_j$

In matrix form, let $\mathbf{X}$ be the matrix of inputs (one row per example, with a leading column of ones), $\mathbf{w}$ the weight vector, and $\mathbf{y}$ the output vector. Then, the error vector can be seen as:

$$\mathbf{e} = \mathbf{y} - \mathbf{X}\mathbf{w}$$

Thus, using the dot product, we can get the scalar value $\mathbf{e}^\top\mathbf{e}$, the sum of squared errors, by doing the dot product $\mathbf{e} \cdot \mathbf{e} = \mathbf{e}^\top\mathbf{e} = (\mathbf{y} - \mathbf{X}\mathbf{w})^\top(\mathbf{y} - \mathbf{X}\mathbf{w})$.
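As a concrete illustration of the dot-product form of the sum of squared errors, here is a minimal NumPy sketch; the data and weights are made up for the example.

```python
import numpy as np

# Made-up data: N examples of (x, y), and a candidate weight vector w = [w0, w1]
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])
w = np.array([0.0, 2.0])                     # w0 (intercept), w1 (slope)

X = np.column_stack([np.ones_like(x), x])    # inputs with a leading column of ones

e = y - X @ w                                # error vector e = y - Xw
sse = e @ e                                  # sum of squared errors as the dot product e·e
print(sse)                                   # same value as np.sum((y - X @ w) ** 2)
```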

To minimise the loss, we can

  • naively enumerate all possible lines and find the one with the lowest mean squared error
  • find the minimum mean squared error by considering the loss landscape

To find the minimum mean squared error, we can differentiate the loss function with respect to each weight $w_i$ and set the result to zero:

$$\frac{\partial}{\partial w_i}\,\frac{1}{2}\sum_{j=1}^{N}\left(y_j - (w_1 x_j + w_0)\right)^2 = 0$$

The factor $\frac{1}{2}$ is for mathematical convenience: it cancels the 2 produced when differentiating the square.

In matrix form, this can then be seen as

$$\nabla_{\mathbf{w}}\,\frac{1}{2}(\mathbf{y} - \mathbf{X}\mathbf{w})^\top(\mathbf{y} - \mathbf{X}\mathbf{w}) = \mathbf{X}^\top(\mathbf{X}\mathbf{w} - \mathbf{y}) = \mathbf{0}$$
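The matrix-form gradient can be sketched as a short NumPy function; this is an illustrative implementation under the notation above, and the function name, data, and learning rate are my own choices.

```python
import numpy as np

def mse_gradient(X: np.ndarray, y: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Gradient of the 1/2 * sum-of-squared-errors loss with respect to w."""
    return X.T @ (X @ w - y)

# Illustrative usage with made-up data
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])
X = np.column_stack([np.ones_like(x), x])
w = np.zeros(2)

grad = mse_gradient(X, y, w)
w = w - 0.01 * grad               # one gradient step with learning rate γ = 0.01
```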

Multivariate linear regression

The hypothesis becomes:

$$h_w(\mathbf{x}) = w_0 + w_1 x_1 + \dots + w_n x_n = \mathbf{w}^\top\mathbf{x}$$

where a dummy feature $x_0 = 1$ is included so the intercept $w_0$ is absorbed into the dot product.

Normal Equation

The cost function - the mean squared error - can be rewritten into matrix form, starting from the error vector:

$$\mathbf{e} = \mathbf{y} - \mathbf{X}\mathbf{w}$$

Note that

  • $\mathbf{X}\mathbf{w}$ refers to the predictions $\hat{\mathbf{y}}$
  • $\mathbf{y}$ refers to the actual values
  • $\mathbf{e} = \mathbf{y} - \mathbf{X}\mathbf{w}$ refers to the vector of errors

To get the sum of squared errors, we can take the dot product of $\mathbf{e}$ and its transpose $\mathbf{e}^\top$:

$$L(\mathbf{w}) = \mathbf{e}^\top\mathbf{e} = (\mathbf{y} - \mathbf{X}\mathbf{w})^\top(\mathbf{y} - \mathbf{X}\mathbf{w})$$

We can then expand this to get

$$L(\mathbf{w}) = \mathbf{y}^\top\mathbf{y} - \mathbf{y}^\top\mathbf{X}\mathbf{w} - \mathbf{w}^\top\mathbf{X}^\top\mathbf{y} + \mathbf{w}^\top\mathbf{X}^\top\mathbf{X}\mathbf{w}$$

Note that these terms are scalars (dot products), so $\mathbf{y}^\top\mathbf{X}\mathbf{w} = \mathbf{w}^\top\mathbf{X}^\top\mathbf{y}$, and similarly the two middle terms can be combined:

$$L(\mathbf{w}) = \mathbf{y}^\top\mathbf{y} - 2\,\mathbf{w}^\top\mathbf{X}^\top\mathbf{y} + \mathbf{w}^\top\mathbf{X}^\top\mathbf{X}\mathbf{w}$$

To minimise $L(\mathbf{w})$, compute the derivative of $L$ with respect to $\mathbf{w}$, and find the value of $\mathbf{w}$ when it is $\mathbf{0}$ (at the minimum):

$$\nabla_{\mathbf{w}} L(\mathbf{w}) = -2\,\mathbf{X}^\top\mathbf{y} + 2\,\mathbf{X}^\top\mathbf{X}\mathbf{w} = \mathbf{0} \quad\Rightarrow\quad \mathbf{X}^\top\mathbf{X}\mathbf{w} = \mathbf{X}^\top\mathbf{y}$$

Assuming that $\mathbf{X}^\top\mathbf{X}$ is invertible, we can get $\mathbf{w}$ via the normal equation:

$$\mathbf{w} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}$$
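A minimal sketch of solving the normal equation with NumPy, on made-up data; solving the linear system with `np.linalg.solve` instead of forming the inverse explicitly is a standard numerical choice rather than something stated in these notes.

```python
import numpy as np

# Made-up data: N = 4 examples with n = 2 features, plus a leading column of ones
X = np.array([
    [1.0, 2.0, 3.0],
    [1.0, 1.0, 5.0],
    [1.0, 4.0, 2.0],
    [1.0, 3.0, 4.0],
])
y = np.array([14.0, 18.0, 15.0, 19.0])   # targets generated from w = [1, 2, 3] for illustration

# Normal equation: w = (XᵀX)⁻¹ Xᵀ y, solved as a linear system for numerical stability
w = np.linalg.solve(X.T @ X, X.T @ y)
print(w)                                  # recovers approximately [1, 2, 3]
```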

Pros

  • Feature scaling is unnecessary
  • Solved in a single step (closed form) rather than iteratively
  • No need to find learning rate γ

Cons

  • Slower, as calculating the inverse of $\mathbf{X}^\top\mathbf{X}$ can be a problem of $O(n^3)$ complexity
  • $\mathbf{X}^\top\mathbf{X}$ needs to be an invertible matrix

Feature Transformations

Figure: three panels of the same dataset - Example Data ($x$ against $y$, with $x$ roughly in $[-100, 100]$), Feature Scaling ($x$ rescaled to $[-1, 1]$ against $y$), and Feature Engineering ($x^2$ against $y$).

This specialised case is called polynomial regression.

Feature engineering

Based on existing features, new features are created (see the sketch after the list below).

  • Polynomial features: creation of new features $x^2, x^3, \dots, x^k$, where $k$ is the polynomial degree.
  • Log features: creation of new features $\log(x)$
  • Exponential features: creation of new features $e^x$
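A small NumPy sketch of these feature transformations on a single illustrative feature; the chosen degree and variable names are assumptions for the example.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])   # illustrative single feature

# Polynomial features up to degree k
k = 3
poly = np.column_stack([x ** d for d in range(2, k + 1)])   # x^2, x^3

# Log and exponential features (log requires positive inputs)
log_feat = np.log(x)
exp_feat = np.exp(x)

# New input matrix: column of ones, the original feature, and the engineered features
X = np.column_stack([np.ones_like(x), x, poly, log_feat, exp_feat])
```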

Feature scaling

The features are rescaled onto comparable ranges while preserving the relative distances between the values of each feature.

  • mean normalisation (standardisation)
  • min-max scaling
  • robust scaling

An alternative to feature scaling would be to use a separate learning rate γ for each feature.
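The three scaling methods above can be sketched as follows, using common definitions (standardisation by mean and standard deviation, min-max scaling to [0, 1], robust scaling by median and interquartile range); treat these exact formulas as assumptions rather than definitions from the notes.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])   # illustrative feature with an outlier

# Mean normalisation (standardisation): subtract the mean, divide by the standard deviation
standardised = (x - x.mean()) / x.std()

# Min-max scaling: map the values into [0, 1]
min_max = (x - x.min()) / (x.max() - x.min())

# Robust scaling: subtract the median, divide by the interquartile range (less outlier-sensitive)
q1, q3 = np.percentile(x, [25, 75])
robust = (x - np.median(x)) / (q3 - q1)
```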

Gradient descent

Basic concepts

Vectors:

  • column vector $\mathbf{w} = [w_1, \dots, w_n]^\top$ and row vector $\mathbf{w}^\top = [w_1, \dots, w_n]$
  • dot product $\mathbf{w} \cdot \mathbf{x} = \mathbf{w}^\top\mathbf{x} = \sum_i w_i x_i$
  • $\frac{\partial L}{\partial w_i}$: the partial derivative of a loss function $L$ with respect to a weight $w_i$
  • the gradient: the vector of all the partial derivatives, $\nabla L = \left[\frac{\partial L}{\partial w_0}, \dots, \frac{\partial L}{\partial w_n}\right]$

MSE loss function is convex for linear regression

There is only one minimum, which is also the global minimum.

Idea

  • Start at an initial weight vector $\mathbf{w}$
  • Pick a nearby $\mathbf{w}$ that reduces the loss, taking a step in the direction of the negative gradient: $w_i \leftarrow w_i - \gamma \frac{\partial L}{\partial w_i}$
  • γ is the learning rate - the parameter that controls how much the model adjusts its parameters during training
  • Repeat until the minimum is found
Figure: a single gradient descent step on the loss function - the gradient determines the direction, and how much it moves in that direction is determined by the learning rate.

When calculating gradient descent for multiple variables, make sure not to update any variable before calculating the step size for all of the variables (simultaneous update), as in the sketch below.
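A sketch of batch gradient descent with a simultaneous update, reusing the matrix notation from earlier; the learning rate and number of steps are illustrative choices.

```python
import numpy as np

def gradient_descent(X: np.ndarray, y: np.ndarray, gamma: float = 0.01,
                     n_steps: int = 1000) -> np.ndarray:
    """Batch gradient descent on the 1/2 * sum-of-squared-errors loss."""
    w = np.zeros(X.shape[1])
    for _ in range(n_steps):
        grad = X.T @ (X @ w - y)   # compute the full gradient first...
        w = w - gamma * grad       # ...then update all weights at once (simultaneous update)
    return w

# Illustrative usage with made-up data
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])
X = np.column_stack([np.ones_like(x), x])
print(gradient_descent(X, y))
```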

Convexity

A convex function has a single global minimum

MSE loss function is convex for linear regression.

MSE loss function is convex for polynomial regression, as feature transformations do not modify the linear regression model, but just the features themselves. Hence, convexity is unaffected.

Variants

Stochastic gradient descent

Randomly selects one training example at each step and updates the weights according to the gradient descent equation.

Mini-batch gradient descent

Chooses a mini-batch of size $m$ from the $N$ examples.

As the standard error of the gradient estimate is proportional to $\frac{1}{\sqrt{m}}$, using $m$ examples instead of $N$ increases the standard error only by a factor of $\sqrt{N/m}$ while making each step cheaper, meaning even if more steps are needed, it can still be faster than full-batch gradient descent.

As the number of datapoints per step decreases, gradient descent becomes cheaper per iteration.

As the number of datapoints per step decreases, the randomness of the gradient estimate increases, creating the possibility of escaping local minima.
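A sketch of mini-batch stochastic gradient descent under the same assumptions as the batch version above; setting the batch size $m = 1$ gives plain stochastic gradient descent. The batch size, shuffling scheme, and epoch count are illustrative choices.

```python
import numpy as np

def minibatch_sgd(X: np.ndarray, y: np.ndarray, m: int = 2, gamma: float = 0.01,
                  n_epochs: int = 200) -> np.ndarray:
    """Mini-batch SGD on the 1/2 * sum-of-squared-errors loss; m = 1 gives plain SGD."""
    rng = np.random.default_rng(0)
    N = X.shape[0]
    w = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        order = rng.permutation(N)               # shuffle the examples each epoch
        for start in range(0, N, m):
            batch = order[start:start + m]       # indices of the current mini-batch
            Xb, yb = X[batch], y[batch]
            grad = Xb.T @ (Xb @ w - yb)          # gradient estimated from the mini-batch only
            w = w - gamma * grad
    return w

# Illustrative usage with made-up data
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])
X = np.column_stack([np.ones_like(x), x])
print(minibatch_sgd(X, y))
```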