Linear regression uses a hypothesis space of linear functions of continuous-valued inputs.
Univariate linear regression
The function $h_w(x) = w_1 x + w_0$ with input $x$ and output $y$, where $w_0$ and $w_1$ are real-valued coefficients to be learned, is a univariate linear function.
This function aims to create the best-fit line, finding the weights $[w_0, w_1]$ that fit the data well.
To measure the fit of this best-fit line, for a set of $N$ examples $(x_j, y_j)$,
we can compute the loss function (the sum of squared errors) as follows:
$L(h_w) = \sum_{j=1}^{N} \bigl(y_j - h_w(x_j)\bigr)^2$
Note that:
- the prediction $\hat{y}_j$ of $y_j$ is the result of the linear function $h_w(x_j)$
In matrix form, let $\mathbf{x}$ be the input vector, $\mathbf{X}$ be the design matrix built from the inputs (a leading column of ones for $w_0$, then the inputs, so that $\mathbf{X}\mathbf{w}$ gives the function values), and $\mathbf{y}$ be the output vector. Then, the error vector can be seen as:
$\mathbf{e} = \mathbf{y} - \mathbf{X}\mathbf{w}$
Thus, using the dot product, we can get the scalar value $L$, the sum of squared errors, by computing the dot product $\mathbf{e} \cdot \mathbf{e} = \mathbf{e}^\top \mathbf{e}$.
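A minimal NumPy sketch of this computation (the toy data and candidate weights below are made up for illustration):

```python
import numpy as np

# Illustrative data: N examples of a single feature x and target y
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])

w0, w1 = 0.5, 1.8          # hypothetical candidate coefficients
y_hat = w1 * x + w0        # predictions h_w(x_j) for every example
e = y - y_hat              # error vector e
sse = e @ e                # sum of squared errors as the dot product e . e
print(sse)
```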
To minimise the loss, we can either:
naively enumerate all possible lines and find the one with the lowest mean squared error, or
find the minimum mean squared error analytically by considering the loss landscape.
To find the minimum mean squared error, we can differentiate the loss function (scaled by $\frac{1}{2}$) with regard to each weight and set the derivatives to zero:
$\frac{\partial}{\partial w_0}\,\frac{1}{2}\sum_{j=1}^{N}\bigl(y_j - h_w(x_j)\bigr)^2 = -\sum_{j=1}^{N}\bigl(y_j - h_w(x_j)\bigr) = 0, \qquad \frac{\partial}{\partial w_1}\,\frac{1}{2}\sum_{j=1}^{N}\bigl(y_j - h_w(x_j)\bigr)^2 = -\sum_{j=1}^{N}\bigl(y_j - h_w(x_j)\bigr)\,x_j = 0$
The $\frac{1}{2}$ is for mathematical convenience.
In matrix form, this can then be seen as
$\nabla_{\mathbf{w}} L(\mathbf{w}) = -\mathbf{X}^\top(\mathbf{y} - \mathbf{X}\mathbf{w}) = \mathbf{0}$
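A small NumPy sketch of the matrix-form loss and gradient, using the same design-matrix convention as above (the data and weights are again illustrative):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])

X = np.column_stack([np.ones_like(x), x])   # design matrix: a column of ones for w0, then x
w = np.array([0.5, 1.8])                    # [w0, w1], hypothetical values

e = y - X @ w                               # error vector
loss = 0.5 * (e @ e)                        # (1/2) * sum of squared errors
grad = -X.T @ e                             # gradient of the loss with respect to w
print(loss, grad)
```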
Multivariate linear regression
The hypothesis becomes:
$h_{\mathbf{w}}(\mathbf{x}) = w_0 + w_1 x_1 + \dots + w_n x_n = \mathbf{w}^\top \mathbf{x}$
(using the convention $x_0 = 1$ so that $w_0$ is absorbed into the vector product).
Normal Equation
The cost function - the sum of squared errors, which has the same minimiser as the mean squared error - can be rewritten into matrix form:
$L(\mathbf{w}) = (\mathbf{y} - \mathbf{X}\mathbf{w})^\top (\mathbf{y} - \mathbf{X}\mathbf{w})$
Note that
- $\mathbf{X}\mathbf{w}$ refers to the predictions
- $\mathbf{y}$ refers to the actual values
- $\mathbf{e} = \mathbf{y} - \mathbf{X}\mathbf{w}$ refers to the vector of errors
To get the sum of squared errors, we take the dot product of $\mathbf{e}$ and its transpose $\mathbf{e}^\top$, i.e. $\mathbf{e}^\top \mathbf{e}$.
We can then expand this to get
$L(\mathbf{w}) = \mathbf{y}^\top \mathbf{y} - \mathbf{y}^\top \mathbf{X}\mathbf{w} - \mathbf{w}^\top \mathbf{X}^\top \mathbf{y} + \mathbf{w}^\top \mathbf{X}^\top \mathbf{X}\mathbf{w} = \mathbf{y}^\top \mathbf{y} - 2\,\mathbf{w}^\top \mathbf{X}^\top \mathbf{y} + \mathbf{w}^\top \mathbf{X}^\top \mathbf{X}\mathbf{w}$
Note that these are dot products (scalars), so $\mathbf{y}^\top \mathbf{X}\mathbf{w} = \mathbf{w}^\top \mathbf{X}^\top \mathbf{y}$, and similarly the two middle terms combine into $-2\,\mathbf{w}^\top \mathbf{X}^\top \mathbf{y}$.
To minimise $L(\mathbf{w})$, compute the derivative of $L$ with respect to $\mathbf{w}$, and find the value of $\mathbf{w}$ at which it is $\mathbf{0}$ (at the minimum):
$\nabla_{\mathbf{w}} L(\mathbf{w}) = -2\,\mathbf{X}^\top \mathbf{y} + 2\,\mathbf{X}^\top \mathbf{X}\mathbf{w} = \mathbf{0} \quad\Rightarrow\quad \mathbf{X}^\top \mathbf{X}\mathbf{w} = \mathbf{X}^\top \mathbf{y}$
Assuming that $\mathbf{X}^\top \mathbf{X}$ is invertible, we can get $\mathbf{w}$:
$\mathbf{w} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}$
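A minimal NumPy sketch of solving the normal equation on made-up data; np.linalg.solve is used rather than forming the inverse explicitly, which is numerically preferable but otherwise equivalent:

```python
import numpy as np

# Illustrative data: 4 examples with 2 features
X_raw = np.array([[1.0, 2.0],
                  [2.0, 0.5],
                  [3.0, 1.5],
                  [4.0, 3.0]])
y = np.array([3.0, 2.4, 4.9, 7.8])

X = np.column_stack([np.ones(len(X_raw)), X_raw])   # prepend a column of ones for w0

# Normal equation: solve (X^T X) w = X^T y
w = np.linalg.solve(X.T @ X, X.T @ y)
print(w)
```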
Pros
Feature scaling is unnecessary
Single iteration
No need to find learning rate γ
Cons
Slower, as calculating the inverse of $\mathbf{X}^\top \mathbf{X}$ is an $O(n^3)$ operation in the number of features $n$
$\mathbf{X}^\top \mathbf{X}$ needs to be an invertible matrix
Feature Transformations
[Figure: three plots of the same dataset - the example data (x against y), the data after feature scaling, and the data after feature engineering (x^2 against y). This specialised case is called polynomial regression.]
Feature engineering
Based on existing features, new features are created.
Polynomial features: creation of new features $x^k$, where $k$ is the polynomial degree.
Log features: creation of new features $\log(x)$
Exponential features: creation of new features $e^x$
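A short NumPy sketch of these transformations on an illustrative feature:

```python
import numpy as np

x = np.array([0.5, 1.0, 2.0, 4.0])   # an illustrative, strictly positive feature

x_poly = np.column_stack([x ** k for k in range(1, 4)])  # polynomial features x, x^2, x^3
x_log = np.log(x)                                        # log feature (requires x > 0)
x_exp = np.exp(x)                                        # exponential feature

# The transformed columns are then used as inputs to ordinary linear regression;
# regression on x, x^2, x^3 is exactly polynomial regression.
```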
Feature scaling
The features are scaled so that the relative distances between values are comparable across features.
mean normalisation (standardisation)
min-max scaling
robust scaling
An alternative to feature scaling would be to use a separate learning rate γ for each feature.
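A short NumPy sketch of the three scaling methods on an illustrative feature (the exact conventions, e.g. dividing by the standard deviation or by the interquartile range, can vary between courses and libraries):

```python
import numpy as np

x = np.array([1.0, 2.0, 4.0, 100.0])   # an illustrative feature with an outlier

# Mean normalisation (standardisation): zero mean, unit standard deviation
x_std = (x - x.mean()) / x.std()

# Min-max scaling: map values into [0, 1]
x_minmax = (x - x.min()) / (x.max() - x.min())

# Robust scaling: centre on the median and divide by the interquartile range,
# which makes the result less sensitive to the outlier
q1, q3 = np.percentile(x, [25, 75])
x_robust = (x - np.median(x)) / (q3 - q1)

print(x_std, x_minmax, x_robust, sep="\n")
```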
Gradient descent
Basic concepts
Vectors:
column vector $\mathbf{w} = \begin{bmatrix} w_1 \\ \vdots \\ w_n \end{bmatrix}$ and row vector $\mathbf{w}^\top = [w_1, \dots, w_n]$
dot product $\mathbf{a} \cdot \mathbf{b} = \mathbf{a}^\top \mathbf{b} = \sum_i a_i b_i$
the partial derivative $\frac{\partial L}{\partial w_i}$ of a loss function $L$ by a weight $w_i$
the gradient $\nabla L$: the vector of all the partial derivatives $\left[\frac{\partial L}{\partial w_1}, \dots, \frac{\partial L}{\partial w_n}\right]^\top$
MSE loss function is convex for linear regression
There is only one minimum, which is therefore the global minimum
Idea
Start at a weight vector $\mathbf{w}$
Pick a nearby $\mathbf{w}$ that reduces the loss, by taking a step against the gradient:
$w_i \leftarrow w_i - \gamma\, \frac{\partial}{\partial w_i} L(\mathbf{w})$
γ is the learning rate - the parameter that controls how much a model adjusts its parameters during training
Repeat until the minimum is found
[Figure: a gradient descent step on the loss function - the gradient gives the direction; how much it moves in that direction is determined by the learning rate.]
When performing gradient descent over multiple variables, make sure not to update any variable before the step sizes for all the variables have been calculated (update them simultaneously).
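A minimal NumPy sketch of full-batch gradient descent with simultaneous updates; the function name, toy data, learning rate, and step count are illustrative choices, not part of the notes:

```python
import numpy as np

def gradient_descent(X, y, gamma=0.01, steps=1000):
    """Full-batch gradient descent for linear regression with a squared-error loss."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        e = y - X @ w          # errors under the *current* weights
        grad = -X.T @ e        # gradient computed for all weights first...
        w = w - gamma * grad   # ...then all weights updated simultaneously
    return w

# Illustrative usage on toy data
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])
X = np.column_stack([np.ones_like(x), x])
print(gradient_descent(X, y))
```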
Convexity
A convex function has a single global minimum
MSE loss function is convex for linear regression.
MSE loss function is convex for polynomial regression, as feature transformations do not modify the linear regression model, but just the features themselves. Hence, convexity is unaffected.
Variants
Stochastic gradient descent
Randomly selects one training example at each step and updates according to the gradient descent equation.
Mini-batch gradient descent
Chooses a mini-batch of size $m$ from the $N$ examples at each step.
As the standard error of the gradient estimate is proportional to $1/\sqrt{m}$, shrinking the batch from $N$ to $m$ examples increases the standard error only by a factor of $\sqrt{N/m}$, meaning that even if more steps are needed, it can still be faster than full-batch gradient descent.
As the number of datapoints per batch decreases, the cost of gradient descent per iteration becomes cheaper.
As the number of datapoints per batch decreases, the randomness increases, creating a possibility of escaping local minima.
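A minimal NumPy sketch of mini-batch gradient descent (setting m = 1 recovers stochastic gradient descent); the function name, data, and hyperparameters are illustrative:

```python
import numpy as np

def minibatch_gradient_descent(X, y, m=2, gamma=0.01, epochs=200, seed=0):
    """Mini-batch gradient descent; m = 1 gives stochastic gradient descent."""
    rng = np.random.default_rng(seed)
    N = len(y)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        order = rng.permutation(N)              # shuffle the examples each epoch
        for start in range(0, N, m):
            batch = order[start:start + m]      # mini-batch of (up to) m examples
            e = y[batch] - X[batch] @ w
            w = w - gamma * (-X[batch].T @ e)   # gradient step on the mini-batch only
    return w

# Illustrative usage on the same toy data as above
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])
X = np.column_stack([np.ones_like(x), x])
print(minibatch_gradient_descent(X, y))
```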