When doing machine learning, the target is a good hypothesis, where "good" means a hypothesis that minimises the weighted sum of the empirical loss and the complexity of the hypothesis, effectively:

$$h^{*} = \arg\min_{h} \Big[\, \text{Loss}(h) + \lambda \cdot \text{Complexity}(h) \,\Big]$$
A complex model fits all data points, including the noise in the data, so the learned model generalises badly to unseen data because it does not capture the underlying ground truth.
Different Views on Overfitting
Considering a polynomial model $h_w(x) = w_0 + w_1 x + w_2 x^2 + \cdots + w_M x^M$:
the higher the order $M$ (and the larger the magnitudes of the learned weights), the more the model overfits
a solution to this is to keep the weights small during optimisation
put a cost on having large weights
i.e. add a penalty for large weights to the loss function
Solution
Regularization
Explicitly penalising complex hypotheses: look for functions that are more regular.
There are two ways to address overfitting:
Reduce the number of features in the polynomial (feature selection).
Reduce the magnitudes of the weights while keeping all features (regularisation), illustrated below.
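As a quick sketch of the second approach (not from the original notes), the snippet below fits a noisy sine curve with a degree-9 polynomial, once by plain least squares and once with an L2 penalty; the dataset, the degree, and `lam = 1e-3` are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 12)
y = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(x.size)  # noisy ground truth

degree, lam = 9, 1e-3
X = np.vander(x, degree + 1, increasing=True)   # columns: 1, x, x^2, ..., x^9

# Unregularised least squares vs. the L2-penalised closed-form solution.
# (For brevity the penalty here also covers the bias column, unlike the
# R matrix used later in these notes.)
w_plain = np.linalg.lstsq(X, y, rcond=None)[0]
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

print("max |w| without penalty:", np.abs(w_plain).max())  # typically very large
print("max |w| with L2 penalty:", np.abs(w_ridge).max())  # much smaller
```

On a run like this the unpenalised coefficients tend to blow up while the penalised ones stay small, which is the effect the rest of these notes formalise.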
Linear Regression with Regularization
Given the hypothesis $h_w(x) = w^{T}x = w_0 + w_1 x_1 + \cdots + w_n x_n$,
the cost function with $L_2$ regularization is:

$$J(w) = \frac{1}{2m}\left[\sum_{i=1}^{m}\big(h_w(x^{(i)}) - y^{(i)}\big)^{2} + \lambda\sum_{j=1}^{n} w_j^{2}\right]$$
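For concreteness, this cost can be transcribed directly into numpy; this is only a sketch, assuming the design matrix X already carries a leading column of ones for the bias $w_0$, which (as in the notes) is left unpenalised.

```python
import numpy as np

def ridge_cost(w, X, y, lam):
    """L2-regularised squared-error cost J(w); the bias w[0] is not penalised."""
    m = len(y)
    residuals = X @ w - y
    return (residuals @ residuals + lam * np.sum(w[1:] ** 2)) / (2 * m)
```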
Penalty Function
In this scenario, the added term

$$\lambda\sum_{j=1}^{n} w_j^{2}$$

is the $L_2$ penalty function (sum of squared weights), with penalty strength (regularization parameter) $\lambda$; linear regression with this penalty is also known as ridge regression.
The L1 penalty function (lasso regression) is the sum of absolute weights:

$$\lambda\sum_{j=1}^{n} \lvert w_j \rvert$$
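A small sketch of the two penalty terms side by side; the example weight vector is made up, and the bias entry w[0] is excluded from both penalties, as in the notes.

```python
import numpy as np

def l2_penalty(w):
    return np.sum(w[1:] ** 2)       # ridge: sum of squared weights

def l1_penalty(w):
    return np.sum(np.abs(w[1:]))    # lasso: sum of absolute weights

w = np.array([0.5, 3.0, -2.0, 0.1])
print(l2_penalty(w), l1_penalty(w))  # -> 13.01 and 5.1 (up to floating point)
```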
Thus, the new optimisation goal becomes minimising the empirical loss together with the chosen penalty term:

$$\min_{w}\; J(w) = \min_{w}\; \frac{1}{2m}\left[\sum_{i=1}^{m}\big(h_w(x^{(i)}) - y^{(i)}\big)^{2} + \lambda\sum_{j=1}^{n} w_j^{2}\right]$$

Keeping the hypothesis as $h_w(x) = w^{T}x$ with this cost function, the gradient descent update rule is then updated as follows.
Gradient Descent
In this scenario, the optimisation goal is still $\min_{w} J(w)$. Gradient descent repeats these updates, as seen here:

$$w_0 := w_0 - \alpha\,\frac{1}{m}\sum_{i=1}^{m}\big(h_w(x^{(i)}) - y^{(i)}\big)x_0^{(i)}$$

$$w_j := w_j - \alpha\left[\frac{1}{m}\sum_{i=1}^{m}\big(h_w(x^{(i)}) - y^{(i)}\big)x_j^{(i)} + \frac{\lambda}{m}\,w_j\right] \quad \text{for } j = 1, \ldots, n$$

The generalised update rule can be read as:

$$w_j := w_j\left(1 - \alpha\frac{\lambda}{m}\right) - \alpha\,\frac{1}{m}\sum_{i=1}^{m}\big(h_w(x^{(i)}) - y^{(i)}\big)x_j^{(i)}$$

Note that the parameters are slightly shrinking here, as $\left(1 - \alpha\frac{\lambda}{m}\right)$ replaces the coefficient $1$ in the original weight update rule, so each step scales $w_j$ down a little before applying the usual gradient term.
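A possible numpy implementation of this shrinkage form of the update is sketched below; the hyperparameters alpha, lam, and n_iters are illustrative assumptions, and X is again assumed to carry a leading column of ones for the bias.

```python
import numpy as np

def ridge_gradient_descent(X, y, alpha=0.1, lam=0.1, n_iters=1000):
    m, n = X.shape
    w = np.zeros(n)
    for _ in range(n_iters):
        grad = X.T @ (X @ w - y) / m      # gradient of the unregularised loss
        w[1:] *= 1 - alpha * lam / m      # shrink every weight except the bias
        w -= alpha * grad                 # then take the usual gradient step
    return w
```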
Normal Equation
With the loss function

$$J(w) = \frac{1}{2m}\left[\sum_{i=1}^{m}\big(h_w(x^{(i)}) - y^{(i)}\big)^{2} + \lambda\sum_{j=1}^{n} w_j^{2}\right],$$

we can write the regularised normal equation in matrix form:

$$w = \left(X^{T}X + \lambda R\right)^{-1} X^{T} y$$

where $R$ is a diagonal matrix:

$$R = \begin{bmatrix} 0 & & & \\ & 1 & & \\ & & \ddots & \\ & & & 1 \end{bmatrix} \in \mathbb{R}^{(n+1)\times(n+1)}$$

without a $1$ in the top-left entry to ensure the bias term $w_0$ is not regularised.
Adding $\lambda R$ ensures $\left(X^{T}X + \lambda R\right)$ is always invertible provided $\lambda > 0$, making the solution always computable.
Why is $(X^{T}X + \lambda R)$ always invertible?
$X^{T}X$ is symmetric and positive semi-definite (all eigenvalues are $\geq 0$, since $v^{T}X^{T}Xv = \lVert Xv \rVert^{2} \geq 0$ for any $v$).
$\lambda R$ is symmetric and positive semi-definite for $\lambda > 0$, with strictly positive diagonal entries on every regularised weight (only the bias entry is $0$).
Their sum is positive definite (the only direction $\lambda R$ leaves unpenalised is the bias, and the all-ones bias column of $X$ makes $X^{T}X$ strictly positive there), and a positive definite matrix is always invertible.
Thus, the normal equation with regularisation works regardless of whether there is linear dependency among the features or an insufficient number of observations.
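The claim can be checked numerically: the sketch below solves the regularised normal equation on a deliberately rank-deficient design matrix (one feature is an exact multiple of another), where the unregularised $X^{T}X$ would be singular. The toy data and lam = 1.0 are made-up values for illustration only.

```python
import numpy as np

def ridge_normal_equation(X, y, lam):
    n = X.shape[1]
    R = np.eye(n)
    R[0, 0] = 0.0               # do not regularise the bias term
    return np.linalg.solve(X.T @ X + lam * R, X.T @ y)

x1 = np.array([1.0, 2.0, 3.0, 4.0])
X = np.column_stack([np.ones(4), x1, 2 * x1])  # third column duplicates the second
y = np.array([2.0, 4.1, 6.0, 8.2])

w = ridge_normal_equation(X, y, lam=1.0)
print(w)  # solve succeeds even though X^T X alone is singular
```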
Logistic Regression with Regularization
The hypothesis of a logistic regression model is:

$$h_w(x) = \frac{1}{1 + e^{-w^{T}x}}$$

with the cost function (given that $y^{(i)} \in \{0, 1\}$):

$$J(w) = -\frac{1}{m}\sum_{i=1}^{m}\Big[y^{(i)}\log h_w(x^{(i)}) + \big(1 - y^{(i)}\big)\log\big(1 - h_w(x^{(i)})\big)\Big] + \frac{\lambda}{2m}\sum_{j=1}^{n} w_j^{2}$$

and the gradient descent update rule:

$$w_j := w_j - \alpha\left[\frac{1}{m}\sum_{i=1}^{m}\big(h_w(x^{(i)}) - y^{(i)}\big)x_j^{(i)} + \frac{\lambda}{m}\,w_j\right] \quad \text{for } j = 1, \ldots, n$$

which has the same form as in linear regression, but with the sigmoid hypothesis $h_w(x)$.
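A minimal sketch of this update rule in numpy, trained on a made-up one-feature dataset; alpha, lam, and the iteration count are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_gradient_descent(X, y, alpha=0.5, lam=0.1, n_iters=2000):
    m, n = X.shape
    w = np.zeros(n)
    for _ in range(n_iters):
        grad = X.T @ (sigmoid(X @ w) - y) / m   # same form as the linear case
        grad[1:] += (lam / m) * w[1:]           # add the L2 term, bias excluded
        w -= alpha * grad
    return w

# Tiny made-up example: the label is 1 when the single feature exceeds 2.5.
X = np.column_stack([np.ones(6), np.arange(6.0)])
y = np.array([0, 0, 0, 1, 1, 1])
w = logistic_gradient_descent(X, y)
print(w, sigmoid(X @ w).round(2))
```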