Summary

  • Problem of overfitting
  • Linear regression with regularisation

General Idea

When doing machine learning, the target is a good hypothesis, where "good" means a hypothesis that minimises the weighted sum of the empirical loss and the complexity of the hypothesis, effectively:

$$\hat{h} = \arg\min_{h \in \mathcal{H}} \left[\text{EmpLoss}(h) + \lambda \, \text{Complexity}(h)\right]$$

A complex model fits all data points, including the noise in the data, so the learned model generalises badly to unseen data because it does not capture the underlying ground truth.

Different Views to Overfitting

Considering a model such as the polynomial $h_w(x) = w_0 + w_1 x + w_2 x^2 + w_3 x^3 + w_4 x^4$:

  • the larger the weights (especially those on the high-order terms, such as $w_3$ and $w_4$) are, the more the model overfits (see the sketch after this list)
  • a solution to this would be to keep the weights small during optimisation
  • put a cost on having large weights
    • i.e. modify the loss function so that it includes a cost for having large weights
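As a quick illustration of the first point, here is a minimal numpy sketch (the data, polynomial degrees, and random seed are all made up for illustration): it fits the same few noisy points with a low-order and a high-order polynomial, and the high-order fit, which chases the noise, typically ends up with much larger weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# A few noisy samples of an underlying quadratic ground truth (made-up data).
x = np.linspace(-1, 1, 8)
y = 1.0 + 2.0 * x - 1.5 * x**2 + rng.normal(scale=0.1, size=x.shape)

for degree in (2, 7):
    # Design matrix with columns [1, x, x^2, ..., x^degree].
    X = np.vander(x, degree + 1, increasing=True)
    # Unregularised least-squares fit; degree 7 interpolates all 8 points,
    # noise included, which tends to inflate its weights compared to degree 2.
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    print(f"degree {degree}: weights {np.round(w, 2)}")
```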

Solution

Regularization

Explicitly penalising complex hypotheses: look for functions that are more regular.

(Figures: an overfit logistic regression and an overfit linear regression.)

There are two ways to address overfitting:

  • Reduce the number of features in the polynomial (feature selection)
  • Reduce the magnitudes of the weights while keeping all of the features (regularization)

Linear Regression with Regularization

Given the hypothesis

$$h_w(x) = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_n x_n = w^{T} x,$$

the cost function with L2 regularization is:

$$J(w) = \frac{1}{2m}\left[\sum_{i=1}^{m}\left(h_w(x^{(i)}) - y^{(i)}\right)^{2} + \lambda \sum_{j=1}^{n} w_j^{2}\right]$$
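A minimal numpy sketch of this cost, assuming the design matrix X already carries a leading column of ones for the bias and that $w_0$ is excluded from the penalty (the name ridge_cost is just a placeholder):

```python
import numpy as np

def ridge_cost(w, X, y, lam):
    """L2-regularised squared-error cost J(w); X has a leading column of ones."""
    m = len(y)
    residuals = X @ w - y
    penalty = lam * np.sum(w[1:] ** 2)     # bias weight w_0 is not penalised
    return (residuals @ residuals + penalty) / (2 * m)
```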

Penalty Function

In this scenario, the added term

$$\lambda \sum_{j=1}^{n} w_j^{2}$$

is the L2 penalty function (the sum of squared weights), with $\lambda$ the penalty strength (regularization parameter); linear regression with this penalty is also known as ridge regression.

The L1 penalty function (lasso regression)

The L1 penalty function is the sum of absolute weights:

$$\lambda \sum_{j=1}^{n} \lvert w_j \rvert$$
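For comparison, both penalty terms in numpy (the helper names are mine; the bias weight is skipped, matching the convention used in the rest of these notes):

```python
import numpy as np

def l2_penalty(w, lam):
    # Ridge: lambda times the sum of squared weights (bias w_0 excluded).
    return lam * np.sum(w[1:] ** 2)

def l1_penalty(w, lam):
    # Lasso: lambda times the sum of absolute weights (bias w_0 excluded).
    return lam * np.sum(np.abs(w[1:]))
```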

Thus, the new optimisation goal becomes

$$\min_{w} \; \frac{1}{2m}\left[\sum_{i=1}^{m}\left(h_w(x^{(i)}) - y^{(i)}\right)^{2} + \lambda \sum_{j=1}^{n} w_j^{2}\right]$$

(Figures: a fit with no regularization at all vs. a fit with a high regularization parameter.)

Thus, we can take the hypothesis as:

$$h_w(x) = w^{T} x$$

with the cost function:

$$J(w) = \frac{1}{2m}\left[\sum_{i=1}^{m}\left(h_w(x^{(i)}) - y^{(i)}\right)^{2} + \lambda \sum_{j=1}^{n} w_j^{2}\right]$$

and the gradient descent update rule then becomes (for $j = 1, \dots, n$):

$$w_j \leftarrow w_j - \gamma\left[\frac{1}{m}\sum_{i=1}^{m}\left(h_w(x^{(i)}) - y^{(i)}\right)x_j^{(i)} + \frac{\lambda}{m}\,w_j\right]$$

Gradient Descent

In this scenario, the optimisation goal is still $\min_{w} J(w)$. Gradient descent repeats the following updates:

$$w_0 \leftarrow w_0 - \gamma\,\frac{1}{m}\sum_{i=1}^{m}\left(h_w(x^{(i)}) - y^{(i)}\right)x_0^{(i)}$$

$$w_j \leftarrow w_j - \gamma\left[\frac{1}{m}\sum_{i=1}^{m}\left(h_w(x^{(i)}) - y^{(i)}\right)x_j^{(i)} + \frac{\lambda}{m}\,w_j\right] \qquad (j = 1, \dots, n)$$

The generalised update rule can be read as:

$$w_j \leftarrow w_j\left(1 - \gamma\frac{\lambda}{m}\right) - \gamma\,\frac{1}{m}\sum_{i=1}^{m}\left(h_w(x^{(i)}) - y^{(i)}\right)x_j^{(i)}$$

Note that the parameters shrink slightly at every step: the factor $\left(1 - \gamma\frac{\lambda}{m}\right)$, which is smaller than 1, replaces the coefficient 1 on $w_j$ in the original weight update rule, so the weights are consistently pulled towards zero.
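A minimal sketch of this loop in numpy, under the same assumptions as before (leading column of ones, bias not shrunk); the function name, learning rate, and iteration count are placeholders:

```python
import numpy as np

def gradient_descent_ridge(X, y, lam=1.0, gamma=0.1, n_iters=1000):
    """Batch gradient descent for L2-regularised linear regression."""
    m, n_plus_1 = X.shape
    w = np.zeros(n_plus_1)
    for _ in range(n_iters):
        grad = X.T @ (X @ w - y) / m       # gradient of the squared-error term
        grad[1:] += (lam / m) * w[1:]      # add (lambda/m) * w_j for j >= 1 only
        w -= gamma * grad                  # w_j <- w_j(1 - gamma*lambda/m) - gamma * ...
    return w
```

Rearranging the last two lines of the loop for $j \geq 1$ gives exactly the shrinking update $w_j \leftarrow w_j\left(1 - \gamma\frac{\lambda}{m}\right) - \gamma\,\frac{1}{m}\sum_i\left(h_w(x^{(i)}) - y^{(i)}\right)x_j^{(i)}$ described above.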

Normal Equation

With the loss function

$$J(w) = \frac{1}{2m}\left[\sum_{i=1}^{m}\left(h_w(x^{(i)}) - y^{(i)}\right)^{2} + \lambda \sum_{j=1}^{n} w_j^{2}\right],$$

we can represent the solution in matrix form:

$$w = \left(X^{T}X + \lambda R\right)^{-1} X^{T} y$$

where $R$ is a diagonal matrix:

$$R = \begin{bmatrix} 0 & & & \\ & 1 & & \\ & & \ddots & \\ & & & 1 \end{bmatrix}$$

i.e. the identity matrix without the 1 in its first entry, so that the bias term $w_0$ is not regularised.

Adding $\lambda R$ ensures that $X^{T}X + \lambda R$ is always invertible provided $\lambda > 0$, making the solution always computable.

Why is $(X^{T}X + \lambda R)$ always invertible?

  • $X^{T}X$ is symmetric and positive semi-definite (all eigenvalues are $\geq 0$, since $z^{T}X^{T}Xz = \lVert Xz \rVert^{2} \geq 0$ for any $z$)
  • $\lambda R$ is symmetric and positive semi-definite, and strictly positive on every weight except the bias; the bias direction is covered by the all-ones column of $X$, which makes $X^{T}X$ strictly positive there
  • the sum $X^{T}X + \lambda R$ is therefore positive definite, and a positive definite matrix is always invertible.

Thus, the normal equation with regularisation works regardless of whether there is linear dependency among the features or an insufficient number of observations.
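A minimal numpy sketch of this closed form (the function name is a placeholder); np.linalg.solve is used instead of forming an explicit inverse, which is the numerically safer choice:

```python
import numpy as np

def ridge_normal_equation(X, y, lam=1.0):
    """Closed-form solution w = (X^T X + lam * R)^{-1} X^T y."""
    n_plus_1 = X.shape[1]
    R = np.eye(n_plus_1)
    R[0, 0] = 0.0                          # do not regularise the bias term w_0
    return np.linalg.solve(X.T @ X + lam * R, X.T @ y)
```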

Logistic Regression with Regularization

The hypothesis of a logistic regression model is:

$$h_w(x) = \frac{1}{1 + e^{-w^{T}x}}$$

with the cost function (given that $y \in \{0, 1\}$):

$$J(w) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log h_w(x^{(i)}) + \left(1 - y^{(i)}\right)\log\left(1 - h_w(x^{(i)})\right)\right] + \frac{\lambda}{2m}\sum_{j=1}^{n} w_j^{2}$$

and the gradient descent update rule (for $j = 1, \dots, n$):

$$w_j \leftarrow w_j - \gamma\left[\frac{1}{m}\sum_{i=1}^{m}\left(h_w(x^{(i)}) - y^{(i)}\right)x_j^{(i)} + \frac{\lambda}{m}\,w_j\right]$$
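A minimal sketch of one such update in numpy, assuming the same data layout as in the linear case (leading column of ones, bias not penalised); the helper names and default values are placeholders:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_gradient_step(w, X, y, lam=1.0, gamma=0.1):
    """One gradient-descent update for L2-regularised logistic regression."""
    m = len(y)
    grad = X.T @ (sigmoid(X @ w) - y) / m  # same form as linear regression,
    grad[1:] += (lam / m) * w[1:]          # but with the sigmoid hypothesis
    return w - gamma * grad
```

The update has exactly the same algebraic form as in the linear case; only the hypothesis inside the gradient changes to the sigmoid.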