Motivation

Decision trees work well with discrete/categorical inputs that take few distinct values. However, they do not work well when there are many continuous inputs.

Logistic regression outputs a continuous value, interpreted as a probability, which is then used for classification.

Thus, to utilise logistic regression as the classifier, we have to determine the step function that identifies which points should be given which classification, i.e. a function which returns the value $0$ or $1$ for a binary classifier.

[Figure: naive threshold (not differentiable, completely confident predictions) vs. logistic function (differentiable, softer boundaries)]

However, there are multiple issues related to the naive threshold.

  • The hypothesis is not differentiable and is a discontinuous function, which makes gradient-based optimisation impossible.
  • The classifier always announces a completely confident prediction, even for examples close to the boundary.

Softening the threshold function, we can approximate the hard threshold with a continuous, differentiable function that is similar in shape. The logistic function, also known as the sigmoid function, is:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

which effectively makes the hypothesis function

$$h_{\mathbf{w}}(\mathbf{x}) = \sigma(\mathbf{w}^\top \mathbf{x}) = \frac{1}{1 + e^{-\mathbf{w}^\top \mathbf{x}}}$$

The sigmoid function is differentiable, with derivative:

$$\sigma'(z) = \sigma(z)\,(1 - \sigma(z))$$
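
As a quick check of these formulas, here is a minimal NumPy sketch (the function names are my own):

import numpy as np

def sigmoid(z):
    # logistic function: maps any real value into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    # sigma'(z) = sigma(z) * (1 - sigma(z))
    s = sigmoid(z)
    return s * (1.0 - s)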

Measuring Fit

For linear regression, the MSE (mean squared error) loss function was used:

$$L(\mathbf{w}) = \frac{1}{N} \sum_{i=1}^{N} \left(y_i - h_{\mathbf{w}}(\mathbf{x}_i)\right)^2$$

Using the MSE loss function for logistic regression,

$$L(\mathbf{w}) = \frac{1}{N} \sum_{i=1}^{N} \left(y_i - \frac{1}{1 + e^{-\mathbf{w}^\top \mathbf{x}_i}}\right)^2$$

Note that the exponential term inside the sigmoid makes the loss non-linear in $\mathbf{w}$, and in fact non-convex. Thus, if we use the same gradient descent procedure to minimise it, we can get stuck in a local minimum.

Cross-Entropy

The cross-entropy for $K$ classes gives the value:

$$H(\mathbf{y}, \hat{\mathbf{y}}) = -\sum_{k=1}^{K} y_k \log \hat{y}_k$$

where $y_k$ refers to the true value, and $\hat{y}_k$ refers to the predicted value.

Building on from that, the binary cross-entropy (cross-entropy on 2 classes) can then be calculated:

$$H(y, \hat{y}) = -\left(y \log \hat{y} + (1 - y) \log(1 - \hat{y})\right)$$

Thus, the binary cross-entropy loss can be computed:

$$L(\mathbf{w}) = -\frac{1}{N} \sum_{i=1}^{N} \left[y_i \log h_{\mathbf{w}}(\mathbf{x}_i) + (1 - y_i) \log\left(1 - h_{\mathbf{w}}(\mathbf{x}_i)\right)\right]$$

In matrix form, this can be written:

$$L(\mathbf{w}) = -\frac{1}{N} \left[\mathbf{y}^\top \log \hat{\mathbf{y}} + (\mathbf{1} - \mathbf{y})^\top \log(\mathbf{1} - \hat{\mathbf{y}})\right], \qquad \hat{\mathbf{y}} = \sigma(X\mathbf{w})$$
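
A minimal NumPy sketch of this loss, assuming y and y_hat are arrays of true labels and predicted probabilities (names are illustrative):

import numpy as np

def bce_loss(y, y_hat, eps=1e-12):
    # clip predictions away from 0 and 1 so that log() stays finite
    y_hat = np.clip(y_hat, eps, 1.0 - eps)
    return -np.mean(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))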

Logistic Regression with Many Attributes

Given $n$ features, the hypothesis becomes:

$$h_{\mathbf{w}}(\mathbf{x}) = \sigma(\mathbf{w}^\top \mathbf{x}) = \frac{1}{1 + e^{-(w_0 + w_1 x_1 + \dots + w_n x_n)}}$$

where $\mathbf{w} = (w_0, w_1, \dots, w_n)^\top$ and $\mathbf{x} = (1, x_1, \dots, x_n)^\top$, with the weight update:

$$w_j \leftarrow w_j + \gamma \left(y_i - h_{\mathbf{w}}(\mathbf{x}_i)\right) x_{ij}$$

where $\gamma$ is the learning rate.
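
Putting the hypothesis and the weight update together, a batch gradient descent sketch might look like the following, assuming the sigmoid function from earlier, a design matrix X of shape (N, n+1) whose first column is all ones, and labels y in {0, 1} (all names are my own):

import numpy as np

def train_logistic(X, y, gamma=0.1, epochs=1000):
    # one weight per column of X, including the bias column
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        y_hat = sigmoid(X @ w)                    # predictions for every example
        w += gamma * X.T @ (y - y_hat) / len(y)   # averaged weight update from above
    return w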

When dealing with non-linear decision boundaries, first consider the general form of a linear regression,

$$h_{\mathbf{w}}(\mathbf{x}) = w_0 + w_1 f_1(\mathbf{x}) + \dots + w_m f_m(\mathbf{x})$$

where $f_j(\mathbf{x})$ refers to a transformed feature, e.g. a square or product of the original features such as $x_1^2$ or $x_1 x_2$.

Thus, for logistic regression the hypothesis becomes

$$h_{\mathbf{w}}(\mathbf{x}) = \sigma\left(w_0 + w_1 f_1(\mathbf{x}) + \dots + w_m f_m(\mathbf{x})\right)$$

and the decision boundary is the (possibly non-linear) set of points where $w_0 + w_1 f_1(\mathbf{x}) + \dots + w_m f_m(\mathbf{x}) = 0$.
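
One hedged sketch of such a transform, assuming X already has a leading column of ones followed by two raw features, and reusing the hypothetical train_logistic from earlier:

import numpy as np

def quadratic_features(X):
    # [1, x1, x2] -> [1, x1, x2, x1^2, x2^2, x1*x2]
    x1, x2 = X[:, 1], X[:, 2]
    return np.column_stack([X, x1 ** 2, x2 ** 2, x1 * x2])

# the linear machinery is unchanged; only the features are richer
# w = train_logistic(quadratic_features(X), y)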

Multi-class Classification

This section describes how classification is done when there are more than two classes.

One vs All

In this method, one binary classifier is fit per class (that class vs. all the others). The probability returned by each hypothesis is computed for all of the classes, and the class with the highest probability is assigned to the prediction.

[Figure: One vs All, each hypothesis outputs a probability and the class with the highest probability is chosen]
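
A minimal sketch of One vs All, reusing the hypothetical train_logistic and sigmoid from earlier (all names are illustrative):

def one_vs_all(X, y, classes):
    # fit one binary classifier per class: class k vs. everything else
    return {k: train_logistic(X, (y == k).astype(float)) for k in classes}

def predict_ova(models, x):
    # assign the class whose classifier outputs the highest probability
    return max(models, key=lambda k: sigmoid(x @ models[k]))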

One vs One

Alternatively, a classifier is fit for every pair of classes ($K(K-1)/2$ classifiers for $K$ classes), and the class with the most pairwise wins is chosen.

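A matching sketch of One vs One under the same assumptions:

from itertools import combinations
from collections import Counter

def one_vs_one(X, y, classes):
    # fit one classifier per unordered pair, on just that pair's examples
    models = {}
    for a, b in combinations(classes, 2):
        mask = (y == a) | (y == b)
        models[(a, b)] = train_logistic(X[mask], (y[mask] == a).astype(float))
    return models

def predict_ovo(models, x):
    # each pairwise classifier casts one vote; the most-voted class wins
    votes = Counter(a if sigmoid(x @ w) >= 0.5 else b for (a, b), w in models.items())
    return votes.most_common(1)[0][0]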

Performance Measure

TPR/FPR

Referring back to the confusion matrix:

                      Actual positive                   Actual negative
Predicted positive    True Positive                     False Positive (Type I error)
Predicted negative    False Negative (Type II error)    True Negative

The True Positive Rate and the False Positive Rate are calculated simply as

$$\text{TPR} = \frac{TP}{TP + FN}, \qquad \text{FPR} = \frac{FP}{FP + TN}$$
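
As a tiny sketch straight from these definitions, using raw confusion-matrix counts (argument names are my own):

def tpr_fpr(tp, fp, fn, tn):
    # sensitivity and 1 - specificity
    return tp / (tp + fn), fp / (fp + tn)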

Receiver Operating Characteristic (ROC) Curve

[Figure: ROC curves of TPR (sensitivity) against FPR (1 - specificity), showing a random model along the diagonal, with good, perfect, and bad models above and below it]

The ROC curve plots the True Positive Rate against the False Positive Rate. The model is considered more accurate than random chance if its ROC curve lies above the diagonal random line.

For a more concise metric, consider the Area Under Curve (AUC) of ROC:

[Figure: ROC plot of TPR (sensitivity) against FPR (1 - specificity), comparing AUC = 0.5 with AUC > 0.5]

The AUC summarises the ROC curve in a single number, enabling clearer comparisons between models.

Interpretation

$\text{AUC} > 0.5$ means the model is better than chance. $\text{AUC} \approx 1$ means the model is very accurate.
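
In practice the ROC curve and AUC are rarely computed by hand; assuming scikit-learn is available and y_true / y_prob hold the true labels and predicted probabilities, a sketch:

from sklearn.metrics import roc_curve, roc_auc_score

fpr, tpr, thresholds = roc_curve(y_true, y_prob)  # points of the ROC curve
auc = roc_auc_score(y_true, y_prob)               # area under that curve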

Model Evaluation

To determine how good a model is, there are multiple questions to address:

  • which hyperparameters are picked
  • which features are picked
  • how the hypothesis is picked

The goodness of a model/hypothesis is measured as follows:

$$\text{Err}(h) = \frac{1}{N} \sum_{i=1}^{N} err\left(h(\mathbf{x}_i), y_i\right)$$

where $err$ refers to any error function such as $L_1$, $L_2$, or cross-entropy.

[Figure: the data is split into a training set, a validation set, and a test set; candidate models (Model 1, Model 2, ...) are trained on the training set, the model with minimum validation error is chosen, and that model is assessed with the error on the test set]

We cannot just use two sets of data and assess the model fully based on the error of the test set: if the test set is also used to choose the model, the reported test error is biased towards it and becomes over-optimistic.
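
One common way to produce the three sets, assuming scikit-learn and a 60/20/20 split (the proportions are illustrative):

from sklearn.model_selection import train_test_split

# hold out 20% as the test set, then 25% of the remainder (20% overall) as validation
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25)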

Bias and Variance

[Figure: dartboard illustration of the four combinations: low bias/low variance, high bias/low variance, low bias/high variance, high bias/high variance]

Bias

The difference between the estimator’s expected value and the true value of the parameter being estimated.

High bias can cause algorithms to miss relevant relations between features and target outputs, resulting in underfitting.

Variance

Error from sensitivity to small fluctuations in the training set.

High variance can result from the algorithm modelling random noise in the training data, resulting in overfitting.

Consider three models $h_1$, $h_2$, and $h_3$ of increasing complexity:

This might result in the following:

[Figure: $h_1$ underfits (high bias), $h_3$ overfits (high variance), while $h_2$ sits in between]

Hyperparameter Tuning

To find the best model:

# placeholder routines: train fits a model, evaluate returns the validation error
for hyperparameters in candidate_hyperparameters:
    model = train(hyperparameters, training_data)
    error = evaluate(model, validation_data)
# keep the model with the minimum validation error

The methods for hyperparameter tuning are:

  • grid search (exhaustively try all combinations of candidate hyperparameter values; see the sketch after this list)
  • random search (randomly search hyperparameters)
  • successive halving (start with all candidate hyperparameters on reduced resources, then successively increase the resources for a shrinking set of the best candidates)
  • Bayesian optimisation
  • evolutionary algorithms
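
As a concrete example of grid search, a hedged sketch using scikit-learn (the grid over the regularisation strength C is illustrative):

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# exhaustively tries every value in the grid with 5-fold cross-validation
search = GridSearchCV(LogisticRegression(), {"C": [0.01, 0.1, 1, 10]}, cv=5)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)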