Perceptron

The perceptron takes weighted inputs, sums them, and passes the sum through a sign function to generate an output.

Given an input vector $\mathbf{x}$ of dimension $d$, the perceptron model is defined as:

$$\hat{y} = \text{sign}(\mathbf{w}^\top \mathbf{x} + b)$$

where $\text{sign}$ is the sign function:

$$\text{sign}(z) = \begin{cases} +1 & \text{if } z \geq 0 \\ -1 & \text{otherwise} \end{cases}$$

Learning Algorithm

  • Initialise weights $\mathbf{w}$ (with the bias folded in as an extra weight)
  • Loop (until convergence/max steps)
    • For each instance $(\mathbf{x}_i, y_i)$, classify $\hat{y}_i = \text{sign}(\mathbf{w}^\top \mathbf{x}_i)$
    • Select a misclassified instance $(\mathbf{x}_j, y_j)$
    • Update weights: $\mathbf{w} \leftarrow \mathbf{w} + \eta\, y_j\, \mathbf{x}_j$
      • $\eta$ is the learning rate

If the data is not linearly separable, the algorithm will not converge.
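
A minimal sketch of this loop in Python, assuming labels in $\{-1, +1\}$ and the bias folded into $\mathbf{w}$ as an extra input fixed at 1; the function and variable names are illustrative, not from these notes:

```python
import numpy as np

def train_perceptron(X, y, lr=1.0, max_steps=1000):
    """Perceptron learning: X has shape (n, d), y contains labels in {-1, +1}."""
    n, d = X.shape
    Xb = np.hstack([X, np.ones((n, 1))])   # fold the bias in as a constant input
    w = np.zeros(d + 1)                    # initialise weights
    for _ in range(max_steps):
        preds = np.where(Xb @ w >= 0, 1, -1)       # classify every instance
        misclassified = np.where(preds != y)[0]
        if len(misclassified) == 0:                # converged: everything correct
            return w
        j = misclassified[0]                       # select a misclassified instance
        w = w + lr * y[j] * Xb[j]                  # update weights
    return w                                       # may never converge if not separable

# Example: learn the OR function (with 0 mapped to -1)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([-1, 1, 1, 1])
w = train_perceptron(X, y)
print(np.where(np.hstack([X, np.ones((4, 1))]) @ w >= 0, 1, -1))  # [-1  1  1  1]
```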

Why does the learning algorithm work?

Consider the two ways a misclassification can occur:

Case 1: Positive predicted as negative

The current calculation obtained is:

$$\mathbf{w}^\top \mathbf{x} < 0$$

What we require is:

$$\mathbf{w}^\top \mathbf{x} \geq 0$$

Thus, to “fix” our model, we have to increase $\mathbf{w}^\top \mathbf{x}$. This can be done by adding $\eta\mathbf{x}$ to $\mathbf{w}$, since

$$(\mathbf{w} + \eta\mathbf{x})^\top \mathbf{x} = \mathbf{w}^\top \mathbf{x} + \eta\|\mathbf{x}\|^2 > \mathbf{w}^\top \mathbf{x}$$

Case 2: Negative predicted as positive

The current calculation obtained is:

$$\mathbf{w}^\top \mathbf{x} \geq 0$$

What we require is:

$$\mathbf{w}^\top \mathbf{x} < 0$$

Thus, to “fix” our model, we have to decrease $\mathbf{w}^\top \mathbf{x}$. This can be done by subtracting $\eta\mathbf{x}$ from $\mathbf{w}$, since

$$(\mathbf{w} - \eta\mathbf{x})^\top \mathbf{x} = \mathbf{w}^\top \mathbf{x} - \eta\|\mathbf{x}\|^2 < \mathbf{w}^\top \mathbf{x}$$

Both cases are captured by the single update rule $\mathbf{w} \leftarrow \mathbf{w} + \eta\, y\, \mathbf{x}$, with $y \in \{+1, -1\}$.

Neuron

Neuron

A generalised version of the perceptron - the building block of neural networks.

Sign function

This function is seen in the perceptron model.

Sigmoid function

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

This function is used to convert a linear regression model into a logistic regression model: it maps the real-valued regression output into $(0, 1)$ so it can be used as a classifier.

tanh

$$\tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$$

ReLU

$$\text{ReLU}(z) = \max(0, z)$$

Leaky ReLU

$$\text{LeakyReLU}(z) = \max(\alpha z, z) \quad \text{for a small } \alpha \text{ (e.g. } 0.01\text{)}$$

Maxout

$$\text{maxout}(\mathbf{x}) = \max_i\,(\mathbf{w}_i^\top \mathbf{x} + b_i)$$

ELU

$$\text{ELU}(z) = \begin{cases} z & z \geq 0 \\ \alpha(e^{z} - 1) & z < 0 \end{cases}$$
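
A quick numpy sketch of these activations (the $\alpha$ defaults and the form of maxout are illustrative choices, not fixed by these notes):

```python
import numpy as np

def sign(z):
    return np.where(z >= 0, 1.0, -1.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    return np.where(z >= 0, z, alpha * z)

def elu(z, alpha=1.0):
    return np.where(z >= 0, z, alpha * (np.exp(z) - 1))

def maxout(x, W, b):
    # maxout returns the largest of several linear functions of the input
    return np.max(W @ x + b)

z = np.linspace(-3, 3, 7)
print(sigmoid(z))      # smooth values in (0, 1)
print(relu(z))         # 0 for negative inputs, identity for positive ones
print(np.tanh(z))      # tanh is available directly in numpy
```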

Neural network

(Diagram: a neural network in which each layer of neurons computes a weighted sum of its inputs plus a bias, applies an activation, and passes the result to the next layer until the outputs are produced.)

Single-Layer

We can use a single layer of neurons to simulate simple boolean functions.

For example, given an OR function, we have the following inputs and outputs:

| $x_1$ | $x_2$ | OR |
| --- | --- | --- |
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 1 |

We can derive the relevant weights by considering the model:

$$\hat{y} = \begin{cases} 1 & \text{if } w_1 x_1 + w_2 x_2 + b \geq 0 \\ 0 & \text{otherwise} \end{cases}$$

Thus, we can get the following inequalities from the inputs:

$$b < 0, \qquad w_2 + b \geq 0, \qquad w_1 + b \geq 0, \qquad w_1 + w_2 + b \geq 0$$

We can then derive a set of weights that satisfies these criteria, e.g. $w_1 = w_2 = 1$, $b = -0.5$.
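
A tiny check in Python that this choice of weights ($w_1 = w_2 = 1$, $b = -0.5$, one of many valid solutions) reproduces the OR truth table:

```python
import numpy as np

w1, w2, b = 1.0, 1.0, -0.5            # weights satisfying all four inequalities
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
out = (w1 * X[:, 0] + w2 * X[:, 1] + b >= 0).astype(int)
print(out)                            # [0 1 1 1], matching the OR column above
```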

Multi-Layer

However, some boolean functions are not linearly separable, like XNOR.

We can then model these functions by using multiple layers of neurons - for example:

  • XNOR = OR(AND, NOR): one hidden neuron computes AND, another computes NOR, and the output neuron ORs the two hidden values (see the sketch below)

(Diagram: the inputs feed AND and NOR hidden neurons, which feed a single output neuron.)
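
A sketch of this two-layer construction in Python, with hand-picked weights (one of many valid choices) for the AND, NOR, and OR neurons:

```python
def step(z):
    # threshold activation: 1 if z >= 0, else 0
    return int(z >= 0)

def xnor(x1, x2):
    h_and = step(x1 + x2 - 1.5)        # hidden neuron computing AND(x1, x2)
    h_nor = step(-x1 - x2 + 0.5)       # hidden neuron computing NOR(x1, x2)
    return step(h_and + h_nor - 0.5)   # output neuron: OR of the two hidden values

for a in (0, 1):
    for b in (0, 1):
        print(a, b, xnor(a, b))        # prints 1 only when a == b
```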

Neural network vs Logistic/linear regression model

Logistic/linear regression relies on manual feature engineering to capture complex patterns, while a multi-layer neural network learns its own feature representations through its hidden layers and non-linear activations.

For example, the XNOR model can have hidden layers to simulate the NOR and AND layers, while feature engineering would be needed to capture this pattern in the logistic regression model (e.g. a new feature $x_1 x_2$ derived from $x_1, x_2$).

Forward Propagation

Forward propagation

Process in a neural network where the input data is passed through the network’s layers to generate an output.

Forward propagation is used to make predictions.

(Diagram: the input is passed through successive layers with weights $W^{[1]}, \dots, W^{[L]}$ during forward propagation.)

Matrix multiplication can be used to get the outputs here. For example, imagining the model above with a single layer of weights $W^{[1]}$ (and no other layers):

$$\hat{\mathbf{y}} = g(W^{[1]}\mathbf{x} + \mathbf{b}^{[1]})$$

where each row of $W^{[1]}$ holds the weights of one neuron and $g$ is the activation applied element-wise.
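
A minimal numpy forward pass for a network with one hidden layer; the layer sizes, random weights, and sigmoid activation are arbitrary choices for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)   # layer 1: 3 inputs -> 4 hidden units
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)   # layer 2: 4 hidden units -> 2 outputs

x = np.array([0.5, -1.0, 2.0])                  # a single input vector

a1 = sigmoid(W1 @ x + b1)                       # hidden layer activations
y_hat = sigmoid(W2 @ a1 + b2)                   # network output
print(y_hat)
```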

Multi-class classification

Multi-class classification can be done with a neural network by using one output neuron per class.

(Diagram: an output layer producing a prediction for class 1, class 2, …, class $c$, followed by a softmax activation.)

Given a vector $\mathbf{z} = (z_1, \dots, z_c)$, the softmax function computes the output for each class $i$ as:

$$\text{softmax}(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j=1}^{c} e^{z_j}}$$

where $c$ is the number of classes; the outputs are non-negative and sum to 1, so they can be read as class probabilities.
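
A small numpy sketch of the softmax computation (the max-subtraction is a standard numerical-stability trick, not part of the formula above):

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)                  # shift for numerical stability; result unchanged
    e = np.exp(z)
    return e / np.sum(e)

scores = np.array([2.0, 1.0, 0.1])     # raw outputs, one per class
probs = softmax(scores)
print(probs, probs.sum())               # non-negative values summing to 1
```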

Gradient Descent

Chain rule

Given a composition of functions, we can compute a derivative by multiplying the derivatives of the parts. For example, given:

$$y = f(u), \qquad u = g(x)$$

$y$ can be differentiated with respect to $x$:

$$\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}$$
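
As a quick worked example (with an arbitrary choice of functions), take $y = u^2$ and $u = 3x + 1$:

$$\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx} = 2u \cdot 3 = 6(3x + 1)$$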

Multiple input

Given an equation,

$$y = f\big(g(x),\, h(x)\big)$$

we can still get the derivative of $y$ w.r.t. $x$ by first introducing the intermediate variables:

$$u = g(x), \qquad v = h(x)$$

and then get the derivative by summing over both paths from $x$ to $y$:

$$\frac{dy}{dx} = \frac{\partial y}{\partial u} \cdot \frac{du}{dx} + \frac{\partial y}{\partial v} \cdot \frac{dv}{dx}$$
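
For instance (again with arbitrarily chosen functions), take $y = uv$ with $u = x^2$ and $v = e^{x}$:

$$\frac{dy}{dx} = \frac{\partial y}{\partial u}\frac{du}{dx} + \frac{\partial y}{\partial v}\frac{dv}{dx} = v \cdot 2x + u \cdot e^{x} = e^{x}(2x + x^2)$$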

Gradient computation

Singular neuron

Given a singular neuron, we can

  • generate the predicted value for a given data point
  • define the loss function
  • differentiate using chain rule

For example, given the sigmoid activation function $g(z) = \sigma(z)$ and the squared loss, we can define:

$$z = \mathbf{w}^\top \mathbf{x} + b, \qquad \hat{y} = \sigma(z), \qquad L = \tfrac{1}{2}(\hat{y} - y)^2$$

and then compute the gradient of the loss function using the chain rule:

$$\frac{\partial L}{\partial \mathbf{w}} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} \cdot \frac{\partial z}{\partial \mathbf{w}} = (\hat{y} - y)\,\sigma(z)\big(1 - \sigma(z)\big)\,\mathbf{x}$$
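
A numpy sketch of this gradient for a single sigmoid neuron with squared loss (the data point, label, and weights are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([1.0, 2.0])     # a single data point
y = 1.0                      # its label
w = np.array([0.1, -0.2])    # current weights
b = 0.0                      # current bias

z = w @ x + b
y_hat = sigmoid(z)
loss = 0.5 * (y_hat - y) ** 2

# Chain rule: dL/dw = dL/dy_hat * dy_hat/dz * dz/dw
dL_dyhat = y_hat - y
dyhat_dz = sigmoid(z) * (1 - sigmoid(z))
grad_w = dL_dyhat * dyhat_dz * x
grad_b = dL_dyhat * dyhat_dz
print(loss, grad_w, grad_b)
```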

Backpropagation

Used to compute the gradient of the loss function with respect to each weight.

(Diagram: a small network taking inputs $x_1, x_2$. The forward pass computes $z_1, z_2$, the activations $a_1 = g(z_1)$, $a_2 = g(z_2)$, and combines them via weights $w_3, w_4$ into the output $y$. The backward pass computes $dL/dy$, then $dL/da_1, dL/da_2$, $dL/dz_1, dL/dz_2$, and finally the derivatives with respect to the weights and inputs, reusing $w_1, w_2, w_3, w_4$ and the activation derivatives $g'(z_1), g'(z_2)$.)

Thus, to find the gradient for a weight such as $w_1$, we can consider the derivative of the loss with respect to that weight:

$$\frac{\partial L}{\partial w_1} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial a_1} \cdot \frac{\partial a_1}{\partial z_1} \cdot \frac{\partial z_1}{\partial w_1}$$

A forward pass gets us all the intermediary results as seen above, and a backward pass gets us all the intermediary derivatives of the loss with respect to each of those intermediary results.
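
A minimal sketch of one forward and one backward pass for a small network like the one in the diagram; the sigmoid activation, squared loss, and weight values are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([1.0, 0.5])              # inputs x1, x2
y = 1.0                               # target
W1 = np.array([[0.1, 0.2],            # weights into the two hidden units
               [-0.3, 0.4]])
w2 = np.array([0.5, -0.6])            # weights from hidden units to the output

# Forward pass: store every intermediate result
z = W1 @ x                            # z1, z2
a = sigmoid(z)                        # a1, a2
zb = w2 @ a                           # pre-activation of the output
y_hat = sigmoid(zb)
loss = 0.5 * (y_hat - y) ** 2

# Backward pass: intermediate derivatives of the loss
dL_dyhat = y_hat - y
dL_dzb = dL_dyhat * y_hat * (1 - y_hat)
dL_da = dL_dzb * w2                   # dL/da1, dL/da2
dL_dz = dL_da * a * (1 - a)           # dL/dz1, dL/dz2
dL_dW1 = np.outer(dL_dz, x)           # gradients for the first-layer weights
dL_dw2 = dL_dzb * a                   # gradients for the output weights
print(dL_dW1, dL_dw2)
```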

Issues

Overfitting

Dropout

Dropout prevents overfitting by randomly setting some neuron outputs to 0 during training. This stops the network from simply memorising the patterns seen in the training data, since no single neuron can be relied on.
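
A sketch of (inverted) dropout applied to a layer's activations during training; the keep probability and activations are arbitrary:

```python
import numpy as np

def dropout(a, keep_prob=0.8, training=True, rng=np.random.default_rng(0)):
    """Randomly zero out activations during training; scale so the expected value is unchanged."""
    if not training:
        return a                                   # no dropout at prediction time
    mask = rng.random(a.shape) < keep_prob         # keep each unit with probability keep_prob
    return a * mask / keep_prob

a = np.array([0.2, 0.9, 0.4, 0.7])
print(dropout(a))      # some entries set to 0, the rest scaled up
```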

Early Stopping

While training, monitor the loss on a validation set and stop training when the validation loss stops decreasing (i.e. is at its minimum), even if the training loss is still going down.
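
A sketch of the early-stopping decision applied to a made-up sequence of validation losses; the loss values and the patience parameter are illustrative:

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the epoch with the best validation loss, stopping once it fails to improve."""
    best, best_epoch, waited = float("inf"), 0, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch, waited = loss, epoch, 0   # new best: reset the counter
        else:
            waited += 1
            if waited >= patience:                      # no improvement for `patience` epochs
                break
    return best_epoch

# Simulated validation losses: they improve, then start rising (overfitting)
val_losses = [0.9, 0.7, 0.55, 0.5, 0.48, 0.49, 0.52, 0.56, 0.6, 0.66]
print(early_stopping_epoch(val_losses))   # 4: the epoch where validation loss was lowest
```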

Vanishing/Exploding Gradient

Vanishing gradient

Small gradients are multiplied together repeatedly during backpropagation until they become (effectively) zero, so the early layers stop learning.

Solution

Change activation functions, e.g. use ReLU instead of sigmoid/tanh, whose gradients saturate.

Exploding gradient

Large gradients are multiplied together repeatedly until they overflow, making training unstable.

Solution

Clip gradients to a fixed range or maximum norm.
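
A sketch of clipping gradients, either element-wise to a range or by rescaling to a maximum norm; the thresholds are arbitrary:

```python
import numpy as np

def clip_by_value(grad, limit=1.0):
    """Clip each gradient entry into [-limit, limit]."""
    return np.clip(grad, -limit, limit)

def clip_by_norm(grad, max_norm=1.0):
    """Rescale the whole gradient vector if its norm exceeds max_norm."""
    norm = np.linalg.norm(grad)
    return grad if norm <= max_norm else grad * (max_norm / norm)

g = np.array([3.0, -4.0])                 # an exploding gradient (norm 5)
print(clip_by_value(g))                   # [ 1. -1.]
print(clip_by_norm(g))                    # [ 0.6 -0.8], norm 1
```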