Perceptron

The perceptron takes weighted inputs, sums them, and passes the sum through a sign function to generate an output.

Given an input vector $\mathbf{x}$ of dimension $d$, the perceptron model is defined as:

$$\hat{y} = \text{sign}(\mathbf{w}^\top \mathbf{x} + b)$$

where $\text{sign}$ is the sign function:

$$\text{sign}(z) = \begin{cases} +1 & \text{if } z \geq 0 \\ -1 & \text{otherwise} \end{cases}$$

Learning Algorithm

  • Initialise weights $\mathbf{w}$ (with the bias folded in as an extra weight)
  • Loop (until convergence/max steps)
    • For each instance $(\mathbf{x}_i, y_i)$, classify $\hat{y}_i = \text{sign}(\mathbf{w}^\top \mathbf{x}_i)$
    • Select a misclassified instance $(\mathbf{x}_j, y_j)$
    • Update weights: $\mathbf{w} \leftarrow \mathbf{w} + \eta\, y_j\, \mathbf{x}_j$
      • $\eta$ is the learning rate

If the data is not linearly separable, the algorithm will not converge.
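
A minimal sketch of this loop in Python, assuming labels in $\{-1, +1\}$ and the bias folded into $\mathbf{w}$ as an extra input fixed at 1; the function and variable names are illustrative, not from these notes:

```python
import numpy as np

def train_perceptron(X, y, lr=1.0, max_steps=1000):
    """Perceptron learning: X has shape (n, d), y contains labels in {-1, +1}."""
    n, d = X.shape
    Xb = np.hstack([X, np.ones((n, 1))])   # fold the bias in as a constant input
    w = np.zeros(d + 1)                    # initialise weights
    for _ in range(max_steps):
        preds = np.where(Xb @ w >= 0, 1, -1)       # classify every instance
        misclassified = np.where(preds != y)[0]
        if len(misclassified) == 0:                # converged: everything correct
            return w
        j = misclassified[0]                       # select a misclassified instance
        w = w + lr * y[j] * Xb[j]                  # update weights
    return w                                       # may never converge if not separable

# Example: learn the OR function (with 0 mapped to -1)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([-1, 1, 1, 1])
w = train_perceptron(X, y)
print(np.where(np.hstack([X, np.ones((4, 1))]) @ w >= 0, 1, -1))  # [-1  1  1  1]
```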

Why does the learning algorithm work?

Consider the two ways a misclassification can occur:

Case 1: Positive predicted as negative

The current calculation obtained is:

$$\mathbf{w}^\top \mathbf{x} < 0$$

What we require is:

$$\mathbf{w}^\top \mathbf{x} \geq 0$$

Thus, to “fix” our model, we have to increase $\mathbf{w}^\top \mathbf{x}$. This can be done by adding $\eta\mathbf{x}$ to $\mathbf{w}$, since

$$(\mathbf{w} + \eta\mathbf{x})^\top \mathbf{x} = \mathbf{w}^\top \mathbf{x} + \eta\|\mathbf{x}\|^2 > \mathbf{w}^\top \mathbf{x}$$

Case 2: Negative predicted as positive

The current calculation obtained is:

$$\mathbf{w}^\top \mathbf{x} \geq 0$$

What we require is:

$$\mathbf{w}^\top \mathbf{x} < 0$$

Thus, to “fix” our model, we have to decrease $\mathbf{w}^\top \mathbf{x}$. This can be done by subtracting $\eta\mathbf{x}$ from $\mathbf{w}$, since

$$(\mathbf{w} - \eta\mathbf{x})^\top \mathbf{x} = \mathbf{w}^\top \mathbf{x} - \eta\|\mathbf{x}\|^2 < \mathbf{w}^\top \mathbf{x}$$

Both cases are captured by the single update rule $\mathbf{w} \leftarrow \mathbf{w} + \eta\, y\, \mathbf{x}$, with $y \in \{+1, -1\}$.

Neuron

Neuron

A generalised version of the perceptron - the building block of neural networks.

Sign function

This function is seen in the perceptron model.

Sigmoid function

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

This function is used to convert a linear regression model into a logistic regression model: it maps the real-valued regression output into $(0, 1)$ so it can be used as a classifier.

tanh

$$\tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}$$

ReLU

$$\text{ReLU}(z) = \max(0, z)$$

Leaky ReLU

$$\text{LeakyReLU}(z) = \max(\alpha z, z) \quad \text{for a small } \alpha \text{ (e.g. } 0.01\text{)}$$

Maxout

$$\text{maxout}(\mathbf{x}) = \max_i\,(\mathbf{w}_i^\top \mathbf{x} + b_i)$$

ELU

$$\text{ELU}(z) = \begin{cases} z & z \geq 0 \\ \alpha(e^{z} - 1) & z < 0 \end{cases}$$
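
A quick numpy sketch of these activations (the $\alpha$ defaults and the form of maxout are illustrative choices, not fixed by these notes):

```python
import numpy as np

def sign(z):
    return np.where(z >= 0, 1.0, -1.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    return np.where(z >= 0, z, alpha * z)

def elu(z, alpha=1.0):
    return np.where(z >= 0, z, alpha * (np.exp(z) - 1))

def maxout(x, W, b):
    # maxout returns the largest of several linear functions of the input
    return np.max(W @ x + b)

z = np.linspace(-3, 3, 7)
print(sigmoid(z))      # smooth values in (0, 1)
print(relu(z))         # 0 for negative inputs, identity for positive ones
print(np.tanh(z))      # tanh is available directly in numpy
```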

Neural network

(Diagram: a neural network in which each layer of neurons computes a weighted sum of its inputs plus a bias, applies an activation, and passes the result to the next layer until the outputs are produced.)

Single-Layer

We can use a single layer of neurons to simulate simple boolean functions.

For example, given an OR function, we have the following inputs and outputs:

| $x_1$ | $x_2$ | OR |
| --- | --- | --- |
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 1 |

We can derive the relevant weights by considering the model:

$$\hat{y} = \begin{cases} 1 & \text{if } w_1 x_1 + w_2 x_2 + b \geq 0 \\ 0 & \text{otherwise} \end{cases}$$

Thus, we can get the following inequalities from the inputs:

$$b < 0, \qquad w_2 + b \geq 0, \qquad w_1 + b \geq 0, \qquad w_1 + w_2 + b \geq 0$$

We can then derive a set of weights that satisfies these criteria, e.g. $w_1 = w_2 = 1$, $b = -0.5$.
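
A tiny check in Python that this choice of weights ($w_1 = w_2 = 1$, $b = -0.5$, one of many valid solutions) reproduces the OR truth table:

```python
import numpy as np

w1, w2, b = 1.0, 1.0, -0.5            # weights satisfying all four inequalities
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
out = (w1 * X[:, 0] + w2 * X[:, 1] + b >= 0).astype(int)
print(out)                            # [0 1 1 1], matching the OR column above
```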

Multi-Layer

However, some boolean functions are not linearly separable, like XNOR.

We can then model these functions by using multiple layers of neurons - for example:

  • XNOR = OR(AND, NOR): one hidden neuron computes AND, another computes NOR, and the output neuron ORs the two hidden values (see the sketch below)

(Diagram: the inputs feed AND and NOR hidden neurons, which feed a single output neuron.)
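
A sketch of this two-layer construction in Python, with hand-picked weights (one of many valid choices) for the AND, NOR, and OR neurons:

```python
def step(z):
    # threshold activation: 1 if z >= 0, else 0
    return int(z >= 0)

def xnor(x1, x2):
    h_and = step(x1 + x2 - 1.5)        # hidden neuron computing AND(x1, x2)
    h_nor = step(-x1 - x2 + 0.5)       # hidden neuron computing NOR(x1, x2)
    return step(h_and + h_nor - 0.5)   # output neuron: OR of the two hidden values

for a in (0, 1):
    for b in (0, 1):
        print(a, b, xnor(a, b))        # prints 1 only when a == b
```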

Neural network vs Logistic/linear regression model

Logistic/linear regression relies on manual feature engineering to capture complex patterns, while a multi-layer neural network learns its own feature representations through its hidden layers and non-linear activations.

For example, the XNOR model can have hidden layers to simulate the NOR and AND layers, while feature engineering would be needed to capture this pattern in the logistic regression model (e.g. a new feature $x_1 x_2$ derived from $x_1, x_2$).

Forward Propagation

Forward propagation

Process in a neural network where the input data is passed through the network’s layers to generate an output.

Forward propagation is used to make predictions.

(Diagram: the input is passed through successive layers with weights $W^{[1]}, \dots, W^{[L]}$ during forward propagation.)

Matrix multiplication can be used to get the outputs here. For example, imagining the model above with a single layer of weights $W^{[1]}$ (and no other layers):

$$\hat{\mathbf{y}} = g(W^{[1]}\mathbf{x} + \mathbf{b}^{[1]})$$

where each row of $W^{[1]}$ holds the weights of one neuron and $g$ is the activation applied element-wise.
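
A minimal numpy forward pass for a network with one hidden layer; the layer sizes, random weights, and sigmoid activation are arbitrary choices for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)   # layer 1: 3 inputs -> 4 hidden units
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)   # layer 2: 4 hidden units -> 2 outputs

x = np.array([0.5, -1.0, 2.0])                  # a single input vector

a1 = sigmoid(W1 @ x + b1)                       # hidden layer activations
y_hat = sigmoid(W2 @ a1 + b2)                   # network output
print(y_hat)
```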

Multi-class classification

Multi-class classification can be done with a neural network by using one output neuron per class.

(Diagram: an output layer producing a prediction for class 1, class 2, …, class $c$, followed by a softmax activation.)

Given a vector $\mathbf{z} = (z_1, \dots, z_c)$, the softmax function computes the output for each class $i$ as:

$$\text{softmax}(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j=1}^{c} e^{z_j}}$$

where $c$ is the number of classes; the outputs are non-negative and sum to 1, so they can be read as class probabilities.
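
A small numpy sketch of the softmax computation (the max-subtraction is a standard numerical-stability trick, not part of the formula above):

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)                  # shift for numerical stability; result unchanged
    e = np.exp(z)
    return e / np.sum(e)

scores = np.array([2.0, 1.0, 0.1])     # raw outputs, one per class
probs = softmax(scores)
print(probs, probs.sum())               # non-negative values summing to 1
```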

Gradient Descent

Chain rule

Given a composition of functions, we can compute a derivative by multiplying the derivatives of the parts. For example, given:

$$y = f(u), \qquad u = g(x)$$

$y$ can be differentiated with respect to $x$:

$$\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}$$
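
As a quick worked example (with an arbitrary choice of functions), take $y = u^2$ and $u = 3x + 1$:

$$\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx} = 2u \cdot 3 = 6(3x + 1)$$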

Multiple input

Given an equation,

$$y = f\big(g(x),\, h(x)\big)$$

we can still get the derivative of $y$ w.r.t. $x$ by first introducing the intermediate variables:

$$u = g(x), \qquad v = h(x)$$

and then get the derivative by summing over both paths from $x$ to $y$:

$$\frac{dy}{dx} = \frac{\partial y}{\partial u} \cdot \frac{du}{dx} + \frac{\partial y}{\partial v} \cdot \frac{dv}{dx}$$
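
For instance (again with arbitrarily chosen functions), take $y = uv$ with $u = x^2$ and $v = e^{x}$:

$$\frac{dy}{dx} = \frac{\partial y}{\partial u}\frac{du}{dx} + \frac{\partial y}{\partial v}\frac{dv}{dx} = v \cdot 2x + u \cdot e^{x} = e^{x}(2x + x^2)$$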

Gradient computation

Singular neuron

Given a singular neuron, we can

  • generate the predicted value for a given data point
  • define the loss function
  • differentiate using chain rule

For example, given the sigmoid activation function $g(z) = \sigma(z)$ and the squared loss, we can define:

$$z = \mathbf{w}^\top \mathbf{x} + b, \qquad \hat{y} = \sigma(z), \qquad L = \tfrac{1}{2}(\hat{y} - y)^2$$

and then compute the gradient of the loss function using the chain rule:

$$\frac{\partial L}{\partial \mathbf{w}} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} \cdot \frac{\partial z}{\partial \mathbf{w}} = (\hat{y} - y)\,\sigma(z)\big(1 - \sigma(z)\big)\,\mathbf{x}$$
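
A numpy sketch of this gradient for a single sigmoid neuron with squared loss (the data point, label, and weights are made up for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([1.0, 2.0])     # a single data point
y = 1.0                      # its label
w = np.array([0.1, -0.2])    # current weights
b = 0.0                      # current bias

z = w @ x + b
y_hat = sigmoid(z)
loss = 0.5 * (y_hat - y) ** 2

# Chain rule: dL/dw = dL/dy_hat * dy_hat/dz * dz/dw
dL_dyhat = y_hat - y
dyhat_dz = sigmoid(z) * (1 - sigmoid(z))
grad_w = dL_dyhat * dyhat_dz * x
grad_b = dL_dyhat * dyhat_dz
print(loss, grad_w, grad_b)
```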

Backpropagation

Used to compute the gradient of the loss function with respect to each weight.

(Diagram: a small network taking inputs $x_1, x_2$. The forward pass computes $z_1, z_2$, the activations $a_1 = g(z_1)$, $a_2 = g(z_2)$, and combines them via weights $w_3, w_4$ into the output $y$. The backward pass computes $dL/dy$, then $dL/da_1, dL/da_2$, $dL/dz_1, dL/dz_2$, and finally the derivatives with respect to the weights and inputs, reusing $w_1, w_2, w_3, w_4$ and the activation derivatives $g'(z_1), g'(z_2)$.)

Thus, to find the gradient for a weight such as $w_1$, we can consider the derivative of the loss with respect to that weight:

$$\frac{\partial L}{\partial w_1} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial a_1} \cdot \frac{\partial a_1}{\partial z_1} \cdot \frac{\partial z_1}{\partial w_1}$$

A forward pass gets us all the intermediary results as seen above, and a backward pass gets us all the intermediary derivatives of the loss with respect to each of those intermediary results.
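
A minimal sketch of one forward and one backward pass for a small network like the one in the diagram; the sigmoid activation, squared loss, and weight values are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([1.0, 0.5])              # inputs x1, x2
y = 1.0                               # target
W1 = np.array([[0.1, 0.2],            # weights into the two hidden units
               [-0.3, 0.4]])
w2 = np.array([0.5, -0.6])            # weights from hidden units to the output

# Forward pass: store every intermediate result
z = W1 @ x                            # z1, z2
a = sigmoid(z)                        # a1, a2
zb = w2 @ a                           # pre-activation of the output
y_hat = sigmoid(zb)
loss = 0.5 * (y_hat - y) ** 2

# Backward pass: intermediate derivatives of the loss
dL_dyhat = y_hat - y
dL_dzb = dL_dyhat * y_hat * (1 - y_hat)
dL_da = dL_dzb * w2                   # dL/da1, dL/da2
dL_dz = dL_da * a * (1 - a)           # dL/dz1, dL/dz2
dL_dW1 = np.outer(dL_dz, x)           # gradients for the first-layer weights
dL_dw2 = dL_dzb * a                   # gradients for the output weights
print(dL_dW1, dL_dw2)
```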

Issues

Overfitting

Dropout

Dropout prevents overfitting by randomly setting some neuron outputs to 0 during training. This stops the network from simply memorising the patterns seen in the training data, since no single neuron can be relied on.
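
A sketch of (inverted) dropout applied to a layer's activations during training; the keep probability and activations are arbitrary:

```python
import numpy as np

def dropout(a, keep_prob=0.8, training=True, rng=np.random.default_rng(0)):
    """Randomly zero out activations during training; scale so the expected value is unchanged."""
    if not training:
        return a                                   # no dropout at prediction time
    mask = rng.random(a.shape) < keep_prob         # keep each unit with probability keep_prob
    return a * mask / keep_prob

a = np.array([0.2, 0.9, 0.4, 0.7])
print(dropout(a))      # some entries set to 0, the rest scaled up
```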

Early Stopping

While training, monitor the loss on a validation set and stop training when the validation loss stops decreasing (i.e. is at its minimum), even if the training loss is still going down.
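
A sketch of the early-stopping decision applied to a made-up sequence of validation losses; the loss values and the patience parameter are illustrative:

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the epoch with the best validation loss, stopping once it fails to improve."""
    best, best_epoch, waited = float("inf"), 0, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch, waited = loss, epoch, 0   # new best: reset the counter
        else:
            waited += 1
            if waited >= patience:                      # no improvement for `patience` epochs
                break
    return best_epoch

# Simulated validation losses: they improve, then start rising (overfitting)
val_losses = [0.9, 0.7, 0.55, 0.5, 0.48, 0.49, 0.52, 0.56, 0.6, 0.66]
print(early_stopping_epoch(val_losses))   # 4: the epoch where validation loss was lowest
```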

Vanishing/Exploding Gradient

Vanishing gradient

Small gradients are multiplied together repeatedly during backpropagation until they become (effectively) zero, so the early layers stop learning.

Solution

Change activation functions, e.g. use ReLU instead of sigmoid/tanh, whose gradients saturate.

Exploding gradient

Large gradients are multiplied together repeatedly until they overflow, making training unstable.

Solution

Clip gradients to a fixed range or maximum norm.
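
A sketch of clipping gradients, either element-wise to a range or by rescaling to a maximum norm; the thresholds are arbitrary:

```python
import numpy as np

def clip_by_value(grad, limit=1.0):
    """Clip each gradient entry into [-limit, limit]."""
    return np.clip(grad, -limit, limit)

def clip_by_norm(grad, max_norm=1.0):
    """Rescale the whole gradient vector if its norm exceeds max_norm."""
    norm = np.linalg.norm(grad)
    return grad if norm <= max_norm else grad * (max_norm / norm)

g = np.array([3.0, -4.0])                 # an exploding gradient (norm 5)
print(clip_by_value(g))                   # [ 1. -1.]
print(clip_by_norm(g))                    # [ 0.6 -0.8], norm 1
```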