Motivation

RNNs are used for sequential data, where:

Sequential data

Refers to data where the order of elements matters, and each element depends on its position in the sequence.

Examples of data where this matters are:

  • text data (sentences)
  • audio/video data

One-Hot Encoding

For each word, create a vector with length equal to the vocabulary size. Each word is then assigned a unique index, and its corresponding vector is all zeros except for a 1 at that index.

For example: given the sentence we saw this saw, the vocabulary is {we, saw, this} (size 3), and the one-hot encoded sequence looks like:

[
	# words in order, each vector has length 3 (the vocabulary size)
	[1, 0, 0], # we
	[0, 1, 0], # saw
	[0, 0, 1], # this
	[0, 1, 0]  # saw
]
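As a quick sketch in Python (the word-to-index assignment below is an arbitrary illustrative choice):

# Build one-hot vectors for the sentence "we saw this saw".
sentence = ["we", "saw", "this", "saw"]

# Assign each unique word an index.
vocab = {"we": 0, "saw": 1, "this": 2}

one_hot = []
for word in sentence:
    vec = [0] * len(vocab)      # all zeros, length = vocabulary size
    vec[vocab[word]] = 1        # 1 at the word's index
    one_hot.append(vec)

print(one_hot)
# [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]]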

Why does a simple neural network not work?

For example, when predicting word classes in a sentence, grammatical and syntactic rules mean that the same word can change its meaning/class depending on context (in we saw this saw, the first saw is a verb and the second is a noun). A simple feed-forward network treats each word in isolation, so a recurrent neural network, which carries context from one word to the next, is needed.

Recurrent Neural Networks


At every time step $t$, the hidden state is represented as a function of the previous hidden state and the current input:

$$h_t = f_W(h_{t-1}, x_t)$$

and that hidden state is updated at every step with:

$$h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t + b_h)$$

and the value of the prediction is:

$$\hat{y}_t = W_{hy} h_t + b_y$$
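A minimal NumPy sketch of one recurrence step, assuming a 4-dimensional input, a 3-dimensional hidden state, and random weights (sizes chosen purely for illustration):

import numpy as np

rng = np.random.default_rng(0)

input_size, hidden_size, output_size = 4, 3, 2
W_xh = rng.standard_normal((hidden_size, input_size))   # input-to-hidden weights
W_hh = rng.standard_normal((hidden_size, hidden_size))  # hidden-to-hidden weights
W_hy = rng.standard_normal((output_size, hidden_size))  # hidden-to-output weights
b_h = np.zeros(hidden_size)
b_y = np.zeros(output_size)

def rnn_step(h_prev, x_t):
    # One recurrence step: update the hidden state, then predict y_t.
    h_t = np.tanh(W_hh @ h_prev + W_xh @ x_t + b_h)
    y_t = W_hy @ h_t + b_y          # logits; apply softmax for class probabilities
    return h_t, y_t

# Run over a short input sequence, carrying the hidden state forward.
h = np.zeros(hidden_size)           # hidden state at t = 0
for x in rng.standard_normal((5, input_size)):
    h, y = rnn_step(h, x)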

The general architecture looks like:

[Figure: unrolled RNN architectures. At each time step t, the input x_t and the previous hidden state feed an RNN layer, which produces the next hidden state and, through an output layer, the prediction y_t. Variants shown: many-to-many with Tx = Ty, one-to-many with Tx = 1, and many-to-many with Tx != Ty, where a first RNN encodes the input sequence and a second RNN decodes the output sequence.]

This architecture captures context information. However, the prediction at time step t requires all hidden states from steps 1..t-1 to be computed first, which is not parallelism-friendly.

Self Attention Layer

A self-attention layer captures contextual information from other parts of the input sequence.

[Figure: the attention weight between two features is computed from a query and a key.]

The query and key weights are stored in the weight matrices $W^Q$ and $W^K$.

Thus, for each feature $x_i$, the query can be calculated:

$$q_i = W^Q x_i$$

Then the key for another feature $x_j$ can be calculated:

$$k_j = W^K x_j$$

The attention score $a_{ij} = q_i \cdot k_j$ refers to how much context the feature $x_j$ provides to the feature $x_i$. A third weight matrix $W^V$ similarly produces a value $v_i = W^V x_i$ for each feature, which is used in the weighted sum below.

A simplified expansion of the self-attention layer over two features can be seen:

[Figure: self-attention expanded over two features x1 and x2. Each feature produces a query, key, and value (q1, k1, v1 and q2, k2, v2); the scores a11 and a12 are passed through a softmax to give a'11 and a'12, and a weighted sum of the values produces the output h1.]
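The same expansion written out in NumPy for two features (the dimensions, random weights, and plain dot-product score are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)

x1, x2 = rng.standard_normal((2, 4))                        # two input features
W_Q, W_K, W_V = [rng.standard_normal((3, 4)) for _ in range(3)]

# Per-feature queries, keys and values.
q1, k1, v1 = W_Q @ x1, W_K @ x1, W_V @ x1
q2, k2, v2 = W_Q @ x2, W_K @ x2, W_V @ x2

# Attention scores of feature 1 against both features, then softmax.
a11, a12 = q1 @ k1, q1 @ k2
e = np.exp([a11, a12])
a11_p, a12_p = e / e.sum()                                  # a'_11, a'_12

# Weighted sum of the values gives the first output h1.
h1 = a11_p * v1 + a12_p * v2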

Step 1: Linear projection

We can use matrix multiplication to generate the $Q$, $K$, and $V$ matrices in one step each:

$$Q = W^Q X, \qquad K = W^K X, \qquad V = W^V X$$

where the features $x_1, x_2, \dots$ are stacked as the columns of $X$.

For example:
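In NumPy, the projection might look like this minimal sketch (the feature dimension 4 and the query/key/value dimension 3 are assumptions chosen only for illustration):

import numpy as np

rng = np.random.default_rng(0)

X = rng.standard_normal((4, 2))     # columns of X are the input features x1, x2
W_Q = rng.standard_normal((3, 4))   # query projection weights
W_K = rng.standard_normal((3, 4))   # key projection weights
W_V = rng.standard_normal((3, 4))   # value projection weights

Q = W_Q @ X   # columns are the queries q1, q2
K = W_K @ X   # columns are the keys    k1, k2
V = W_V @ X   # columns are the values  v1, v2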

Step 2: Compute attention scores

Similarly, with the same logic, since the attention score is calculated as $a_{ij} = q_i \cdot k_j$, we can use the matrix multiplication:

$$A = K^\top Q$$

The attention matrix will look something like this:

$$A = \begin{bmatrix} a_{11} & a_{21} \\ a_{12} & a_{22} \end{bmatrix}$$

where column $i$ holds the scores for query $q_i$ against every key.
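A sketch of the same computation under the column-vector convention used above (shapes are again illustrative):

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 2))            # features x1, x2 as columns
Q = rng.standard_normal((3, 4)) @ X        # queries (as in step 1)
K = rng.standard_normal((3, 4)) @ X        # keys

A = K.T @ Q      # A[j, i] = k_j . q_i = a_ij
print(A.shape)   # (2, 2): one score per (query, key) pair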

Step 3: Apply softmax column-wise to the attention matrix
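A column-wise softmax might be implemented as in this sketch, where each column (the scores for one query) is normalised independently:

import numpy as np

def softmax_columns(A):
    # Normalise each column of the attention matrix independently.
    E = np.exp(A - A.max(axis=0, keepdims=True))   # subtract column max for stability
    return E / E.sum(axis=0, keepdims=True)

A = np.array([[1.0, 0.5],
              [2.0, 0.1]])
A_prime = softmax_columns(A)
print(A_prime.sum(axis=0))   # [1. 1.] -- every column sums to 1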

Step 4: Get weighted sum

We can get a single output by taking a weighted sum of the values:

$$h_1 = \sum_{j} a'_{1j} v_j$$

Use matrix multiplication on the value matrix $V$ and the softmaxed attention score matrix $A'$ to get all the outputs at once:

$$H = V A'$$
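Putting the four steps together in one illustrative sketch (same assumed shapes as before):

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 2))                       # features as columns
W_Q, W_K, W_V = [rng.standard_normal((3, 4)) for _ in range(3)]
Q, K, V = W_Q @ X, W_K @ X, W_V @ X                   # step 1: linear projections

A = K.T @ Q                                           # step 2: attention scores
E = np.exp(A - A.max(axis=0, keepdims=True))
A_prime = E / E.sum(axis=0, keepdims=True)            # step 3: column-wise softmax

H = V @ A_prime                                       # step 4: columns of H are h1, h2
print(H.shape)                                        # (3, 2)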

Positional Encoding

Self-attention on its own ignores the order of the input, so positional information must be added to differentiate between positions.

Positional encoding

Explicitly injects positional information into the original input features:

$$x_i' = x_i + p_i$$

where $p_i$ refers to the positional encoding vector for position $i$.

The positional encoding vector is generated using $\sin$ and $\cos$ functions:

$$p_{i,\,2k} = \sin\!\left(\frac{i}{10000^{2k/d}}\right), \qquad p_{i,\,2k+1} = \cos\!\left(\frac{i}{10000^{2k/d}}\right)$$

where $i$ is the position, $k$ indexes the dimension pairs, and $d$ is the dimension of the input features.
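A sketch of this encoding in NumPy, assuming the standard Transformer-style sinusoidal form (treat the base of 10000 and the even/odd split as assumptions if the original notes used a different variant):

import numpy as np

def positional_encoding(num_positions, d):
    # p[i, 2k] = sin(i / 10000^(2k/d)), p[i, 2k+1] = cos(i / 10000^(2k/d))
    positions = np.arange(num_positions)[:, None]   # column of positions i
    k = np.arange(0, d, 2)[None, :]                 # even dimension indices 2k
    angles = positions / (10000 ** (k / d))
    p = np.zeros((num_positions, d))
    p[:, 0::2] = np.sin(angles)                     # even dimensions
    p[:, 1::2] = np.cos(angles)                     # odd dimensions
    return p

X = np.random.randn(6, 8)                  # 6 input features of dimension 8
X_prime = X + positional_encoding(6, 8)    # inject positions: x'_i = x_i + p_i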