Motivation
RNNs are used for sequential data, where:
Sequential data
Refers to data where the order of elements matters, and each element depends on its position in the sequence.
Examples of data where this matters are:
- text data (sentences)
- audio/video data
One-Hot Encoding
For each word, create a vector whose length equals the vocabulary size. Each word is assigned a unique index, and its vector is all zeros except for a 1 at the position of that index.
For example, given the sentence "we saw this saw" (vocabulary: we, saw, this), the one-hot encoded vectors look something like:

[
  # each column is the one-hot vector of one word of the sentence, in order
  [1, 0, 0, 0],  # we
  [0, 1, 0, 1],  # saw
  [0, 0, 1, 0],  # this
]
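As a quick sketch of this in plain Python (the helper name one_hot and the explicit vocabulary list are just illustrative choices):

sentence = ["we", "saw", "this", "saw"]
vocab = ["we", "saw", "this"]                      # one index per unique word
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    # all zeros except for a 1 at the word's vocabulary index
    vec = [0] * len(vocab)
    vec[index[word]] = 1
    return vec

encoded = [one_hot(w) for w in sentence]
# encoded == [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 1, 0]]
# stacking these vectors as columns gives the matrix shown above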
Why does a simple neural network not work?
For example, when predicting word classes in a sentence, grammar and syntax rules mean the same word can change meaning/class depending on its context: in "we saw this saw", the first "saw" is a verb and the second is a noun. A simple feed-forward network looks at each word in isolation and cannot use that context, so a recurrent neural network is needed.
Recurrent Neural Networks
Thus, at every time step $t$ the network keeps a hidden state $h_t$ that summarizes the inputs seen so far, and that hidden state is updated constantly with

$$h_t = f(W_h h_{t-1} + W_x x_t + b_h)$$

and the value of the prediction at step $t$ is

$$\hat{y}_t = g(W_y h_t + b_y)$$

where $f$ and $g$ are activation functions (for example, tanh and softmax).
The general architecture unrolls the same cell across the time steps, passing the hidden state from one step to the next; that hidden state is what captures the context information.

However, the prediction at time step $t$ requires all of the hidden states $h_1, \dots, h_{t-1}$ to be computed first, which is not parallelism-friendly.
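A minimal NumPy sketch of this recurrence (the tanh/linear activations, the dimensions, and the weight names W_x, W_h, W_y are assumptions for illustration); note that the loop must run strictly in order, which is exactly the parallelism problem noted above:

import numpy as np

def rnn_forward(xs, W_x, W_h, W_y, b_h, b_y):
    # xs: list of input vectors x_1 .. x_T (one per time step)
    h = np.zeros(W_h.shape[0])                     # initial hidden state h_0
    ys = []
    for x in xs:                                   # strictly sequential: h_t needs h_{t-1}
        h = np.tanh(W_x @ x + W_h @ h + b_h)       # update hidden state
        ys.append(W_y @ h + b_y)                   # prediction at this step
    return ys

# usage: 4 time steps of 3-dim one-hot inputs, hidden size 5, 3 output classes
rng = np.random.default_rng(0)
xs = [np.eye(3)[i] for i in [0, 1, 2, 1]]          # "we saw this saw"
W_x, W_h = rng.normal(size=(5, 3)), rng.normal(size=(5, 5))
W_y = rng.normal(size=(3, 5))
ys = rnn_forward(xs, W_x, W_h, W_y, np.zeros(5), np.zeros(3))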
Self Attention Layer
A self-attention layer captures contextual information from other parts of the input sequence.
The query and key weights are stored in the weight matrices $W_Q$ and $W_K$ (a third matrix, $W_V$, produces the values used in Step 4). Thus, for each feature $x_i$, the query is $q_i = W_Q x_i$. Then the key for another feature $x_j$ is $k_j = W_K x_j$. The attention score between the two is the dot product $a_{ij} = k_j^\top q_i$, which measures how strongly feature $i$ attends to feature $j$.
A simplified expansion of the self-attention layer over two features, $x_1$ and $x_2$, can be seen below:
Step 1: Linear projection
We can use matrix multiplication to generate the queries, keys, and values for all features at once: stacking the features as the columns of a matrix $X$, we have $Q = W_Q X$, $K = W_K X$, and $V = W_V X$. For example, over the two features this expands as shown below.
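Writing each feature, query, key, and value as a column (a symbolic sketch rather than a numeric one):

$$X = \begin{bmatrix} x_1 & x_2 \end{bmatrix}, \quad Q = W_Q X = \begin{bmatrix} q_1 & q_2 \end{bmatrix}, \quad K = W_K X = \begin{bmatrix} k_1 & k_2 \end{bmatrix}, \quad V = W_V X = \begin{bmatrix} v_1 & v_2 \end{bmatrix}$$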
Step 2: Compute attention scores
Similarly, with the same logic, since the attention score between a query $q_i$ and a key $k_j$ is just a dot product, all of the scores can be computed at once with one matrix multiplication, $A = K^\top Q$. The attention matrix will look something like the expansion below.
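For the two-feature example, again symbolically (column $i$ holds the scores of query $q_i$ against every key):

$$A = K^\top Q = \begin{bmatrix} k_1^\top q_1 & k_1^\top q_2 \\ k_2^\top q_1 & k_2^\top q_2 \end{bmatrix}$$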
Step 3: Apply softmax column-wise to the attention matrix
Each column is normalized so that the attention weights for a given query are positive and sum to 1.
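Written out (the symbol $\hat{A}$ for the normalized matrix is just a label chosen here):

$$\hat{A}_{ji} = \frac{\exp(A_{ji})}{\sum_{j'} \exp(A_{j'i})}$$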
Step 4: Get weighted sum
We can get a single output for each feature as a weighted sum of all the values, weighted by the normalized attention scores. Using matrix multiplication on the value matrix: $Y = V\hat{A}$, so the output for feature $i$ is $y_i = \sum_j \hat{A}_{ji}\, v_j$.
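Putting the four steps together, a minimal NumPy sketch of a single self-attention layer (the column-per-feature layout matches the steps above; the random weights, the dimensions, and the omission of scaling and multiple heads are simplifications for illustration):

import numpy as np

def self_attention(X, W_q, W_k, W_v):
    # X: (d_in, n) -- each column is one input feature
    Q = W_q @ X                              # Step 1: queries (d_k, n)
    K = W_k @ X                              #         keys    (d_k, n)
    V = W_v @ X                              #         values  (d_v, n)
    A = K.T @ Q                              # Step 2: attention scores (n, n)
    A_hat = np.exp(A)                        # Step 3: column-wise softmax
    A_hat /= A_hat.sum(axis=0, keepdims=True)
    return V @ A_hat                         # Step 4: weighted sum of values (d_v, n)

# usage: n = 4 features of dimension d_in = 3
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))
W_q, W_k, W_v = (rng.normal(size=(3, 3)) for _ in range(3))
Y = self_attention(X, W_q, W_k, W_v)
# unlike the RNN, every output column comes from the same matrix products,
# so all positions can be computed in parallel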
Positional Encoding
Self-attention on its own ignores the order of the features, so positional information is added to differentiate the same feature appearing at different positions.
Positional encoding
Explicitly injects positional information into the original input features.
The encoded input becomes $x_i' = x_i + p_i$, where $p_i$ is the positional encoding vector for position $i$ (same dimension $d$ as the input feature).

The positional encoding vector is generated using, for example, sine and cosine functions of different frequencies (the sinusoidal scheme from the Transformer):

$$p_{i,\,2k} = \sin\!\left(\frac{i}{10000^{2k/d}}\right), \qquad p_{i,\,2k+1} = \cos\!\left(\frac{i}{10000^{2k/d}}\right)$$
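A minimal NumPy sketch of this sinusoidal encoding added to the input (the 10000 base follows the standard Transformer formulation; treating features as columns matches the earlier sections and is an assumption here, as is the requirement that $d$ be even):

import numpy as np

def positional_encoding(n_positions, d):
    # p[i] is the length-d encoding vector for position i (d assumed even)
    positions = np.arange(n_positions)[:, None]       # (n, 1)
    k = np.arange(d // 2)[None, :]                    # (1, d/2)
    angles = positions / (10000 ** (2 * k / d))       # (n, d/2)
    p = np.zeros((n_positions, d))
    p[:, 0::2] = np.sin(angles)                       # even dimensions
    p[:, 1::2] = np.cos(angles)                       # odd dimensions
    return p

# usage: add the encoding to a (d, n) feature matrix X (columns are positions)
d, n = 6, 4
X = np.zeros((d, n))
X_encoded = X + positional_encoding(n, d).T           # x_i' = x_i + p_i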