Many-to-Many
Masked Self-Attention Layer
The mask depends on the position of the current input in the sequence. For the $t$-th input, the mask sets the first $t$ elements to $0$ and the remaining elements to $-\infty$, so that after the softmax, positions beyond $t$ receive zero attention weight.
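A minimal NumPy sketch of such an additive mask (the function name and the use of -inf as the masked value are illustrative assumptions, not from the original text):

```python
import numpy as np

def causal_mask(seq_len: int) -> np.ndarray:
    """Additive attention mask: each row has 0 for the positions that
    query may attend to (itself and earlier positions) and -inf elsewhere."""
    mask = np.full((seq_len, seq_len), -np.inf)
    mask[np.tril_indices(seq_len)] = 0.0  # allow attention to the current and past positions
    return mask

# Adding this matrix to the raw attention scores before the softmax
# drives the weights on future positions to zero.
print(causal_mask(4))
```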
For example, the Transformer places a masked self-attention layer at the start of each decoder block:
---
title: Transformer
---
flowchart TD
subgraph Encoder
sa["Self-Attention"]-->ffn["Feed Forward Network"]
end
subgraph Decoder
sa2["Masked Self-Attention"]-->endec["Encoder-Decoder Attention"]
endec-->ffn2["Feed Forward Network"]
end
ffn-->endec
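To make the decoder path concrete, here is a minimal PyTorch sketch of one decoder block following the ordering in the diagram (the class name, dimensions, and use of nn.MultiheadAttention are assumptions for illustration; layer normalization is omitted to keep the example short):

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """One decoder block: masked self-attention, then encoder-decoder
    attention over the encoder output, then a feed-forward network."""

    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x: torch.Tensor, enc_out: torch.Tensor) -> torch.Tensor:
        # Masked self-attention: each position attends only to itself and earlier positions.
        t = x.size(1)
        causal = torch.triu(torch.full((t, t), float("-inf")), diagonal=1)
        x = x + self.self_attn(x, x, x, attn_mask=causal)[0]
        # Encoder-decoder attention: queries from the decoder, keys/values from the encoder output.
        x = x + self.cross_attn(x, enc_out, enc_out)[0]
        # Position-wise feed-forward network.
        return x + self.ffn(x)

# Example shapes: a batch of 2 sequences, decoder length 5, encoder length 7.
block = DecoderBlock()
out = block(torch.randn(2, 5, 512), torch.randn(2, 7, 512))
print(out.shape)  # torch.Size([2, 5, 512])
```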