Many-to-Many

Figure: sequence configurations built from a self-attention layer followed by an output layer: many-to-many (one output $y_t$ per input $x_t$), many-to-one (the hidden vectors $h_1, \dots, h_T$ are pooled, e.g. summed, before a single output $y_1$), one-to-many (masked self-attention generates $y_1, y_2, \dots$ autoregressively), and many-to-many with an encoder and a decoder (masked self-attention in the decoder plus cross-attention over the encoder's hidden vectors).
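A minimal sketch in NumPy of how one self-attention layer supports these configurations; the `self_attention` helper, the weight matrices, and the sizes are illustrative assumptions, not a specific library API.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv, mask=None):
    """Single-head self-attention: X is (T, d); returns one hidden vector per input."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])            # (T, T) attention scores
    if mask is not None:
        scores = scores + mask                          # mask entries are 0 or -inf
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # row-wise softmax
    return weights @ V                                  # weighted sums of the values

T, d = 4, 8                                             # illustrative sizes
rng = np.random.default_rng(0)
X = rng.normal(size=(T, d))                             # inputs x_1 ... x_T
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

H = self_attention(X, Wq, Wk, Wv)   # many-to-many: one h_t per x_t for the output layer
pooled = H.sum(axis=0)              # many-to-one: SUM the h_t, then apply a single output layer
```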

Masked Self-Attention Layer

Figure: masked self-attention computing $h_1$ from inputs $x_1, x_2$: each input is projected to a query, key, and value; the scores $a_{1,1}, a_{1,2}$ are added to a mask of $0$ / $-\infty$ entries before the softmax, and the normalized weights $a'_{1,1}, a'_{1,2}$ form a weighted sum of the values.

The mask depends on which hidden vector the attention layer is computing. For $h_t$, the mask sets the first $t$ elements to $0$ and the remaining elements to $-\infty$, so positions after $t$ receive zero weight after the softmax.

For example:

  • mask for $h_1$: $[0, -\infty, -\infty, \dots, -\infty]$
  • mask for $h_2$: $[0, 0, -\infty, \dots, -\infty]$
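A small sketch of how such a mask can be built (plain NumPy; the function name is just for illustration): entry $(t, s)$ is $0$ when $s \le t$ and $-\infty$ otherwise, which reproduces the example masks above row by row.

```python
import numpy as np

def causal_mask(T):
    """Row t is the mask used when computing h_t: 0 for positions <= t, -inf after."""
    mask = np.full((T, T), -np.inf)
    mask[np.tril_indices(T)] = 0.0
    return mask

print(causal_mask(3))
# [[  0. -inf -inf]    <- mask for h_1
#  [  0.   0. -inf]    <- mask for h_2
#  [  0.   0.   0.]]   <- mask for h_3
```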

Transformer

--- 
title: Transformer
--- 
flowchart TD 
	subgraph Encoder
	sa["Self-Attention"]-->ffn["Feed Forward Network"]
	end
	subgraph Decoder
	sa2["Self-Attention"]-->ffn2["Feed Forward Network"]
	ffn2-->endec["Encoder-Decoder Attention"]
	end
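A minimal sketch of this block structure using PyTorch's built-in layers (assuming PyTorch is available; all sizes are illustrative): the encoder layer applies self-attention followed by a feed-forward network, and the decoder layer applies masked self-attention, then encoder-decoder attention over the encoder output, then its own feed-forward network.

```python
import torch
import torch.nn as nn

d_model, nhead, T_src, T_tgt = 64, 4, 10, 7            # illustrative sizes

# Encoder block: self-attention -> feed-forward network.
enc_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
# Decoder block: masked self-attention -> encoder-decoder attention -> feed-forward network.
dec_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)

src = torch.randn(1, T_src, d_model)                   # encoder input sequence
tgt = torch.randn(1, T_tgt, d_model)                   # decoder input sequence

memory = enc_layer(src)                                # encoder output fed to cross-attention
causal = torch.triu(torch.full((T_tgt, T_tgt), float("-inf")), diagonal=1)  # 0 / -inf mask
out = dec_layer(tgt, memory, tgt_mask=causal)          # (1, T_tgt, d_model)
```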