Many-to-Many

Figure: sequence configurations built from a self-attention layer followed by an output layer: many-to-many (one output $y_t$ per input $x_t$), many-to-one (the hidden vectors $h_1, \dots, h_T$ are pooled, e.g. summed, before a single output $y_1$), one-to-many (masked self-attention generates $y_1, y_2, \dots$ autoregressively), and many-to-many with an encoder and a decoder (masked self-attention in the decoder plus cross-attention over the encoder's hidden vectors).
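A minimal sketch in NumPy of how one self-attention layer supports these configurations; the `self_attention` helper, the weight matrices, and the sizes are illustrative assumptions, not a specific library API.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv, mask=None):
    """Single-head self-attention: X is (T, d); returns one hidden vector per input."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])            # (T, T) attention scores
    if mask is not None:
        scores = scores + mask                          # mask entries are 0 or -inf
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # row-wise softmax
    return weights @ V                                  # weighted sums of the values

T, d = 4, 8                                             # illustrative sizes
rng = np.random.default_rng(0)
X = rng.normal(size=(T, d))                             # inputs x_1 ... x_T
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

H = self_attention(X, Wq, Wk, Wv)   # many-to-many: one h_t per x_t for the output layer
pooled = H.sum(axis=0)              # many-to-one: SUM the h_t, then apply a single output layer
```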

Masked Self-Attention Layer

Figure: masked self-attention computing $h_1$ from inputs $x_1, x_2$: each input is projected to a query, key, and value; the scores $a_{1,1}, a_{1,2}$ are added to a mask of $0$ / $-\infty$ entries before the softmax, and the normalized weights $a'_{1,1}, a'_{1,2}$ form a weighted sum of the values.

The mask depends on which hidden vector the attention layer is computing. For $h_t$, the mask sets the first $t$ elements to $0$ and the remaining elements to $-\infty$, so positions after $t$ receive zero weight after the softmax.

For example:

  • mask for $h_1$: $[0, -\infty, -\infty, \dots, -\infty]$
  • mask for $h_2$: $[0, 0, -\infty, \dots, -\infty]$
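A small sketch of how such a mask can be built (plain NumPy; the function name is just for illustration): entry $(t, s)$ is $0$ when $s \le t$ and $-\infty$ otherwise, which reproduces the example masks above row by row.

```python
import numpy as np

def causal_mask(T):
    """Row t is the mask used when computing h_t: 0 for positions <= t, -inf after."""
    mask = np.full((T, T), -np.inf)
    mask[np.tril_indices(T)] = 0.0
    return mask

print(causal_mask(3))
# [[  0. -inf -inf]    <- mask for h_1
#  [  0.   0. -inf]    <- mask for h_2
#  [  0.   0.   0.]]   <- mask for h_3
```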

Transformer

--- 
title: Transformer
--- 
flowchart TD 
	subgraph Encoder
	sa["Self-Attention"]-->ffn["Feed Forward Network"]
	end
	subgraph Decoder
	sa2["Self-Attention"]-->ffn2["Feed Forward Network"]
	ffn2-->endec["Encoder-Decoder Attention"]
	end
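A minimal sketch of this block structure using PyTorch's built-in layers (assuming PyTorch is available; all sizes are illustrative): the encoder layer applies self-attention followed by a feed-forward network, and the decoder layer applies masked self-attention, then encoder-decoder attention over the encoder output, then its own feed-forward network.

```python
import torch
import torch.nn as nn

d_model, nhead, T_src, T_tgt = 64, 4, 10, 7            # illustrative sizes

# Encoder block: self-attention -> feed-forward network.
enc_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
# Decoder block: masked self-attention -> encoder-decoder attention -> feed-forward network.
dec_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)

src = torch.randn(1, T_src, d_model)                   # encoder input sequence
tgt = torch.randn(1, T_tgt, d_model)                   # decoder input sequence

memory = enc_layer(src)                                # encoder output fed to cross-attention
causal = torch.triu(torch.full((T_tgt, T_tgt), float("-inf")), diagonal=1)  # 0 / -inf mask
out = dec_layer(tgt, memory, tgt_mask=causal)          # (1, T_tgt, d_model)
```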