Master the Transformer encoder-decoder architecture. Learn to implement cross-attention and build complete Seq2Seq models for production-grade AI applications.
Previously in this course, we explored Implementing Multi-Head Attention and Positional Encoding Architectures. While those components form the individual "gears" of the Transformer, we haven't yet addressed how to assemble them into the classic encoder-decoder structure that powers machine translation, summarization, and other Seq2Seq tasks.
In this lesson, we move beyond individual layers to build the full Transformer architecture, specifically focusing on the cross-attention mechanism that allows the decoder to "look back" at the encoder's output.
The original Transformer architecture, often called the "Vaswani architecture," consists of two distinct stacks. The Encoder processes the input sequence into a rich, contextualized representation, while the Decoder generates an output sequence token-by-token, conditioned on both the previous output tokens and the encoder's representation.
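The key to this communication is the Cross-Attention layer. Unlike Self-Attention, where Query, Key, and Value come from the same sequence, Cross-Attention uses:
- Query (Q): projected from the decoder's hidden states.
- Key (K) and Value (V): projected from the encoder's output (the "memory").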
Cross-attention performs the same scaled dot-product computation as standard self-attention; only the source of the K and V tensors changes. Here is how we define this in PyTorch:
```python
import torch
import torch.nn as nn


class CrossAttention(nn.Module):
    def __init__(self, d_model, n_head):
        super().__init__()
        self.mha = nn.MultiheadAttention(
            embed_dim=d_model, num_heads=n_head, batch_first=True
        )

    def forward(self, x, encoder_output, mask=None):
        # x: decoder input (queries), shape (batch, tgt_len, d_model)
        # encoder_output: memory from the encoder (keys and values),
        #   shape (batch, src_len, d_model)
        # mask: optional attention mask of shape (tgt_len, src_len)
        # Q is projected from x; K and V are projected from encoder_output.
        attn_output, _ = self.mha(
            query=x, key=encoder_output, value=encoder_output, attn_mask=mask
        )
        return attn_output
```
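To make the tensor shapes concrete, here is a small shape check. It is a sketch that calls `nn.MultiheadAttention` directly rather than the `CrossAttention` wrapper, and the batch size, sequence lengths, and variable names are illustrative assumptions, not values from this lesson.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative sizes (assumed for this example only)
batch, src_len, tgt_len, d_model, n_head = 2, 7, 5, 32, 4

mha = nn.MultiheadAttention(embed_dim=d_model, num_heads=n_head, batch_first=True)

decoder_states = torch.randn(batch, tgt_len, d_model)  # source of the queries
encoder_output = torch.randn(batch, src_len, d_model)  # source of keys and values

attn_output, attn_weights = mha(
    query=decoder_states, key=encoder_output, value=encoder_output
)

# One output vector per decoder position
print(attn_output.shape)   # torch.Size([2, 5, 32])
# One weight per (decoder position, encoder position) pair, averaged over heads
print(attn_weights.shape)  # torch.Size([2, 5, 7])
```

The output keeps the decoder's sequence length, while each row of the attention weights is a distribution over encoder positions. This is why the source and target sequences can have different lengths.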
To construct the full model, we wrap these blocks into a high-level class. This architecture is the foundation for the project milestone we will tackle in the next lesson.
Flow diagram: Input Sequence → Encoder Stack → Encoder Memory; Target Sequence → Decoder Stack; Encoder Memory → (K, V) → Cross-Attention; Decoder Stack → Cross-Attention; Cross-Attention → Output Probability
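The data flow above could be wired up roughly as follows. This is a minimal sketch, not the course's reference implementation: it relies on PyTorch's built-in `nn.Transformer` for the two stacks, the class name `Seq2SeqTransformer` and all hyperparameter defaults are hypothetical, and positional encoding is omitted for brevity.

```python
import torch
import torch.nn as nn


class Seq2SeqTransformer(nn.Module):
    """Hypothetical wrapper: embeddings -> encoder/decoder stacks -> vocabulary logits."""

    def __init__(self, src_vocab, tgt_vocab, d_model=64, n_head=4,
                 num_layers=2, dim_ff=128):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, d_model)
        self.tgt_embed = nn.Embedding(tgt_vocab, d_model)
        # Built-in encoder and decoder stacks; the decoder layers contain
        # masked self-attention followed by cross-attention.
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=n_head,
            num_encoder_layers=num_layers, num_decoder_layers=num_layers,
            dim_feedforward=dim_ff, batch_first=True,
        )
        self.out_proj = nn.Linear(d_model, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        tgt_len = tgt_ids.size(1)
        # Causal mask: -inf above the diagonal blocks attention to future tokens
        causal_mask = torch.triu(
            torch.full((tgt_len, tgt_len), float("-inf"), device=tgt_ids.device),
            diagonal=1,
        )
        # Encoder memory is computed once from the input sequence
        memory = self.transformer.encoder(self.src_embed(src_ids))
        # Decoder reads the target tokens and attends to the memory
        hidden = self.transformer.decoder(
            self.tgt_embed(tgt_ids), memory, tgt_mask=causal_mask
        )
        return self.out_proj(hidden)  # unnormalized scores over the target vocabulary


model = Seq2SeqTransformer(src_vocab=50, tgt_vocab=40)
src_ids = torch.randint(0, 50, (2, 7))  # (batch, src_len)
tgt_ids = torch.randint(0, 40, (2, 5))  # (batch, tgt_len)
logits = model(src_ids, tgt_ids)
print(logits.shape)  # torch.Size([2, 5, 40])
```

Applying a softmax over the last dimension of `logits` gives the output probabilities shown at the end of the flow diagram.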
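Every decoder layer must perform these steps in sequence:
1. Masked self-attention over the target tokens seen so far.
2. Cross-attention, with queries from the decoder and keys and values from the encoder memory.
3. A position-wise feed-forward network.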
Construct a DecoderLayer class that accepts d_model and n_head.
1. Initialize nn.MultiheadAttention for the masked self-attention.
2. Add a CrossAttention layer using the code above.
3. Implement the forward pass, ensuring you apply a causal (triangular) mask to the self-attention layer but not to the cross-attention layer.
4. Verify that d_model in your encoder stack matches the d_model used in the decoder's cross-attention projection.

The Transformer encoder-decoder design is a sophisticated message-passing system. By decoupling the memory (Encoder) from the generation process (Decoder) and linking them via Cross-Attention, we enable the model to handle complex sequence-to-sequence mappings. Keep the encoder's memory intact throughout decoding: it is computed once from the input, and every decoder layer reads from it without modifying it.
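As one possible answer to the DecoderLayer exercise above, here is a hedged sketch. Several details are assumptions rather than part of the lesson: cross-attention uses `nn.MultiheadAttention` directly as a stand-in for the `CrossAttention` class so the block is self-contained, the `d_ff` parameter is hypothetical, and the residual connections with post-layer normalization follow the common convention from the original Transformer.

```python
import torch
import torch.nn as nn


class DecoderLayer(nn.Module):
    def __init__(self, d_model, n_head, d_ff=None):
        super().__init__()
        d_ff = d_ff or 4 * d_model  # assumed default width for the feed-forward block
        self.self_attn = nn.MultiheadAttention(d_model, n_head, batch_first=True)
        # Stand-in for the CrossAttention class defined earlier in the lesson
        self.cross_attn = nn.MultiheadAttention(d_model, n_head, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, x, encoder_output):
        tgt_len = x.size(1)
        # Boolean causal mask: True marks positions that may NOT be attended to
        causal_mask = torch.triu(
            torch.ones(tgt_len, tgt_len, dtype=torch.bool, device=x.device),
            diagonal=1,
        )
        # Step 1: masked self-attention over the target sequence
        sa_out, _ = self.self_attn(x, x, x, attn_mask=causal_mask)
        x = self.norm1(x + sa_out)
        # Step 2: cross-attention; no causal mask, the full encoder memory is visible
        ca_out, _ = self.cross_attn(x, encoder_output, encoder_output)
        x = self.norm2(x + ca_out)
        # Step 3: position-wise feed-forward network
        return self.norm3(x + self.ffn(x))


layer = DecoderLayer(d_model=32, n_head=4)
decoder_input = torch.randn(2, 5, 32)   # (batch, tgt_len, d_model)
encoder_memory = torch.randn(2, 7, 32)  # (batch, src_len, d_model)
print(layer(decoder_input, encoder_memory).shape)  # torch.Size([2, 5, 32])
```

Because both attention modules are built with the same `d_model`, an encoder output with a different last dimension would fail at the key and value projections, which is the mismatch the final exercise step asks you to guard against.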
Up next: We will begin our Project Milestone: Custom Architecture Setup, where we move from theory to a production-ready model configuration.