Transformer Encoder-Decoder Design: Building Seq2Seq Models

Master the Transformer encoder-decoder architecture. Learn to implement cross-attention and build complete Seq2Seq models for production-grade AI applications.


Previously in this course, we explored Implementing Multi-Head Attention and Positional Encoding Architectures. While those components form the individual "gears" of the Transformer, we haven't yet addressed how to assemble them into the classic encoder-decoder structure that powers machine translation, summarization, and other Seq2Seq tasks.

In this lesson, we move beyond individual layers to build the full Transformer architecture, specifically focusing on the cross-attention mechanism that allows the decoder to "look back" at the encoder's output.

The Encoder-Decoder Anatomy

The original Transformer architecture, introduced by Vaswani et al. (2017) in "Attention Is All You Need," consists of two distinct stacks. The Encoder processes the input sequence into a rich, contextualized representation, while the Decoder generates an output sequence token by token, conditioned on both the previous output tokens and the encoder's representation.

The key to this communication is the Cross-Attention layer. Unlike Self-Attention, where Query, Key, and Value come from the same sequence, Cross-Attention uses:

Queries (Q): From the previous Decoder layer.
Keys (K): From the final Encoder output.
Values (V): From the final Encoder output.
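
To make the roles of Q, K, and V concrete, here is a minimal single-head sketch of the computation. The tensor sizes and the projection names (w_q, w_k, w_v) are illustrative assumptions, not part of the lesson's code; the point is only that the attention weights have shape (batch, tgt_len, src_len), one distribution over source positions for each target position.

```python
import math
import torch

torch.manual_seed(0)

# Illustrative sizes (assumptions): target and source lengths differ on purpose.
batch, tgt_len, src_len, d_model = 2, 3, 5, 8

decoder_states = torch.randn(batch, tgt_len, d_model)  # source of the queries
encoder_output = torch.randn(batch, src_len, d_model)  # source of keys and values

# Randomly initialised projections, purely for illustration.
w_q = torch.nn.Linear(d_model, d_model, bias=False)
w_k = torch.nn.Linear(d_model, d_model, bias=False)
w_v = torch.nn.Linear(d_model, d_model, bias=False)

q = w_q(decoder_states)  # (batch, tgt_len, d_model)
k = w_k(encoder_output)  # (batch, src_len, d_model)
v = w_v(encoder_output)  # (batch, src_len, d_model)

# Scaled dot-product attention: every target position scores every source position.
scores = q @ k.transpose(-2, -1) / math.sqrt(d_model)  # (batch, tgt_len, src_len)
weights = torch.softmax(scores, dim=-1)                # each row sums to 1
context = weights @ v                                  # (batch, tgt_len, d_model)

print(weights.shape, context.shape)
```

Note that the output keeps the decoder's sequence length: cross-attention reads from the encoder but writes into the decoder's positions.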

The Structural Flow

Encoder: A stack of $N$ identical layers, each containing a Multi-Head Self-Attention sub-layer and a Feed-Forward network. Each sub-layer is wrapped in a residual connection followed by layer normalization (see Residual Connections and Gradient Stability).
Decoder: A stack of $N$ identical layers, each containing:
- Masked Self-Attention (to prevent "peeking" into the future).
- Cross-Attention (the bridge to the encoder).
- Feed-Forward network.
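
The "no peeking" rule in the masked self-attention sub-layer comes down to an upper-triangular mask. The sketch below is one common way to build it; the helper name causal_mask is hypothetical, and the convention (True means "blocked") is chosen to match what nn.MultiheadAttention expects for boolean attn_mask tensors.

```python
import torch

def causal_mask(size: int) -> torch.Tensor:
    """Boolean (size, size) mask; True marks positions that must not be attended to."""
    return torch.triu(torch.ones(size, size, dtype=torch.bool), diagonal=1)

mask = causal_mask(4)

# With uniform scores, masking makes position i spread its attention
# evenly over positions 0..i and give zero weight to anything later.
scores = torch.zeros(4, 4)
weights = torch.softmax(scores.masked_fill(mask, float("-inf")), dim=-1)

print(mask)
print(weights)
```

Row i of the resulting weights is non-zero only in columns 0 through i, which is exactly the constraint the decoder needs during training with teacher forcing.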

Implementing the Cross-Attention Bridge

The cross-attention mechanism is identical to standard self-attention, but the source of the K and V tensors changes. Here is how we define this in PyTorch:


PYTHON
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    def __init__(self, d_model, n_head):
        super().__init__()
        self.mha = nn.MultiheadAttention(embed_dim=d_model, num_heads=n_head, batch_first=True)

    def forward(self, x, encoder_output, memory_key_padding_mask=None):
        # x: decoder hidden states, shape (batch, tgt_len, d_model); source of the queries
        # encoder_output: encoder memory, shape (batch, src_len, d_model); source of keys and values
        # memory_key_padding_mask: optional bool tensor (batch, src_len), True at padded source positions
        # nn.MultiheadAttention applies the Q, K, and V projections internally
        attn_output, _ = self.mha(
            query=x,
            key=encoder_output,
            value=encoder_output,
            key_padding_mask=memory_key_padding_mask,
            need_weights=False,
        )
        return attn_output

Assembling the Seq2Seq Model

To construct the full model, we wrap these blocks into a high-level class. This architecture is the foundation for the project milestone we will tackle in the next lesson.


Flow diagram: Input Sequence → Encoder Stack → Encoder Memory; Target Sequence → Decoder Stack; Encoder Memory → (K, V) → Cross-Attention; Decoder Stack → Cross-Attention; Cross-Attention → Output Probabilities
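
As a rough sketch of what such a high-level class might look like, the example below leans on PyTorch's built-in nn.Transformer for the encoder and decoder stacks rather than hand-written layers. The class name Seq2SeqTransformer, the vocabulary sizes, the learned positional embedding shared by source and target, and the hyperparameters are all assumptions made for illustration.

```python
import torch
import torch.nn as nn

class Seq2SeqTransformer(nn.Module):
    """Hypothetical wrapper: embeddings -> nn.Transformer -> vocabulary logits."""

    def __init__(self, src_vocab, tgt_vocab, d_model=64, n_head=4, num_layers=2, max_len=128):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, d_model)
        self.tgt_embed = nn.Embedding(tgt_vocab, d_model)
        # Assumption: one learned positional table shared by source and target.
        self.pos_embed = nn.Embedding(max_len, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model,
            nhead=n_head,
            num_encoder_layers=num_layers,
            num_decoder_layers=num_layers,
            dim_feedforward=4 * d_model,
            batch_first=True,
        )
        self.generator = nn.Linear(d_model, tgt_vocab)

    def _embed(self, embed, tokens):
        positions = torch.arange(tokens.size(1), device=tokens.device)
        return embed(tokens) + self.pos_embed(positions)

    def forward(self, src, tgt):
        tgt_len = tgt.size(1)
        # Causal mask for decoder self-attention only (True = blocked).
        causal = torch.triu(
            torch.ones(tgt_len, tgt_len, dtype=torch.bool, device=tgt.device), diagonal=1
        )
        hidden = self.transformer(
            self._embed(self.src_embed, src),
            self._embed(self.tgt_embed, tgt),
            tgt_mask=causal,
        )
        return self.generator(hidden)  # (batch, tgt_len, tgt_vocab)

model = Seq2SeqTransformer(src_vocab=100, tgt_vocab=120).eval()
src = torch.randint(0, 100, (2, 7))
tgt = torch.randint(0, 120, (2, 5))
with torch.no_grad():
    logits = model(src, tgt)
print(logits.shape)
```

The output has one row of logits per target position; a softmax over the last dimension gives the "Output Probabilities" node of the diagram. No mask is passed for the encoder memory here because the toy batch has no padding.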

The Decoder Layer Structure

Every decoder layer must perform these steps in sequence:

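Masked Self-Attention: Ensures the model only attends to the current and earlier target positions.
Add & Norm: Residual connection followed by layer normalization, as discussed in Normalization Techniques at Scale.
Cross-Attention: Injects context from the encoder.
Add & Norm: Same residual-plus-normalization step, applied to the cross-attention output.
Feed-Forward: A position-wise network; the choices covered in Gating Units and Activation Functions apply here.
Add & Norm: Applied once more to the feed-forward output.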

Hands-on Exercise

Construct a DecoderLayer class that accepts d_model and n_head.

Initialize an nn.MultiheadAttention for the masked self-attention.
Initialize a CrossAttention layer using the code above.
Implement the forward pass, ensuring you apply a causal (triangular) mask to the self-attention layer but not to the cross-attention layer.
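
After attempting the exercise, you can compare against the sketch below. It is one possible layout, not the only valid one: it assumes the post-norm ordering of the original Transformer, a ReLU feed-forward network with a 4x hidden width, and it calls nn.MultiheadAttention directly for cross-attention so the block runs on its own (the CrossAttention wrapper would slot in the same place).

```python
import torch
import torch.nn as nn

class DecoderLayer(nn.Module):
    def __init__(self, d_model, n_head, d_ff=None):
        super().__init__()
        d_ff = d_ff or 4 * d_model  # assumption: 4x hidden width
        self.self_attn = nn.MultiheadAttention(d_model, n_head, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_head, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, x, encoder_output):
        tgt_len = x.size(1)
        # Causal mask for self-attention only (True = blocked).
        causal = torch.triu(
            torch.ones(tgt_len, tgt_len, dtype=torch.bool, device=x.device), diagonal=1
        )
        attn, _ = self.self_attn(x, x, x, attn_mask=causal, need_weights=False)
        x = self.norm1(x + attn)
        # Cross-attention: no causal mask; every source position is visible.
        attn, _ = self.cross_attn(x, encoder_output, encoder_output, need_weights=False)
        x = self.norm2(x + attn)
        return self.norm3(x + self.ffn(x))

torch.manual_seed(0)
layer = DecoderLayer(d_model=16, n_head=4).eval()
x = torch.randn(2, 5, 16)       # decoder input
memory = torch.randn(2, 7, 16)  # encoder output
out = layer(x, memory)
print(out.shape)
```

A quick sanity check for your own version: perturb the last target position of x and confirm that the outputs at earlier positions do not change. If they do, the causal mask is missing or inverted.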

Common Pitfalls

Masking Confusion: A common error is applying the causal mask to the Cross-Attention layer. Cross-attention is not causal; every decoder position should be able to attend to every position of the encoder's output. The only positions worth masking there are padding tokens in the source batch.
Dimensionality Mismatch: Ensure the d_model in your encoder stack matches the d_model used in the decoder's cross-attention projection.
Initialization: Transformers are sensitive to initial weights. If you haven't implemented Advanced Weight Initialization Strategies yet, do so before training this architecture; otherwise training may be unstable or slow to converge.
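
Returning to the masking pitfall: the mask that does belong in cross-attention is a padding mask over source positions, not a causal one. The sketch below assumes a single source sequence whose last two positions are padding and uses the key_padding_mask argument of nn.MultiheadAttention, where True marks keys to ignore.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, n_head = 16, 4
cross_attn = nn.MultiheadAttention(d_model, n_head, batch_first=True)

decoder_states = torch.randn(1, 3, d_model)  # (batch, tgt_len, d_model)
encoder_output = torch.randn(1, 5, d_model)  # (batch, src_len, d_model)

# Assumption for illustration: the last two source positions are padding.
src_padding = torch.tensor([[False, False, False, True, True]])

# No attn_mask is passed: every target position may look at every real source token.
_, weights = cross_attn(
    decoder_states, encoder_output, encoder_output, key_padding_mask=src_padding
)

# weights has shape (batch, tgt_len, src_len), averaged over heads.
print(weights)
```

The columns for the padded positions come out as zero, while the three real source tokens share all of the attention mass for every target position.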

Recap

The Transformer encoder-decoder design is a sophisticated message-passing system. By decoupling the memory (Encoder) from the generation process (Decoder) and linking them via Cross-Attention, we enable the model to handle complex sequence-to-sequence mappings. Keep in mind that the encoder's memory is computed once per input and then read, unchanged, by every decoder layer at every decoding step.

Up next: We will begin our Project Milestone: Custom Architecture Setup, where we move from theory to a production-ready model configuration.
