Learn to implement Rotary Positional Embeddings (RoPE) from scratch. We compare absolute, relative, and rotary methods for robust sequence length extrapolation.
Previously in this course, we explored Implementing Multi-Head Attention: A Deep Dive into Transformers, where we built the core mechanism that allows models to weigh the importance of different tokens. However, standard self-attention has no built-in notion of order: permuting the input tokens simply permutes the outputs, so the model effectively treats a sequence as a "bag of words." To make sense of order, we need positional encoding. This lesson moves beyond static offsets to implement Rotary Positional Embeddings (RoPE), a widely adopted choice for sequence modeling in current LLMs.
Transformers require a way to inject positional information because the attention mechanism computes compatibility scores between all pairs of tokens regardless of where they sit in the sequence.
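Historically, we used two main approaches: absolute positional encodings, which are added to the token embeddings, and relative positional encodings, which add a position-dependent bias to the attention scores.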
RoPE bridges the gap by encoding absolute positions using a rotation matrix, which naturally induces relative position information into the attention dot-product. Mathematically, RoPE rotates each query and key vector in 2D planes. The inner product of two rotated vectors becomes a function of their relative distance, effectively combining the benefits of absolute and relative approaches.
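The relative-distance property can be checked numerically in a single 2D plane. The sketch below is illustrative only: the helper names `rotate` and `score`, the angle `theta = 0.1`, and the example vectors are assumptions chosen for the demonstration, not values from a trained model.

```python
import cmath


def rotate(vec: complex, position: int, theta: float) -> complex:
    """Rotate a 2D vector (stored as a complex number) by position * theta."""
    return vec * cmath.exp(1j * position * theta)


def score(q: complex, k: complex) -> float:
    """Ordinary 2D dot product of two vectors stored as complex numbers."""
    return (q * k.conjugate()).real


theta = 0.1  # illustrative rotation frequency for this one plane
q, k = complex(1.0, 2.0), complex(0.5, -1.5)

# Same relative offset (3) at two very different absolute positions.
near = score(rotate(q, 5, theta), rotate(k, 2, theta))
far = score(rotate(q, 105, theta), rotate(k, 102, theta))

# A different relative offset (1) produces a different score.
other = score(rotate(q, 5, theta), rotate(k, 4, theta))

print(near, far, other)
```

The first two scores agree up to floating-point error because only the offset between the two positions survives in the product; the third differs because the offset changed.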
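To implement RoPE, we define a rotation frequency for each pair of dimensions. For a head dimension $d$ (which must be even), we group the dimensions into $d/2$ pairs. At position $m$, pair $i$ is rotated by the angle $m\theta_i$, where $\theta_i$ is that pair's frequency. Note that the `theta` argument in the code is the base used to derive these frequencies (10000 by default), not an individual $\theta_i$.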
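For reference, the usual RoPE formulation can be written out as follows. The position symbols $m$ and $n$, the block rotation $R_m$ applied at position $m$, and the base value 10000 (the default `theta` argument in the code) are notation assumed here for illustration.

$$
\theta_i = 10000^{-2i/d}, \qquad i = 0, 1, \dots, \frac{d}{2} - 1
$$

$$
\begin{pmatrix} x'_{2i} \\ x'_{2i+1} \end{pmatrix}
=
\begin{pmatrix} \cos m\theta_i & -\sin m\theta_i \\ \sin m\theta_i & \cos m\theta_i \end{pmatrix}
\begin{pmatrix} x_{2i} \\ x_{2i+1} \end{pmatrix}
$$

$$
(R_m q)^\top (R_n k) = q^\top R_m^\top R_n\, k = q^\top R_{n-m}\, k
$$

The last line is why the attention score depends only on the offset $n - m$: the transpose of a rotation is its inverse, so the two absolute rotations collapse into a single relative one.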
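```python
import torch
import torch.nn as nn


class RoPE(nn.Module):
    def __init__(self, dim: int, max_seq_len: int = 2048, theta: float = 10000.0):
        super().__init__()
        if dim % 2 != 0:
            raise ValueError("RoPE requires an even head dimension")
        # Precompute one frequency per pair of dimensions: theta^(-2i/dim)
        inv_freq = 1.0 / (theta ** (torch.arange(0, dim, 2).float() / dim))
        # Float positions keep the outer product in floating point
        t = torch.arange(max_seq_len, dtype=torch.float32)
        freqs = torch.outer(t, inv_freq)  # [max_seq_len, dim // 2]
        # Unit-magnitude complex numbers in polar form: e^{i * position * freq}
        self.emb = torch.polar(torch.ones_like(freqs), freqs)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x shape: [batch, seq_len, head_dim]
        # View adjacent pairs (x[2i], x[2i+1]) as complex numbers
        x_complex = torch.view_as_complex(x.float().reshape(*x.shape[:-1], -1, 2))
        # Multiplying by e^{i * angle} rotates each pair by its angle
        emb = self.emb[: x.shape[1], :].to(x.device)
        x_rotated = x_complex * emb.unsqueeze(0)
        # Back to real pairs, flattened to [batch, seq_len, head_dim]
        return torch.view_as_real(x_rotated).flatten(2).type_as(x)
```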
When integrating RoPE into your Multi-Head Attention layers, you apply the rotation to the Query ($Q$) and Key ($K$) tensors after the linear projection but before the dot-product calculation. The Value ($V$) tensors are not rotated, as they represent the content to be aggregated, not the positional relationship.
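A minimal sketch of that ordering for multi-head tensors of shape `[batch, heads, seq_len, head_dim]` might look like the following. The function names `rope_rotations`, `apply_rope`, and `rope_attention` are hypothetical helpers introduced for illustration, and the sketch omits causal masking and dropout for brevity.

```python
import math

import torch


def rope_rotations(seq_len: int, head_dim: int, base: float = 10000.0) -> torch.Tensor:
    """Complex rotations e^{i * m * theta_i}, shape [seq_len, head_dim // 2]."""
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    positions = torch.arange(seq_len, dtype=torch.float32)
    angles = torch.outer(positions, inv_freq)
    return torch.polar(torch.ones_like(angles), angles)


def apply_rope(x: torch.Tensor, rotations: torch.Tensor) -> torch.Tensor:
    """Rotate x of shape [batch, heads, seq_len, head_dim] pair by pair."""
    x_complex = torch.view_as_complex(x.float().reshape(*x.shape[:-1], -1, 2))
    # rotations ([seq_len, head_dim // 2]) broadcasts over batch and heads
    return torch.view_as_real(x_complex * rotations).flatten(-2).type_as(x)


def rope_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Scaled dot-product attention with RoPE applied to Q and K only."""
    rotations = rope_rotations(q.shape[-2], q.shape[-1])
    q, k = apply_rope(q, rotations), apply_rope(k, rotations)  # V is left as-is
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])
    return torch.softmax(scores, dim=-1) @ v


torch.manual_seed(0)
q, k, v = (torch.randn(2, 4, 6, 8) for _ in range(3))
out = rope_attention(q, k, v)
print(out.shape)
```

Because the rotation only changes direction within each 2D plane, it preserves the norm of every query and key vector, so the usual `1 / sqrt(head_dim)` scaling still applies.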
| Method | Sequence Extrapolation | Compute Cost | Implementation |
|---|---|---|---|
| Absolute | Poor | Negligible | Simple addition |
| Relative | Good | Moderate | Bias addition |
| RoPE | Good; typically needs frequency scaling beyond the training length | Low | Rotation of Q and K |
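A major challenge in modern LLM deployment is "context window extension." Because RoPE encodes position through periodic trigonometric functions with known frequencies, we can rescale those frequencies after training. Two common approaches are "Linear Scaling" (position interpolation), which divides the position index by a scale factor so that longer sequences map back into the position range seen during training, and "NTK-Aware Scaling," which instead increases the base (the `theta` argument, 10000 by default) so that low-frequency pairs are stretched more than high-frequency ones. Either way, the model can "stretch" its understanding of positions to sequences longer than those encountered during the initial training phase, often with some additional fine-tuning to recover quality.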
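Both scaling ideas can be sketched as small changes to the angle computation. In the sketch below, the function `scaled_rope_angles` and its `linear_factor` and `ntk_factor` parameters are hypothetical names introduced for illustration, and the exponent `head_dim / (head_dim - 2)` is the commonly cited NTK-aware choice, treated here as an assumption rather than a fixed rule.

```python
import torch


def scaled_rope_angles(
    positions: torch.Tensor,
    head_dim: int,
    base: float = 10000.0,
    linear_factor: float = 1.0,
    ntk_factor: float = 1.0,
) -> torch.Tensor:
    """Angles m * theta_i with optional linear or NTK-aware scaling.

    Assumes head_dim is even and greater than 2.
    """
    # NTK-aware scaling: enlarge the base so low frequencies stretch the most.
    scaled_base = base * ntk_factor ** (head_dim / (head_dim - 2))
    inv_freq = 1.0 / (scaled_base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    # Linear scaling: compress positions (equivalent to dividing inv_freq).
    return torch.outer(positions.float() / linear_factor, inv_freq)


positions = torch.arange(4096)
plain = scaled_rope_angles(positions, 64)
linear = scaled_rope_angles(positions, 64, linear_factor=2.0)
ntk = scaled_rope_angles(positions, 64, ntk_factor=2.0)

# Linear scaling: position 2m now gets the angle that position m had before.
print(torch.allclose(linear[0::2], plain[:2048]))
# NTK-aware scaling: the fastest pair is unchanged, the slowest is stretched.
print(torch.allclose(ntk[:, 0], plain[:, 0]))
```

With linear scaling every pair is compressed by the same factor, whereas the NTK-aware variant leaves the highest-frequency pair untouched and slows the lowest-frequency pair by roughly `ntk_factor`, which is one reason it is often described as gentler on short-range detail.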
Start from the `RoPE` class provided above.

Add a scaling parameter to the `__init__` method that allows you to divide the `inv_freq` by a factor of 2. Observe how this affects the dot-product values for positions beyond the initial `max_seq_len`.

Use `float32` or `bfloat16` to maintain numerical stability during long-sequence attention, even if your model is in `float16`.

When using `torch.view_as_complex`, ensure your hidden dimension is even. If you have an odd number of dimensions, you must either pad or ignore the last dimension.

We have moved beyond static positional embeddings, implementing the rotation-based mechanism that powers modern LLMs like Llama 3 and Mistral. By applying rotations to Query and Key vectors, we get relative positional awareness with the efficiency of absolute positioning, along with a practical path to extending sequence length.
Up next: We will assemble our components into the full Transformer architecture in our study of Transformer Encoder-Decoder Design.