Move beyond ReLU. Learn to implement SwiGLU activation layers from first principles to boost representational capacity in your next transformer architecture.
Previously in this course, we explored Residual Connections and Gradient Stability in Deep Learning to ensure deep networks remain trainable. Now that we have stable gradient flow, we must address the "what" of our nonlinearities. While ReLU served us well for a decade, modern LLMs rely on dynamic gating mechanisms to increase representational capacity.
Standard activations like ReLU or GeLU are point-wise operators: $f(x)$ depends only on the value of $x$ at a specific position. This is computationally cheap but structurally rigid.
Gated Linear Units (GLUs) change this by introducing a multiplicative interaction. A GLU computes two projections of the same input, $A = xW + b$ and $B = xV + c$, and applies a sigmoid $\sigma$ to the first so that it acts as a "gate" for the second, with $\otimes$ denoting the element-wise product: $$\text{GLU}(x, W, V, b, c) = \sigma(xW + b) \otimes (xV + c)$$
By using the input to control the flow of information through a secondary path, the network gains the ability to perform input-dependent feature selection.
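The GLU formula above can be sketched in a few lines of PyTorch. This is a minimal illustration, not a production layer: the sizes `d_in` and `d_hidden` and the names `gate_proj` and `value_proj` are assumptions chosen for the example. It also cross-checks the result against `torch.nn.functional.glu`, which splits its input in half along a dimension and multiplies the first half by the sigmoid of the second.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

d_in, d_hidden = 8, 16  # illustrative sizes, not taken from the lesson

# Two separate projections of the same input
gate_proj = nn.Linear(d_in, d_hidden)   # computes xW + b
value_proj = nn.Linear(d_in, d_hidden)  # computes xV + c

x = torch.randn(4, d_in)

# GLU as written in the formula: sigmoid(xW + b) * (xV + c)
gate = torch.sigmoid(gate_proj(x))
glu_out = gate * value_proj(x)

# Cross-check with the built-in: F.glu returns first_half * sigmoid(second_half),
# so the value path goes first and the gate path second
stacked = torch.cat([value_proj(x), gate_proj(x)], dim=-1)
glu_builtin = F.glu(stacked, dim=-1)

print(glu_out.shape)
print(torch.allclose(glu_out, glu_builtin, atol=1e-6))
```

Because the sigmoid keeps every gate entry between 0 and 1, each hidden feature of the value path is scaled down by an amount that depends on the input itself.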
The SwiGLU activation, popularized by the PaLM and LLaMA architectures, replaces the sigmoid gate with the Swish function, defined as $x \cdot \sigma(\beta x)$; SiLU is the special case $\beta = 1$. Used as the gate in a GLU, it gives a smooth, non-monotonic nonlinearity, which is often credited with easing optimization.
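To make the definition concrete, here is a small sketch that evaluates Swish on a grid, compares it with PyTorch's built-in `F.silu`, and locates the dip that makes the function non-monotonic. The helper name `swish` and the grid range are assumptions made for this example.

```python
import torch
import torch.nn.functional as F


def swish(x: torch.Tensor, beta: float = 1.0) -> torch.Tensor:
    """Swish as defined in the text: x * sigmoid(beta * x)."""
    return x * torch.sigmoid(beta * x)


x = torch.linspace(-6.0, 6.0, steps=1201)
y = swish(x)

# With beta = 1, Swish coincides with PyTorch's SiLU
matches_silu = torch.allclose(y, F.silu(x), atol=1e-6)

# Non-monotonic: the curve decreases on part of the negative axis,
# reaches a small negative minimum, then increases
has_decreasing_region = bool((y[1:] < y[:-1]).any())
min_value, min_index = y.min(dim=0)
x_at_min = x[min_index]

print(matches_silu, has_decreasing_region)
print(f"minimum {min_value.item():.3f} near x = {x_at_min.item():.2f}")
```

Unlike ReLU, small negative inputs are not clipped to exactly zero, so the gate still passes a weak (negative) signal and a nonzero gradient there.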
Here is how we implement the SwiGLU layer in PyTorch:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SwiGLU(nn.Module):
    def __init__(self, in_features, hidden_features):
        super().__init__()
        # Two parallel projections of the input: w1 feeds the gate, w2 the value path
        self.w1 = nn.Linear(in_features, hidden_features)
        self.w2 = nn.Linear(in_features, hidden_features)
        # Output projection back to the input width
        self.v = nn.Linear(hidden_features, in_features)

    def forward(self, x):
        # The same input x goes through both projections
        gate = F.silu(self.w1(x))
        value = self.w2(x)
        return self.v(gate * value)
```
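A common variant fuses the two input projections into a single linear layer of width `2 * hidden_features` and splits the result, which trades two matrix multiplications for one larger one. The sketch below is one way to write that, under the assumption that biases are kept as in the layer above; the class name `FusedSwiGLU` and the attribute `w12` are hypothetical. It also checks that the fused layer computes the same thing as two separate projections built from its weight slices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FusedSwiGLU(nn.Module):
    """Hypothetical variant: one matmul produces both the gate and value paths."""

    def __init__(self, in_features: int, hidden_features: int):
        super().__init__()
        # Single projection to 2 * hidden_features, split in forward()
        self.w12 = nn.Linear(in_features, 2 * hidden_features)
        self.v = nn.Linear(hidden_features, in_features)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate, value = self.w12(x).chunk(2, dim=-1)
        return self.v(F.silu(gate) * value)


torch.manual_seed(0)
layer = FusedSwiGLU(in_features=64, hidden_features=128)
x = torch.randn(2, 10, 64)  # (batch, sequence, embedding)
out = layer(x)
print(out.shape)  # torch.Size([2, 10, 64])

# Same computation written with two separate projections
w_gate, w_value = layer.w12.weight.chunk(2, dim=0)
b_gate, b_value = layer.w12.bias.chunk(2, dim=0)
manual = layer.v(F.silu(F.linear(x, w_gate, b_gate)) * F.linear(x, w_value, b_value))
print(torch.allclose(out, manual, atol=1e-5))
```

The layer maps the last dimension back to `in_features`, so it can stand in for a feed-forward block without changing the surrounding shapes.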
Why move away from ReLU? The primary difference lies in the interaction between dimensions.
| Feature | ReLU | GeLU | SwiGLU |
|---|---|---|---|
| Logic | Thresholding | Soft-thresholding | Input-dependent gating |
| Monotonicity | Monotonic | Non-monotonic | Non-monotonic |
| Capacity | Low | Medium | High |
| Computation | Minimal | Low | Moderate |
In a standard MLP block, ReLU simply zeroes out negative values. In contrast, SwiGLU lets the model scale each hidden feature by a gate computed from the same input, so which features pass through depends on the token being processed. This dynamic weighting is why the Feed-Forward Networks (FFNs) in many recent Transformer LLMs, including PaLM and LLaMA, use gated variants.
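To integrate this into our running project, we replace the standard FFN in our Transformer blocks. Instead of a simple Linear -> ReLU -> Linear stack, the block becomes two parallel input projections, a SiLU-gated element-wise product, and an output projection, which is exactly what the SwiGLU module above computes.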
Note that because SwiGLU uses three weight matrices instead of two, you must adjust your hidden dimension size. Typically, we scale the usual hidden width by 2/3 (e.g., from 4 times the embedding dimension down to about 8/3 of it) so that the parameter count matches the ungated FFN, ensuring we stay within our compute budget as discussed in later lessons.
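As a rough check on the sizing, count weights and ignore biases. An ungated FFN with hidden width $4d$ has $2 \cdot d \cdot 4d = 8d^2$ weights, while a SwiGLU FFN with hidden width $h$ has $3 \cdot d \cdot h$. Setting these equal gives $h = 8d/3$. The sketch below does this arithmetic and rounds the result up to an aligned size; the function names and the `multiple_of=64` default are assumptions for illustration, and the alignment your kernels actually need may differ.

```python
def swiglu_hidden_dim(d_model: int, ffn_mult: int = 4, multiple_of: int = 64) -> int:
    """Hidden width that gives a SwiGLU FFN roughly the same weight count as an
    ungated FFN of width ffn_mult * d_model, rounded up to `multiple_of`."""
    target = (2 * ffn_mult * d_model) // 3  # two-thirds of the baseline hidden width
    # Round up to the next multiple so aligned kernels can be used
    return multiple_of * ((target + multiple_of - 1) // multiple_of)


def ffn_weights(d_model: int, hidden: int, n_matrices: int) -> int:
    """Weight count (biases ignored) for n_matrices matrices of size d_model x hidden."""
    return n_matrices * d_model * hidden


d_model = 512
hidden = swiglu_hidden_dim(d_model)
baseline = ffn_weights(d_model, 4 * d_model, n_matrices=2)  # Linear -> ReLU -> Linear
gated = ffn_weights(d_model, hidden, n_matrices=3)          # SwiGLU: w1, w2, v

print(hidden)           # 1408
print(baseline, gated)  # 2097152 2162688
```

Rounding up costs about 3% extra weights here; rounding down instead would keep the gated block strictly under the baseline budget.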
If you are using fused kernels, make sure hidden_features is divisible by the factors they require.

Refactor a standard nn.Sequential block that uses nn.ReLU into a custom GatedFFN module. Use the SwiGLU implementation above and verify that the input and output dimensions remain consistent. Once implemented, run the module on a known input tensor and print the gradient of the output with respect to the gate weights to confirm the gating mechanism is active.
We’ve moved beyond static point-wise activations to dynamic gating. By implementing SwiGLU, we provide our models with the mechanism to perform conditional feature selection, significantly increasing the representational capacity of our Transformer blocks.
Up next: We will implement Multi-Head Attention, where these gating insights will prove vital for managing information flow across sequences.