Master LoRA to fine-tune massive models on limited hardware. Learn to inject adapters, tune rank and alpha, and optimize parameter efficiency for production.
Previously in this course, we explored Fine-tuning Methodologies Overview: Strategies for LLM Adaptation, which established why full fine-tuning is often prohibitive for large-scale models. In this lesson, we move from theory to implementation by mastering Low-Rank Adaptation (LoRA), the industry-standard approach to Parameter-Efficient Fine-Tuning (PEFT).
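In traditional fine-tuning, we update all $d \times d$ parameters in a weight matrix $W$. Every one of those parameters needs a gradient and, with an optimizer such as Adam, two additional moment estimates, so for a model with billions of parameters the training state alone takes several times the VRAM of the weights themselves.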
LoRA relies on the hypothesis that the update to the weights during adaptation, $\Delta W$, has a low "intrinsic rank." Instead of training $W$ directly, we freeze the pre-trained weights and inject two smaller matrices, $A$ and $B$, such that: $$W' = W + \Delta W = W + BA$$ where $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times d}$, with the rank $r \ll d$. The trainable parameter count for that matrix drops from $d^2$ to $2dr$, so keeping $r$ small (e.g., 8 or 16) reduces it by orders of magnitude when $d$ is in the thousands.
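To make the savings concrete, here is a small arithmetic sketch for a single square weight matrix. The helper name `lora_param_counts` and the choice of $d = 4096$ are illustrative assumptions, not values taken from any particular model.

```python
from typing import Tuple


def lora_param_counts(d: int, r: int) -> Tuple[int, int]:
    """Return (full, lora) trainable-parameter counts for one d x d matrix."""
    full = d * d          # full fine-tuning: every entry of W is trainable
    lora = d * r + r * d  # LoRA: B is d x r and A is r x d
    return full, lora


# Hypothetical example: d = 4096, r = 8
full, lora = lora_param_counts(d=4096, r=8)
print(full, lora, full // lora)  # 16777216 65536 256
```

With these assumed sizes, the adapter trains 256 times fewer parameters than the full matrix, and the gap widens as $d$ grows or $r$ shrinks.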
To implement LoRA, we target the projection layers within the Transformer blocks—typically the Query ($W_q$) and Value ($W_v$) projections in the self-attention mechanism.
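As a rough sketch of what "targeting" means in practice, the snippet below selects the query and value projections of a toy attention module by attribute name. The class `ToyAttention` and the names `q_proj`, `k_proj`, `v_proj`, and `o_proj` are hypothetical; real models use their own naming schemes, so inspect `named_modules()` on your model before relying on any particular suffix.

```python
import torch.nn as nn


class ToyAttention(nn.Module):
    """Hypothetical attention block that only holds its four projections."""

    def __init__(self, d_model: int = 32):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.o_proj = nn.Linear(d_model, d_model)


# Assumed naming convention for the layers that should receive adapters
TARGETS = ("q_proj", "v_proj")

block = ToyAttention()

# Collect the names of linear layers whose last path component is a target
selected = [
    name
    for name, module in block.named_modules()
    if isinstance(module, nn.Linear) and name.split(".")[-1] in TARGETS
]
print(selected)  # ['q_proj', 'v_proj']
```

Each selected module would then be replaced by an adapter-wrapped version, while the key and output projections stay untouched.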
When we perform a forward pass, the output $h$ becomes: $$h = Wx + BAx$$ During backpropagation, we only calculate gradients for $A$ and $B$, keeping $W$ locked. This drastically reduces the memory footprint, as we no longer store gradients or optimizer states for the original weights.
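The gradient behaviour can be checked directly with raw tensors. This is a minimal sketch with made-up sizes ($d = 16$, $r = 4$) and a toy squared-sum loss; it only illustrates that autograd skips the frozen $W$.

```python
import torch

torch.manual_seed(0)
d, r = 16, 4  # illustrative sizes, not from a real model

W = torch.randn(d, d)                      # frozen: requires_grad is False by default
A = torch.randn(r, d, requires_grad=True)  # trainable low-rank factor
B = torch.zeros(d, r, requires_grad=True)  # trainable, zero at the start

x = torch.randn(d)

# h = Wx + BAx, computed as B @ (A @ x) so the d x d product BA is never formed
h = W @ x + B @ (A @ x)

loss = h.pow(2).sum()  # toy loss, only there to drive backpropagation
loss.backward()

print(W.grad)                      # None: no gradient is stored for W
print(A.grad.shape, B.grad.shape)  # gradients exist only for A and B
```

Because $B$ starts at zero, `h` equals `W @ x` on this first pass, and the gradient reaching `A` is zero until `B` has moved away from zero.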
Using PyTorch, we can wrap a standard linear layer so that it carries a LoRA adapter. Note how we initialize $A$ with Kaiming uniform and $B$ as zeros: since $BA = 0$ at the start of training, the adapter contributes nothing and the wrapped layer initially reproduces the original layer's output exactly.
```python
import torch
import torch.nn as nn


class LoRALayer(nn.Module):
    def __init__(self, original_layer, rank=8, alpha=16):
        super().__init__()
        self.original_layer = original_layer
        self.rank = rank
        self.alpha = alpha

        # in_features and out_features from the original layer
        d_in = original_layer.in_features
        d_out = original_layer.out_features

        # Low-rank matrices
        self.lora_A = nn.Parameter(torch.zeros(rank, d_in))
        self.lora_B = nn.Parameter(torch.zeros(d_out, rank))

        # Scaling factor: alpha / rank
        self.scaling = self.alpha / self.rank

        # Initialize
        nn.init.kaiming_uniform_(self.lora_A, a=5**0.5)
        nn.init.zeros_(self.lora_B)

        # Freeze original weights
        for param in self.original_layer.parameters():
            param.requires_grad = False

    def forward(self, x):
        # Wx + (BA)x * scaling
        original_out = self.original_layer(x)
        lora_out = (x @ self.lora_A.T @ self.lora_B.T) * self.scaling
        return original_out + lora_out
```
The rank ($r$) dictates the capacity of the adapter. A higher rank allows the model to learn more complex nuances but increases the parameter count.
The alpha ($\alpha$) parameter sets a constant scaling factor: the adapter's output is multiplied by $\alpha / r$, which controls the "influence" of the adapter on the base model's output. Because the multiplier shrinks as $r$ grows, you generally scale $\alpha$ proportionally when you increase the rank (often $\alpha = 2r$ is a good starting point).
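The effect of tying $\alpha$ to the rank is easy to see numerically. The sketch below uses the same $\alpha / r$ scaling as the layer defined earlier; the helper name `lora_scaling` and the specific ranks are illustrative assumptions.

```python
def lora_scaling(alpha: float, rank: int) -> float:
    """Multiplier applied to the adapter output: alpha / rank."""
    return alpha / rank


for rank in (4, 8, 16, 64):
    fixed = lora_scaling(alpha=16, rank=rank)               # alpha held at 16
    proportional = lora_scaling(alpha=2 * rank, rank=rank)  # alpha = 2r
    print(f"r={rank:>2}  alpha=16 -> {fixed:.2f}   alpha=2r -> {proportional:.2f}")
```

With $\alpha$ fixed at 16 the multiplier falls from 4.00 at $r = 4$ to 0.25 at $r = 64$, whereas $\alpha = 2r$ keeps it at 2.00 for every rank, so changing the rank does not silently change how strongly the adapter is weighted.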
| Strategy | Memory Usage | Training Speed | Expressivity |
|---|---|---|---|
| Full Fine-Tuning | Very High | Slow | High |
| LoRA (r=8) | Low | Fast | Moderate |
| LoRA (r=64) | Moderate | Moderate | High |
- Use the `LoRALayer` above to target all linear layers in a simple `nn.TransformerEncoderLayer`.
- Count the trainable parameters with `sum(p.numel() for p in model.parameters() if p.requires_grad)`.
- Compare `rank=4` vs `rank=64` on a small dataset. Observe whether the loss convergence differs significantly.

If you forget to set `requires_grad = False` on the original weights, you are still training the full model, which defeats the purpose.

LoRA provides a mathematically elegant way to perform parameter-efficient fine-tuning by decomposing weight updates into low-rank matrices. By choosing the right rank and alpha, you can achieve performance comparable to full fine-tuning while significantly reducing VRAM requirements. This is the cornerstone of modern LLM adaptation pipelines.
Up next: Quantized LoRA (QLoRA), where we push memory efficiency further by compressing the base model to 4-bit precision.