Mahamudul Hasan Rubel
Lesson 17 of the Advanced AI/ML: Deep Learning, LLMs & Production Systems course
AI/ML · June 27, 2026 · 4 min read

Quantized LoRA (QLoRA): Fine-tuning Massive Models on Consumer GPUs

Learn how to use QLoRA to fine-tune massive LLMs on consumer hardware. Master 4-bit quantization, NF4, and memory-efficient training workflows.

Tags: QLoRA, Quantization, Memory Efficiency, Fine-tuning, LLMs, Deep Learning, ai, machine-learning, python

Previously in this course, we explored Parameter-Efficient Fine-Tuning (LoRA) for Large Language Models, which introduced the concept of injecting low-rank adapters into frozen model weights. While LoRA drastically reduces the number of trainable parameters, it still requires the base model to be loaded in 16-bit precision (FP16 or BF16), which remains a significant memory bottleneck.

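In this lesson, we take that further by implementing QLoRA (Quantized LoRA). By combining 4-bit quantization with LoRA, we shrink the weight footprint of the base model by roughly 4x relative to 16-bit precision. That brings 7B to 13B parameter models within reach of a single 24 GB consumer GPU. The QLoRA paper reports fine-tuning a 65B model on a single 48 GB GPU, so 70B-class models still call for workstation-class memory.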
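To make the "roughly 4x" figure concrete, here is a back-of-the-envelope sketch of weight storage at 16-bit versus 4-bit. It counts only the weights themselves; quantization constants, adapters, optimizer state, and activations are deliberately ignored, and the helper name is ours.

```python
def weight_memory_gib(n_params: float, bits_per_param: float) -> float:
    """Memory needed to store the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 2**30


# Nominal parameter counts; real checkpoints differ slightly.
for name, n_params in [("7B", 7e9), ("13B", 13e9), ("70B", 70e9)]:
    fp16 = weight_memory_gib(n_params, 16)
    nf4 = weight_memory_gib(n_params, 4)
    print(f"{name}: 16-bit ~ {fp16:.1f} GiB, 4-bit ~ {nf4:.1f} GiB "
          f"({fp16 / nf4:.0f}x smaller)")
```

Compare the 4-bit column against your GPU's VRAM, and leave headroom for everything this estimate leaves out.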

QLoRA from First Principles

QLoRA works by freezing the pre-trained model weights and quantizing them to a 4-bit data type called NF4 (NormalFloat 4). NF4 is an information-theoretically optimal data type for normally distributed weights, which are standard in modern Transformers.

The workflow relies on two core innovations:

  1. 4-bit NormalFloat (NF4): A quantization data type whose 16 levels are placed at quantiles of a normal distribution, so that each quantization bin is expected to hold an equal number of values from the input tensor. This spends precision where the weights are densest, near zero.
  2. Double Quantization: A technique that quantizes the quantization constants themselves, saving an additional ~0.37 bits per parameter.
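The ~0.37 bits figure can be reproduced with a little arithmetic. The sketch below assumes the block sizes reported in the QLoRA paper (64 weights per block, FP32 constants, and a second level that stores the constants in 8 bits with one FP32 constant per 256 of them); these numbers are not stated in this lesson, so treat them as assumptions.

```python
WEIGHT_BLOCK = 64      # weights sharing one quantization constant (assumed)
CONSTANT_BLOCK = 256   # constants sharing one second-level constant (assumed)

# Without double quantization: one FP32 constant per block of weights.
single_overhead = 32 / WEIGHT_BLOCK

# With double quantization: one 8-bit constant per block of weights,
# plus one FP32 second-level constant per CONSTANT_BLOCK first-level constants.
double_overhead = 8 / WEIGHT_BLOCK + 32 / (WEIGHT_BLOCK * CONSTANT_BLOCK)

saving = single_overhead - double_overhead
print(f"overhead without double quant: {single_overhead:.3f} bits/param")
print(f"overhead with double quant:    {double_overhead:.3f} bits/param")
print(f"saving:                        {saving:.3f} bits/param")
```

On a model with billions of parameters, a third of a bit per parameter adds up to a few hundred megabytes.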

When you perform a forward pass in QLoRA, the weights are dequantized on-the-fly to the computation precision (usually BF16) to perform matrix multiplication. This keeps the training compute in high precision while keeping the memory footprint in low precision.
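The store-in-4-bit, compute-in-high-precision idea can be sketched in plain Python. This is a simplified illustration, not the bitsandbytes kernel: the 16-level codebook below is built from normal quantiles with the standard library and is not the exact NF4 table, and all function names are hypothetical.

```python
import random
from statistics import NormalDist


def normal_quantile_levels(k: int = 16) -> list[float]:
    """k levels at the midpoints of equal-probability bins of N(0, 1), scaled to [-1, 1]."""
    nd = NormalDist()
    quantiles = [nd.inv_cdf((i + 0.5) / k) for i in range(k)]
    scale = max(abs(q) for q in quantiles)
    return [q / scale for q in quantiles]


LEVELS = normal_quantile_levels()


def quantize_block(block: list[float]) -> tuple[list[int], float]:
    """Store each weight as a 4-bit code plus one absmax constant per block."""
    absmax = max(abs(w) for w in block) or 1.0  # avoid dividing by zero
    codes = []
    for w in block:
        normalized = w / absmax
        codes.append(min(range(len(LEVELS)), key=lambda i: abs(LEVELS[i] - normalized)))
    return codes, absmax


def dequantize_block(codes: list[int], absmax: float) -> list[float]:
    """Recover approximate weights for use in a higher-precision computation."""
    return [LEVELS[c] * absmax for c in codes]


random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(64)]  # one block of weights
x = [random.gauss(0.0, 1.0) for _ in range(64)]         # one input vector

codes, absmax = quantize_block(weights)   # what would sit in memory
restored = dequantize_block(codes, absmax)  # rebuilt on the fly for the forward pass

y_full = sum(w * xi for w, xi in zip(weights, x))
y_quant = sum(w * xi for w, xi in zip(restored, x))
print(f"dot product with original weights:    {y_full:.5f}")
print(f"dot product with dequantized weights: {y_quant:.5f}")
```

Only the integer codes and the per-block constant need to be stored; the floating-point weights exist just long enough to do the multiplication.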

Implementing QLoRA Workflows

To implement QLoRA, we use the bitsandbytes library alongside peft and transformers. The process involves defining a BitsAndBytesConfig and passing it to from_pretrained via the quantization_config argument, so the weights are quantized as the model is loaded.

Worked Example: 4-bit Loading

PYTHON
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 1. Configure the quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # Use NF4 for better precision
    bnb_4bit_compute_dtype=torch.bfloat16,  # Compute in BF16 for stability
    bnb_4bit_use_double_quant=True,         # Double quantization for extra memory savings
)

# 2. Load the base model in 4-bit
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)

# 3. Prepare the quantized model for training. Among other things, this PEFT
#    helper freezes the base weights and enables gradient checkpointing by default.
model = prepare_model_for_kbit_training(model)

# 4. Inject LoRA adapters
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # Only the LoRA adapter weights should be trainable
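As a sanity check on what print_trainable_parameters() should report, the adapter size can be worked out by hand. The sketch below assumes the published Llama-2-7B shapes (hidden size 4096, 32 decoder layers, square q_proj and v_proj matrices); those shapes are not given in this lesson, and the helper name is ours.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """LoRA adds A (r x d_in) and B (d_out x r) next to each targeted weight."""
    return r * (d_in + d_out)


HIDDEN = 4096                     # assumed hidden size of Llama-2-7B
LAYERS = 32                       # assumed number of decoder layers
R = 16                            # matches r in the LoraConfig above
TARGETS = ["q_proj", "v_proj"]    # matches target_modules above

per_module = lora_param_count(HIDDEN, HIDDEN, R)
total = per_module * len(TARGETS) * LAYERS
print(f"{per_module:,} adapter parameters per targeted module")
print(f"{total:,} trainable LoRA parameters in total")  # 8,388,608
```

That is well under 1% of the base model's parameters, which is why the adapter weights, their gradients, and their optimizer state add so little to the memory budget.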

Optimizing Memory Usage

While QLoRA handles the model weights, you must still be mindful of your activation memory. Because we are now squeezing the model into a smaller space, you might be tempted to increase your batch size, but remember that activation memory grows linearly with batch size and at least linearly with sequence length (attention score matrices grow quadratically unless a memory-efficient attention kernel is used).

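Strategy               | Memory Impact      | Trade-off
-----------------------|--------------------|-----------------------------------------------
4-bit NF4              | High reduction     | Slight loss in model quality (higher perplexity)
Double Quant           | Moderate reduction | Minimal overhead
Gradient Checkpointing | Massive reduction  | Increases compute (slower training)
Paged Optimizers       | Prevents OOMs      | Minor latency hit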

If you encounter Out-Of-Memory (OOM) errors, enable gradient_checkpointing=True in your TrainingArguments. This trades compute for memory by recomputing activations during the backward pass rather than storing them.
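A minimal sketch of memory-conscious training settings follows, written as a plain dictionary so it can be inspected without loading any library. The keys are TrainingArguments fields from transformers; the specific values (batch size, accumulation steps, learning rate, and the output directory name) are illustrative assumptions, not recommendations from this lesson.

```python
training_kwargs = dict(
    output_dir="qlora-out",            # hypothetical output directory
    per_device_train_batch_size=1,     # keep activations small per step
    gradient_accumulation_steps=16,    # recover a larger effective batch
    gradient_checkpointing=True,       # recompute activations in the backward pass
    optim="paged_adamw_8bit",          # paged optimizer from the table above
    bf16=True,                         # match bnb_4bit_compute_dtype
    learning_rate=2e-4,
    num_train_epochs=1,
)

effective_batch = (
    training_kwargs["per_device_train_batch_size"]
    * training_kwargs["gradient_accumulation_steps"]
)
print(f"effective batch size per device: {effective_batch}")

# With transformers installed, these settings would be used as:
#   from transformers import TrainingArguments
#   args = TrainingArguments(**training_kwargs)
```

Gradient accumulation is the safer lever here: it raises the effective batch size without raising peak activation memory, at the cost of more steps per update.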

Hands-on Exercise

  1. Environment Setup: Ensure you have bitsandbytes, peft, and accelerate installed.
  2. Task: Load a 7B parameter model using the BitsAndBytesConfig provided above.
  3. Challenge: Compare the VRAM usage (using torch.cuda.memory_allocated()) between loading the model in float16 vs. 4-bit NF4.
  4. Verification: Train for one epoch on a tiny sample dataset and verify that the loss decreases, confirming that gradients are flowing through the frozen 4-bit weights into the LoRA adapters.
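For the VRAM comparison in step 3, a small reporting helper keeps the two measurements consistent. This is a sketch under a few assumptions: torch is imported lazily so the helper can be defined anywhere, the function names are ours, and the numbers cover only the current CUDA device (a model spread across several GPUs by device_map="auto" would need one call per device).

```python
def bytes_to_gib(n_bytes: int) -> float:
    """Convert a byte count to GiB."""
    return n_bytes / 2**30


def report_vram(label: str):
    """Print current and peak allocated VRAM on the current CUDA device."""
    import torch  # imported here so the helper can be defined without a GPU stack

    if not torch.cuda.is_available():
        print(f"{label}: no CUDA device available")
        return None
    torch.cuda.synchronize()
    allocated = bytes_to_gib(torch.cuda.memory_allocated())
    peak = bytes_to_gib(torch.cuda.max_memory_allocated())
    print(f"{label}: allocated {allocated:.2f} GiB, peak {peak:.2f} GiB")
    return allocated, peak
```

Call report_vram("float16") after the 16-bit load and report_vram("nf4") after the 4-bit load, ideally in separate processes so that memory left over from the first load does not skew the second reading.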

Common Pitfalls

  • Compute Data Type Mismatch: Always set bnb_4bit_compute_dtype to torch.bfloat16 if your hardware supports it (Ampere architecture or newer). Using float32 will lead to significantly higher memory usage and slower training.
  • Targeting Too Many Modules: Every additional target module adds adapter weights, optimizer state, and activation memory. Start with q_proj and v_proj when memory is tight, then expand toward the remaining linear layers as your budget allows; the QLoRA paper reports that adapters on all linear layers were needed to match full fine-tuning quality.
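  • The "Frozen" Assumption: Remember that QLoRA keeps the base weights frozen; only the LoRA adapter weights should receive gradients. If your model isn't learning, call model.print_trainable_parameters() to confirm that the adapter parameters are trainable, and make sure you are not trying to set requires_grad=True on the quantized base model parameters.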
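A quick way to check the "frozen" assumption is to walk the parameters and flag anything trainable that is not an adapter weight. The sketch below is duck-typed: it works on any object exposing named_parameters(), such as a PyTorch or PEFT model, and relies on PEFT's convention of putting "lora_" in adapter parameter names. The function name is hypothetical, and counts for packed 4-bit tensors reflect their packed storage, so treat the total as indicative only.

```python
def trainable_summary(model):
    """Return (trainable_count, total_count, unexpected_trainable_names)."""
    trainable, total, unexpected = 0, 0, []
    for name, param in model.named_parameters():
        n = param.numel()
        total += n
        if param.requires_grad:
            trainable += n
            if "lora_" not in name:
                # A trainable tensor outside the adapters: the base model
                # is probably not as frozen as intended.
                unexpected.append(name)
    return trainable, total, unexpected
```

After get_peft_model, a healthy QLoRA setup should show a nonzero trainable count and an empty list of unexpected names.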

Recap

QLoRA democratizes fine-tuning by enabling the use of massive models on hardware that was previously limited to small-scale experiments. By leveraging NF4 quantization and double quantization, you reduce the memory footprint of the model weights without sacrificing the ability to adapt the model to new domains.

Up next: We will discuss how to align these fine-tuned models with human preferences in the Alignment with RLHF lesson.
