Q-GaLore: Pre-Train 7B Parameter LLMs from Scratch on a 16 GB GPU
Start now, get your model in 50 years!
Pre-training large language models (LLMs) from scratch is extremely expensive: it requires data-center GPUs with large amounts of memory running for many weeks. To make pre-training possible on consumer hardware, we must first reduce memory consumption.
A few months ago, I presented GaLore, a method that projects gradients into low-rank subspaces to reduce the memory footprint of the optimizer states. With GaLore, full fine-tuning and pre-training from scratch of 7B parameter LLMs are possible on a 32 GB GPU (or a 24 GB GPU with layer-wise updates).
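To make the idea more concrete, here is a minimal sketch of what low-rank gradient projection looks like for a single weight matrix. The hyperparameters (`rank`, `update_gap`) and the plain Adam update are my own illustrative assumptions, and the sketch skips details such as bias correction and how the official implementation handles subspace switches; it is not GaLore's actual code.

```python
import torch

# Hypothetical hyperparameters for illustration (not the paper's defaults).
rank, update_gap, lr, beta1, beta2, eps = 128, 200, 1e-2, 0.9, 0.999, 1e-8

@torch.no_grad()
def galore_step(weight, grad, state, step):
    """One optimizer step with GaLore-style low-rank gradient projection."""
    # Refresh the projection matrix from the SVD of the current gradient
    # every `update_gap` steps; reuse it in between.
    if step % update_gap == 0 or "P" not in state:
        U, _, _ = torch.linalg.svd(grad, full_matrices=False)
        state["P"] = U[:, :rank]  # (m, r) projector onto the dominant gradient subspace
        state.pop("m", None)      # reset Adam moments when the subspace changes
        state.pop("v", None)
    P = state["P"]
    r_grad = P.T @ grad  # project the (m, n) gradient down to (r, n)

    # The Adam moments live in the low-rank space: this is where the memory saving comes from.
    if "m" not in state:
        state["m"], state["v"] = torch.zeros_like(r_grad), torch.zeros_like(r_grad)
    state["m"] = beta1 * state["m"] + (1 - beta1) * r_grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * r_grad ** 2
    update = state["m"] / (state["v"].sqrt() + eps)

    weight -= lr * (P @ update)  # project the low-rank update back to full size
```

The key point is that the optimizer states `m` and `v` are stored with shape (r, n) instead of (m, n), which is what shrinks the optimization memory compared with standard Adam.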
However, with GaLore the model itself still consumes a significant amount of memory. An 8B model such as Llama 3 occupies more than 16 GB of GPU RAM. If we could quantize it during pre-training, it would significantly reduce memory consumption and unlock pre-training of 8B models on a 16 GB GPU.
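As a rough back-of-the-envelope check (my own arithmetic, not figures from the paper), the weights alone of an 8B model already fill nearly all of a 16 GB card in BF16, while 8-bit weights would leave room for gradients, optimizer states, and activations:

```python
params = 8e9  # roughly the parameter count of Llama 3 8B

for name, bytes_per_param in [("BF16", 2), ("INT8", 1)]:
    gib = params * bytes_per_param / 1024**3
    print(f"{name}: {gib:.1f} GiB for the weights alone")
# BF16: ~14.9 GiB -> almost no headroom left on a 16 GB GPU
# INT8:  ~7.5 GiB -> headroom for gradients, optimizer states, and activations
```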
This is what Q-GaLore does. The method applies GaLore to quantized LLMs for pre-training from scratch. It also reduces the memory consumption of fine-tuning by up to 50% compared to LoRA and consistently outperforms QLoRA with bitsandbytes.
In this article, we will review Q-GaLore. We will see how it works and how it significantly reduces memory consumption, and I will explain all of the method's hyperparameters/arguments, one by one. While reducing memory consumption is important, the added quantization/dequantization operations might significantly slow down training. We will check how long it would take, with this approach, to train 130M, 3B, and 7B parameter LLMs (Llama architecture) on consumer GPUs (RTX 4090 24 GB, RTX 3090 24 GB, and RTX 4080 Super 16 GB; provided by RunPod (referral link)).
The pre-training code and logs for the 130M, 3B, and 7B parameter LLMs on 16 GB and 24 GB GPUs are provided in this notebook: