1000x Smaller GPT-3/2? LoRA: Low-Rank Adaptation of Large Language Models

Ala Falaki, PhD
Jun 19, 2021

Is it possible to adapt large models such as GPT-3 (175B parameters) to downstream tasks by training only 37M parameters, and still outperform the fully fine-tuned model?

Figure 1. The overall structure of LoRA. The pre-trained weights are frozen; only A and B are trained on downstream tasks. [1]

Everyone knows there are lots of problems with the direction deep neural network models are going. Yes! It feels like they are getting larger every minute. While it is beneficial to have these pre-trained models to choose from, it is getting really hard to find enough resources to fine-tune them for any downstream task.

But what if I told you there is no need to fine-tune all of the model's parameters anymore?

I will try to explain the basic ideas of the LoRA [1] paper without going into too much detail. The whole concept builds on the work of Aghajanyan et al. [2], which showed that pre-trained language models have a low intrinsic dimension: they can still learn well even when their parameters are reparameterized in a much lower-dimensional subspace. Also, as you might know, the self-attention weight matrices are a significant contributor to the overall number of parameters in a Transformer. So, the main focus of the LoRA paper is the model's self-attention query/value weights, even though the method could be applied to any of the dense layers. A minimal sketch of this idea is shown below.
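
To make the idea concrete, here is a minimal, illustrative sketch (in PyTorch) of a linear layer augmented with a low-rank update in the spirit of LoRA. This is not the authors' implementation; the class name LoRALinear, the parameters rank and alpha, and the numeric defaults are my own choices for illustration.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A LoRA-style linear layer: frozen pre-trained weight plus a trainable low-rank update."""

    def __init__(self, in_features: int, out_features: int, rank: int = 4, alpha: float = 1.0):
        super().__init__()
        # W0: stands in for the frozen pre-trained weight (in practice, loaded from the checkpoint).
        self.weight = nn.Parameter(torch.randn(out_features, in_features), requires_grad=False)
        # A (r x d) is initialized with a random Gaussian and B (d x r) with zeros,
        # so the update B @ A starts at zero and the layer initially matches the pre-trained model.
        self.lora_A = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # h = W0 x + (alpha / r) * B A x; only lora_A and lora_B receive gradients.
        return x @ self.weight.T + (x @ self.lora_A.T) @ self.lora_B.T * self.scaling
```

Because B starts at zero, the layer behaves exactly like the frozen pre-trained model at the beginning of training, and the update BA adds only 2·d·r trainable parameters per adapted weight matrix.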

As shown in Figure 1, consider a model with input x (of size d); the blue square on the left represents the original self-attention vector with…
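
Based on the paper, the adapted forward pass shown in Figure 1 can be written as h = W₀x + BAx (scaled by α/r), where A is an r×d matrix initialized from a random Gaussian, B is a d×r matrix initialized to zero, and the rank r is much smaller than d. A hypothetical usage of the LoRALinear sketch above, with illustrative numbers (d = 768, r = 4) that are not taken from the article, would look like this:

```python
# Illustrative numbers only: hidden size d = 768, rank r = 4.
layer = LoRALinear(in_features=768, out_features=768, rank=4)
x = torch.randn(2, 768)   # a batch of two input vectors of size d
h = layer(x)              # h = W0 x + B A x (scaled); shape (2, 768)

# Only A and B are trainable: 2 * d * r = 6,144 parameters,
# versus d * d = 589,824 frozen pre-trained weights in this single layer.
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 6144
```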
