What are the differences between Pre-Trained Transformer-based models like BERT, DistilBERT, XLNet, GPT, …
9 min read · May 19, 2021
This article is a cheat sheet of well-known Transformer-based models and explains what makes each one unique, even though they are all built on the same underlying architecture.
The combination of the Transformer architecture and transfer learning is dominating the Natural Language Processing world. There…