What are the differences between Pre-Trained Transformer-based models like BERT, DistilBERT, XLNet, GPT, …
9 min read · May 19, 2021
This article is a cheat sheet of well-known Transformer-based models and explains what makes each one unique, even though they are all built on the same underlying architecture.
The combination of the Transformer architecture and transfer learning is dominating the Natural Language Processing world. There…