How to use [HuggingFace’s] Transformers Pre-Trained tokenizers?

Ala Falaki, PhD
5 min read · Mar 17, 2021

Whether you are using traditional Natural Language Processing (NLP) algorithms or state-of-the-art Deep Neural Network architectures, tokenization is the first step that enables us to feed textual data to these models. Because it is not possible to feed raw strings to these models as input, we need tokenization to form a numerical representation of the data. It means splitting up a document into smaller pieces, which we will call tokens. There are three levels of tokenization: Word level, Subword level, and Character level. Without going into further detail on each one, a quick illustration of how they work can be viewed in the following figure.

Basically, the character level will break a sequence down into individual characters and treat each character as a token. The word level method will see each word as a token, but the subword level is more advanced and has the ability to break down complex words like “tokenizing” into a combination of “token” + “izing”.
Figure 1. A simple figure showing each method’s output for the sample text “This is Tokenizing”.
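To make the three levels concrete, here is a minimal plain-Python sketch. The subword split is hard-coded purely for illustration; a real subword tokenizer learns these splits from data.

```python
text = "This is Tokenizing"

# Character level: every single character becomes a token.
char_tokens = list(text)

# Word level: split on whitespace, one token per word.
word_tokens = text.split()

# Subword level (illustrative only): complex words are broken into pieces,
# e.g. "Tokenizing" -> "Token" + "izing", while common words stay whole.
subword_tokens = ["This", "is", "Token", "izing"]

print(char_tokens)
print(word_tokens)
print(subword_tokens)
```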

The tokenization process’s objective is to build a vocabulary of the most frequent tokens (up to a predefined size) in the whole dataset. The steps are as follows:

  1. Split the dataset into tokens
  2. Count the number of unique tokens that appeared
  3. Pick the tokens which appeared at least K times

It is essential to save this vocabulary to have a consistent input for our model during both training and inference (hence, the pre-trained tokenizers).
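A minimal sketch of these three steps in plain Python (the toy corpus and the threshold K are made up for illustration, and whitespace splitting stands in for a real tokenizer):

```python
from collections import Counter

# A toy corpus; a real dataset would be much larger.
corpus = ["this is tokenizing", "tokenizing is the first step", "this is a token"]

# Step 1: split the dataset into tokens (simple word-level split here).
tokens = [token for text in corpus for token in text.split()]

# Step 2: count how many times each unique token appeared.
counts = Counter(tokens)

# Step 3: pick the tokens that appeared at least K times.
K = 2
vocabulary = {token for token, count in counts.items() if count >= K}
print(vocabulary)  # {'this', 'is', 'tokenizing'}
```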

The method that we want to focus on is Byte Pair Encoding (BPE), which is a type of subword-level tokenization. The…
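As a quick illustration of where this is heading, the transformers library ships pre-trained BPE-based tokenizers; GPT-2’s is one example. The model name "gpt2" and the sample sentence below are just placeholders, not taken from the article:

```python
from transformers import GPT2Tokenizer

# Load the pre-trained (byte-level BPE) tokenizer and its saved vocabulary.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

text = "This is Tokenizing"
tokens = tokenizer.tokenize(text)  # subword pieces as strings
ids = tokenizer.encode(text)       # numerical IDs ready to feed a model

print(tokens)
print(ids)
```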
