How to use HuggingFace’s Transformers Pre-Trained Tokenizers?
Whether you are using traditional Natural Language Processing (NLP) algorithms or state-of-the-art deep neural network architectures, tokenization is the first step that lets us feed textual data to these models. Since it is not possible to feed raw strings to a model as input, we need tokenization to form a numerical representation of the data. It means splitting up a document into smaller pieces, which we will call tokens. Tokenization can operate at three levels: word level, subword level, and character level. Without going into the details of each one, a quick illustration of how they work can be seen in the following figure.
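As a complement to the figure, here is a minimal code-level sketch of the three levels (not from the original article); the "bert-base-uncased" checkpoint is just an assumed example of a pre-trained subword tokenizer:

```python
from transformers import AutoTokenizer

text = "Tokenization is fun"

# Word level: split on whitespace
word_tokens = text.split()        # ['Tokenization', 'is', 'fun']

# Character level: every character becomes a token
char_tokens = list(text)          # ['T', 'o', 'k', 'e', ...]

# Subword level: a pre-trained tokenizer may split rare words into pieces,
# typically something like ['token', '##ization', 'is', 'fun']
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
subword_tokens = tokenizer.tokenize(text)

print(word_tokens, char_tokens, subword_tokens, sep="\n")
```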
The objective of the tokenization process is to build a vocabulary of the most frequent tokens (up to a predefined size) in the whole dataset. The steps are as follows:
- Split the dataset into tokens
- Count how many times each unique token appears
- Keep the tokens that appear at least K times
It is essential to save this vocabulary so that our model sees a consistent input during both training and inference (hence, the pre-trained tokenizers). A minimal sketch of these steps is shown below.
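The following sketch (an assumed helper, not the article's code) builds a simple word-level vocabulary with a frequency cutoff `k` and saves it to disk so the same mapping can be reused at inference time:

```python
import json
from collections import Counter

def build_vocab(documents, k=2):
    # 1. Split the dataset into (word-level) tokens
    counts = Counter(token for doc in documents for token in doc.split())
    # 2./3. Keep only tokens that appeared at least k times
    tokens = [tok for tok, freq in counts.items() if freq >= k]
    # Map each surviving token to an integer id, reserving 0 for "unknown"
    return {"<unk>": 0, **{tok: i + 1 for i, tok in enumerate(tokens)}}

vocab = build_vocab(["the cat sat", "the cat ran", "a dog ran"], k=2)
# vocab == {'<unk>': 0, 'the': 1, 'cat': 2, 'ran': 3}

with open("vocab.json", "w") as f:  # persist for reuse during inference
    json.dump(vocab, f)
```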
The method that we want to focus on is Byte Pair Encoding (BPE), which is a type of subword-level tokenization. The…