To preserve the DistilBertTokenizer, you can save it to a directory using the Hugging Face Transformers save_pretrained()
method (it is part of the Transformers library, not PyTorch itself). Here is an example:
from transformers import DistilBertTokenizer
# Load the pretrained tokenizer and its vocabulary
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
train_texts = [...] # list of training texts
# Tokenize the training texts
train_encodings = tokenizer(train_texts, truncation=True, padding=True)
# Save the tokenizer
tokenizer.save_pretrained('/path/to/tokenizer')
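If you want to confirm what was written, you can list the directory; as a rough sketch, the default DistilBertTokenizer typically writes tokenizer_config.json, special_tokens_map.json, and vocab.txt (the exact file set can vary by Transformers version):
import os
# Inspect the files save_pretrained() wrote to the target directory
print(sorted(os.listdir('/path/to/tokenizer')))
# e.g. ['special_tokens_map.json', 'tokenizer_config.json', 'vocab.txt']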
The save_pretrained() method saves the tokenizer configuration file and the vocabulary file to the specified directory. You can then load the tokenizer later using the from_pretrained() method:
from transformers import DistilBertTokenizer
tokenizer = DistilBertTokenizer.from_pretrained('/path/to/tokenizer')
This will load the tokenizer configuration and vocabulary from the directory, allowing you to tokenize new texts using the same vocabulary and settings as before.
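As a quick sanity check (a minimal sketch; the sample sentence below is just an illustration), you can verify that the reloaded tokenizer produces exactly the same encodings as the original:
from transformers import DistilBertTokenizer
original = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
reloaded = DistilBertTokenizer.from_pretrained('/path/to/tokenizer')
sample = "This is a test sentence."
# The reloaded tokenizer should encode text identically to the original
assert original(sample)['input_ids'] == reloaded(sample)['input_ids']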