microsoft/layoutxlm-base · Add `tokenizer_class` to `config.4.13.0.json`

May 25, 2022

Hi 😀!

I recently noticed that:

from transformers import LayoutXLMProcessor

processor = LayoutXLMProcessor.from_pretrained("microsoft/layoutxlm-base")

was logging the following message:

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'LayoutLMv2Tokenizer'. 
The class this function is called from is 'LayoutXLMTokenizerFast'.

and

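from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/layoutxlm-base")
# Print the class of the loaded tokenizer, not the built-in `type`
print(type(tokenizer))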

is printing

transformers.models.layoutlmv2.tokenization_layoutlmv2_fast.LayoutLMv2TokenizerFast

I think this is because the tokenizer class is not specified in the configuration file, so the tokenizer falls back to the default for the model type, i.e. LayoutLMv2.
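
As a rough illustration of the suspected fallback, here is a simplified sketch. It is not the actual `transformers` resolution code: the function `resolve_tokenizer_class` and the mapping `DEFAULT_TOKENIZER_BY_MODEL_TYPE` are hypothetical names, and both the `model_type` value and the `tokenizer_class` value shown are assumptions made for the example.

```python
# Simplified illustration only; NOT the real transformers implementation.
# The mapping and function below are hypothetical names for this sketch.
DEFAULT_TOKENIZER_BY_MODEL_TYPE = {
    "layoutlmv2": "LayoutLMv2Tokenizer",
}


def resolve_tokenizer_class(config: dict) -> str:
    """Return the explicit tokenizer_class if set, else the model type's default."""
    explicit = config.get("tokenizer_class")
    if explicit is not None:
        return explicit
    return DEFAULT_TOKENIZER_BY_MODEL_TYPE[config["model_type"]]


# Assumed config contents before and after the proposed change.
config_before = {"model_type": "layoutlmv2"}
config_after = {"model_type": "layoutlmv2", "tokenizer_class": "LayoutXLMTokenizer"}

print(resolve_tokenizer_class(config_before))
print(resolve_tokenizer_class(config_after))
```

With no explicit entry the sketch falls back to the LayoutLMv2 tokenizer, which matches the behaviour described above; with the entry present, the LayoutXLM tokenizer is selected.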

What do you think?

Add `tokenizer_class` to `config.4.13.0.json` (43d8d4e8)

nielsr changed pull request status to merged May 31, 2022

nielsr

May 31, 2022

Thanks for fixing this!