How to make the vocab.txt file?

#5
by honzabonanza - opened

I would like to add some models we are using to this repo. The only thing I am not clear on is where the vocab.txt file comes from.
One of the models I would like to add is: https://huggingface.co/intfloat/multilingual-e5-base/tree/main

never mind, I found the model among the already supported models, it just did not have the intfloat prefix :)

but still more documentation about the vocab.txt would be helpful

Typesense org

@honzabonanza vocab.txt is the vocabulary file that BERT models use to tokenize inputs. Every model has its own vocabulary file, so it should already be in the files section of the model's Hugging Face repo.

For example, check the files of intfloat/e5-small:
https://huggingface.co/intfloat/e5-small/tree/main
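
To make the role of vocab.txt a bit more concrete, here is a minimal sketch of how a BERT-style WordPiece tokenizer could use it. It assumes the usual layout of one token per line, with the token id being the zero-based line number. The toy vocabulary and the helper names `load_vocab` and `wordpiece` are made up for illustration; this is not the code Typesense or the transformers library actually runs, and it skips lowercasing, punctuation splitting and special tokens.

```python
from typing import Dict, List


def load_vocab(lines: List[str]) -> Dict[str, int]:
    """Map each token to its id; the id is the zero-based line number."""
    return {line.rstrip("\n"): idx for idx, line in enumerate(lines)}


def wordpiece(word: str, vocab: Dict[str, int], unk: str = "[UNK]") -> List[str]:
    """Greedy longest-match-first split of one word into WordPiece tokens."""
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word carry the "##" prefix.
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits: the whole word becomes the unknown token.
            return [unk]
        pieces.append(match)
        start = end
    return pieces


# Toy stand-in for the lines of a real vocab.txt (hypothetical content).
toy_vocab_lines = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "token", "##ize", "##r", "vocab"]
vocab = load_vocab(toy_vocab_lines)

tokens = wordpiece("tokenizer", vocab)
ids = [vocab[t] for t in tokens]
print(tokens, ids)
```

With a real model you would read the lines from the downloaded vocab.txt instead of the toy list. The point is that the ids the model sees depend entirely on that file, which is why each model needs its own.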

However, the model you want to use is not a BERT model. It uses SentencePiece as its tokenizer, so its vocab file is sentencepiece.bpe.model. You can see this file in the model's files:
https://huggingface.co/intfloat/multilingual-e5-base/tree/main

For more information on this topic:
https://huggingface.co/docs/transformers/tokenizer_summary
