Max tokens
Thanks for sharing this model with the community.
What's the max number of tokens that can be embedded with this? I noticed that it logs "max_seq_length 512" every time the model is loaded. Is that 512 characters?
Thanks a lot for your interest in our INSTRUCTOR model!
The limit is counted in tokens rather than characters: by default, the maximum sequence length is 512 tokens. For changing the maximum sequence length, you may refer to https://github.com/HKUNLP/instructor-embedding/issues/12.
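For reference, here is a minimal sketch of raising the limit, assuming INSTRUCTOR exposes the standard sentence-transformers max_seq_length attribute (see the linked issue for details):

from InstructorEmbedding import INSTRUCTOR

model = INSTRUCTOR('hkunlp/instructor-large')
model.max_seq_length = 1024  # assumption: raise the token limit beyond the default 512
embeddings = model.encode([['Represent the document for retrieval:', 'a long document ...']])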
Hope this helps! Feel free to add any further questions or comments!
Thanks for the link. That helped answer a number of questions I had.
What's the tokenizer I should use if I were to chunk a long text before generating embeddings? I skimmed through the code and found references to AutoTransformer and T5. So will something like the following work?
from transformers import T5Tokenizer
from langchain.text_splitter import RecursiveCharacterTextSplitter

TOKENIZER = T5Tokenizer.from_pretrained('t5-large', model_max_length=512)
SPLITTER = RecursiveCharacterTextSplitter.from_huggingface_tokenizer(TOKENIZER, chunk_size=512, chunk_overlap=0)
Hi, Thanks a lot for your comments!
The recommended tokenizer for calculating the sequence length would be the INSTRUCTOR tokenizer. For example:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('hkunlp/instructor-large') # initialize the INSTRUCTOR tokenizer
text = "Hello, world!"
text_length = len(tokenizer(text))
print(text_length)
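If you want to keep the RecursiveCharacterTextSplitter from your snippet, a sketch of the same idea with the INSTRUCTOR tokenizer plugged in (assuming the LangChain helper you referenced) would be:

from transformers import AutoTokenizer
from langchain.text_splitter import RecursiveCharacterTextSplitter

tokenizer = AutoTokenizer.from_pretrained('hkunlp/instructor-large')
# you may want a smaller chunk_size to leave room for the instruction tokens, which also count toward the limit
splitter = RecursiveCharacterTextSplitter.from_huggingface_tokenizer(tokenizer, chunk_size=512, chunk_overlap=0)
chunks = splitter.split_text("your long document text here")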
Hope this helps! Feel free to add any further questions or comments!
Very glad I found this thread. Is there any way to easily turn on a truncation warning? I have text that I'm chunking, but it can have large variations in token length.
Hi, Thanks a lot for your comments!
The recommended tokenizer for calculating the sequence length would be the INSTRUCTOR tokenizer. For example:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('hkunlp/instructor-large')  # initialize the INSTRUCTOR tokenizer
text = "Hello, world!"
text_length = len(tokenizer(text))
print(text_length)
Hope this helps! Feel free to add any further questions or comments!
Small fix: tokenizer(text) returns a dict-like encoding, so len() on it counts the keys rather than the tokens. Index into input_ids to get the token count:
text_length = len(tokenizer(text)['input_ids'])
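On the truncation-warning question above: I don't know of a built-in switch for this, but a simple workaround (just a sketch using the same tokenizer, not an official INSTRUCTOR feature) is to count tokens yourself and warn before encoding:

import warnings
from transformers import AutoTokenizer

MAX_TOKENS = 512  # the default max_seq_length mentioned above
tokenizer = AutoTokenizer.from_pretrained('hkunlp/instructor-large')

def warn_if_truncated(text, instruction=''):
    # rough check: INSTRUCTOR concatenates the instruction with the text, so include it here too
    n_tokens = len(tokenizer(instruction + text)['input_ids'])
    if n_tokens > MAX_TOKENS:
        warnings.warn(f'Input is {n_tokens} tokens; everything past {MAX_TOKENS} will be truncated.')
    return n_tokens

warn_if_truncated('your long document text here', instruction='Represent the document for retrieval:')

The instruction string here is only an example; use whatever instruction you actually pass to encode.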