Maximum token size?
Can someone tell me what the maximum input token size is for the INSTRUCTOR model?
For ada, I believe it's 8k.
The default maximum length for the INSTRUCTOR model is 512.
```python
from InstructorEmbedding import INSTRUCTOR

# Load the pretrained INSTRUCTOR model.
model = INSTRUCTOR('hkunlp/instructor-large')

# Each input is an [instruction, text] pair.
sentence = "3D ActionSLAM: wearable person tracking in multi-floor environments"
instruction = "Represent the Science title:"
embeddings = model.encode([[instruction, sentence]])
print(embeddings)
```
Does this mean that the maximum length of `sentence` must not exceed 512 characters?
If so, should `sentence` be cut into chunks of 512 tokens each?
Yes, it is recommended to keep the input under 512 tokens. For long documents, you can split the text into chunks, as in the sketch below.
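A minimal sketch of token-based chunking. It assumes the tokenizer for `hkunlp/instructor-large` can be loaded with Hugging Face's `AutoTokenizer`, and it reuses `model` and `instruction` from the snippet above; `long_document` is a placeholder for your own text:

```python
from transformers import AutoTokenizer

# Assumption: the hub repo ships tokenizer files that AutoTokenizer can load.
tokenizer = AutoTokenizer.from_pretrained('hkunlp/instructor-large')

def chunk_by_tokens(text, max_tokens=512):
    # Tokenize once, slice the token IDs into fixed-size windows,
    # and decode each window back to a string.
    ids = tokenizer.encode(text, add_special_tokens=False)
    return [tokenizer.decode(ids[i:i + max_tokens])
            for i in range(0, len(ids), max_tokens)]

long_document = "..."  # placeholder for your own long text
chunks = chunk_by_tokens(long_document)
embeddings = model.encode([[instruction, chunk] for chunk in chunks])
```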
512 is "tokens" not "characters," right?
@jwatte Yes!
Language models have a token limit, and you should not exceed it.
You can split your text into chunks; since the limit is measured in tokens, it is a good idea to count tokens rather than characters.
See: LangChain - Split by tokens
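For illustration, a sketch with LangChain's `TokenTextSplitter`. Note it counts tiktoken tokens, which will not match INSTRUCTOR's own tokenizer exactly, so leave some headroom below 512:

```python
# Import path is for classic LangChain; newer releases move this class to
# the langchain_text_splitters package. Requires tiktoken to be installed.
from langchain.text_splitter import TokenTextSplitter

splitter = TokenTextSplitter(chunk_size=400, chunk_overlap=50)
chunks = splitter.split_text(long_document)  # long_document: your input string
print(f"split into {len(chunks)} chunks")
```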
Hello!
If I want to create one embedding for a longer document, what is the recommended way to do it?
Would it be to embed multiple chunks of 512 tokens and then average the resulting embedding vectors?
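For concreteness, a rough sketch of what I mean, assuming `chunks` is a list of strings that each fit within 512 tokens, and reusing `model` and `instruction` from above:

```python
import numpy as np

# Embed each chunk with the same instruction, then mean-pool into one vector.
chunk_embeddings = model.encode([[instruction, chunk] for chunk in chunks])
doc_embedding = chunk_embeddings.mean(axis=0)

# Averaging shrinks the vector, so re-normalize before cosine similarity.
doc_embedding /= np.linalg.norm(doc_embedding)
```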
See Evaluating the Ideal Chunk Size for a RAG System using LlamaIndex.
This blog post explains how to determine the best chunk size using LlamaIndex's Response Evaluation module.