Text context length?
What's the text context length for jina-clip-v1?
Could be 8192? That's what the tokenizer config says anyways.
https://huggingface.co/jinaai/jina-clip-v1/blob/1bae0621529ced998c73bca234a8cb9da997f33c/tokenizer_config.json#L49
For comparison, CLIP patch32 is 77 tokens.
Curious to know where this stands.
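For anyone who wants to check the config value locally, here is a minimal sketch. It assumes the limit is stored under `model_max_length`, the usual key in Hugging Face tokenizer configs; I have not verified that this is the field at the linked line. The sample JSON is an illustrative stand-in, not the real file contents.

```python
import json
from typing import Optional


def read_model_max_length(config_text: str) -> Optional[int]:
    """Return `model_max_length` from tokenizer-config JSON text, or None if absent."""
    config = json.loads(config_text)
    return config.get("model_max_length")


# Illustrative stand-in (assumption), using the 8192 value mentioned above.
sample_config = '{"model_max_length": 8192}'
print(read_model_max_length(sample_config))
```

Note that this only reports the tokenizer's configured limit; it says nothing about the sequence length the model was actually trained on.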
Ah, scratch that, the paper (https://arxiv.org/pdf/2405.20204) states:
"For stage 2, Ctext pairs is used again. However, text values are truncated to 512 tokens in this case, and as a result a smaller batch size of 8,192 is used."
So looks like it's 512, if I'm reading that right?
I had the same question.
The largest text length that is well aligned with images during training seems to be 512. However, this might generalize further, for example if the third and last stage of finetuning allows for longer text-only sequences (unfortunately, the paper doesn't mention this). It might also generalize weakly simply because the initial BERT model supported longer input texts (8192 it seems, per the config file), but this would have to be tested.
I would love to get some clarity on that. Any thoughts, @gmastrapas or @bwang0911?
Hi all, our backbone model JinaBERT supports very long sequences (we say up to 8192, but it should be unlimited).
We contrastively train the model with a sequence length of 512 on embedding tasks, but this does not mean that the model can only handle 512 tokens. It should be able to handle much longer sequences, same as jina-embeddings-v2.
However, our experience tells us the best sequence length for sentence embeddings is around 512-1000 tokens. My suggestion is to keep documents below 1000 tokens, but it will definitely work well beyond 1000.
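To make the suggestion above concrete, here is a minimal sketch of keeping documents within a token budget, either by truncating or by splitting into chunks. The helper names (`simple_tokenize`, `truncate_tokens`, `chunk_tokens`) are hypothetical, and the whitespace tokenizer is only a stand-in: real counts from the model's own tokenizer will differ, so treat the numbers as approximate.

```python
from typing import List

# Upper end of the 512-1000 range suggested above.
RECOMMENDED_MAX_TOKENS = 1000


def simple_tokenize(text: str) -> List[str]:
    """Stand-in tokenizer (assumption): splits on whitespace only."""
    return text.split()


def truncate_tokens(tokens: List[str], max_tokens: int = RECOMMENDED_MAX_TOKENS) -> List[str]:
    """Keep only the first `max_tokens` tokens."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    return tokens[:max_tokens]


def chunk_tokens(tokens: List[str], max_tokens: int = 512) -> List[List[str]]:
    """Split tokens into consecutive chunks of at most `max_tokens` tokens."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    return [tokens[i:i + max_tokens] for i in range(0, len(tokens), max_tokens)]
```

For a 2500-token document, `truncate_tokens` keeps the first 1000 tokens, while `chunk_tokens` with the default budget yields five chunks that could each be embedded separately. Whether truncation or chunking works better will depend on the retrieval task.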