Set 1024 as default dim, update usage snippets, store prompts in config
Hello @infgrad!
Pull Request overview
- Update default dimension to 1024
- Update usage snippets (separate into 2, reformatting slightly)
- Add prompts to Sentence Transformers config so they're easier to use
Preface
First of all, congratulations on the release! We've already refreshed MTEB and your new models score excellently!
I'd love to learn more about:
- your training approach. If it's a novel loss, then I might want to add it to Sentence Transformers directly.
- your training data. Is it synthetic data? Or existing datasets? Or something new? A combination of these?
Also, very impressive to get these models (and infgrad/stella_en_1.5B_v5 especially) out so quickly. Alibaba-NLP/gte-Qwen2-1.5B-instruct is only 2 weeks old!
Details
First of all, from my open source work I recognize that the large majority of users will always stick with the default options for some software, including models. As a result, your choice of default output dimensionality has a big impact on what dimensionality people will use. I think that 8192 is simply too large for computing downstream tasks efficiently. You mention yourself that 1024 only loses 0.001 compared to 8192, so I think you should use that as the default. By the way, I quite like that advanced users can adapt this model to their liking!
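To illustrate what that default means in practice, here's a minimal sketch (the model name and the `trust_remote_code=True` flag are my assumptions based on this repository; adjust as needed):

```python
from sentence_transformers import SentenceTransformer

# Assumption: this PR targets infgrad/stella_en_400M_v5 and the model
# requires trust_remote_code=True; adjust both if that's not the case.
model = SentenceTransformer("infgrad/stella_en_400M_v5", trust_remote_code=True)

embeddings = model.encode(["What is the capital of France?"])
# With the proposed default this prints (1, 1024) rather than (1, 8192).
print(embeddings.shape)
```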
As for the consequences this has for MTEB: I'm open to providing 2 or 3 MTEB scores per model for different output dimensionalities (e.g. 256, 1024, 8192). We've done this before for OpenAI's text-embedding-3-large as well as nomic-embed-text-v1.5.
I've also simplified the usage snippets so they're more "copy-pasteable" than before. The Sentence Transformers snippet should now work out of the box, and the Transformers one works after cloning the model and updating the `model_dir`. I think this should also help with adoption, just like the better default dim. Another change is that for Sentence Transformers, I recommend that users update the `"path"` in `modules.json` from e.g. `2_Dense_1024` to e.g. `2_Dense_256`, so people don't have to move any files around; see the sketch below.
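For reference, a sketch of what `modules.json` would look like after pointing the Dense module at the 256-dim weights (the `idx`/`name` values and the first two entries follow the standard Sentence Transformers module list and are illustrative, not copied from this repository):

```json
[
  { "idx": 0, "name": "0", "path": "", "type": "sentence_transformers.models.Transformer" },
  { "idx": 1, "name": "1", "path": "1_Pooling", "type": "sentence_transformers.models.Pooling" },
  { "idx": 2, "name": "2", "path": "2_Dense_256", "type": "sentence_transformers.models.Dense" }
]
```

Only the `"path"` of the last entry changes (from `2_Dense_1024` to `2_Dense_256`); everything else stays as-is.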
Lastly, I've added the s2p and s2s prompts to `config_sentence_transformers.json`. Big thanks for proposing 2 concrete prompts for people to use; I think that'll be very helpful. The Sentence Transformers `encode` method accepts a `prompt` argument for providing a prompt string directly, or a `prompt_name` argument for referencing a prompt stored in `config_sentence_transformers.json`. I've used the latter in the usage snippet.
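As a quick sketch of both options (the prompt name `"s2p_query"` and the prompt text below are my assumptions for illustration; the real names are whatever ends up in the config):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("infgrad/stella_en_400M_v5", trust_remote_code=True)

# Option 1: reference a prompt stored in config_sentence_transformers.json.
# Assumption: the s2p prompt is registered under a name like "s2p_query".
embeddings = model.encode(["What are some ways to reduce stress?"], prompt_name="s2p_query")

# Option 2: pass the prompt string directly.
# (Illustrative prompt text, not the exact prompt from this repository.)
embeddings = model.encode(
    ["What are some ways to reduce stress?"],
    prompt="Instruct: Retrieve relevant passages.\nQuery: ",
)
```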
Please let me know if you'd like me to change anything in this PR or if you only want to merge part of it. If you like these changes, then I can also implement them for https://huggingface.co/infgrad/stella_en_1.5B_v5.
- Tom Aarsen