Update tokenizer_config.json

#6
by Xenova

Without model_max_length=512 set in tokenizer_config.json, the model breaks with HF's pipeline function when the input is longer than 512 tokens:

from transformers import pipeline
pipe = pipeline('feature-extraction', 'thenlper/gte-small')
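# HTML-heavy excerpt that tokenizes to more than 512 tokens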
input = "# 2024 Summer Olympics\n\n## The Games\\[edit\\]\n\n### Sports\\[edit\\]\n\nle\"><span>Image</span></span> Basketball<ul><li>Basketball <small>(2)</small></li><li>3×3 basketball <small>(2)</small></li></ul></li><li><span typeof=\"mw:File\"><span>Image</span></span> Boxing <small>(13)</small></li><li><span typeof=\"mw:File\"><span>Image</span></span> Breaking <small>(2)</small></li></ul></td><td><ul><li><span typeof=\"mw:File\"><span>Image</span></span> Canoeing<ul><li>Slalom <small>(6)</small></li><li>Sprint <small>(10)</small></li></ul></li><li><span typeof=\"mw:File\"><span>Image</span></span> Cycling<ul><li>BMX freestyle <small>(2)</small></li><li>BMX racing <small>(2)</small></li><li>Mountain biking <small>(2)</small></li><li>Road <small>(4)</small></li><li>Track <small>(12)</small></li></ul></li><li><span typeof=\"mw:File\"><span>Image</span></span> Equestrian<ul><li>Dressage <small>(2)</small></li><li>Eventing <small>(2)</small></li><li>Jumping <small>(2)</small></li></ul></li><li><span typeof=\"mw:File\"><span>Image</span><a"
output = pipe(input)

This fails with:

    236         if self.position_embedding_type == "absolute":
    237             position_embeddings = self.position_embeddings(position_ids)
--> 238             embeddings += position_embeddings
    239         embeddings = self.LayerNorm(embeddings)
    240         embeddings = self.dropout(embeddings)

RuntimeError: The size of tensor a (513) must match the size of tensor b (512) at non-singleton dimension 1

See https://github.com/xenova/transformers.js/issues/355 for more information. This modification was also made to the Transformers.js-compatible version of the model: https://huggingface.co/Xenova/gte-small/commit/7ca943b8ff118ce9eb87aa3a5669f26e3d633fd7
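
For reference, a minimal sketch of how the fix plays out on the Python side (assuming the feature-extraction pipeline's truncation argument, which is forwarded to the tokenizer): once model_max_length is 512, long inputs can be truncated at the model's limit instead of overflowing the position embeddings.

from transformers import pipeline

pipe = pipeline('feature-extraction', 'thenlper/gte-small')
long_text = "Basketball Boxing Breaking Canoeing Cycling " * 200  # well over 512 tokens
# With model_max_length=512 in tokenizer_config.json, truncation caps the
# encoded input at 512 tokens, so the 512-entry position-embedding table
# is never exceeded.
output = pipe(long_text, truncation=True)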
