Update tokenizer_config.json
#6 by Xenova - opened
Without setting model_max_length=512, the model breaks with HF's pipeline function when the input is longer than 512 tokens:
```python
from transformers import pipeline

pipe = pipeline('feature-extraction', 'thenlper/gte-small')
input = "# 2024 Summer Olympics\n\n## The Games\\[edit\\]\n\n### Sports\\[edit\\]\n\nle\"><span>Image</span></span> Basketball<ul><li>Basketball <small>(2)</small></li><li>3×3 basketball <small>(2)</small></li></ul></li><li><span typeof=\"mw:File\"><span>Image</span></span> Boxing <small>(13)</small></li><li><span typeof=\"mw:File\"><span>Image</span></span> Breaking <small>(2)</small></li></ul></td><td><ul><li><span typeof=\"mw:File\"><span>Image</span></span> Canoeing<ul><li>Slalom <small>(6)</small></li><li>Sprint <small>(10)</small></li></ul></li><li><span typeof=\"mw:File\"><span>Image</span></span> Cycling<ul><li>BMX freestyle <small>(2)</small></li><li>BMX racing <small>(2)</small></li><li>Mountain biking <small>(2)</small></li><li>Road <small>(4)</small></li><li>Track <small>(12)</small></li></ul></li><li><span typeof=\"mw:File\"><span>Image</span></span> Equestrian<ul><li>Dressage <small>(2)</small></li><li>Eventing <small>(2)</small></li><li>Jumping <small>(2)</small></li></ul></li><li><span typeof=\"mw:File\"><span>Image</span><a"
output = pipe(input)
```
which outputs:
```
    236 if self.position_embedding_type == "absolute":
    237     position_embeddings = self.position_embeddings(position_ids)
--> 238 embeddings += position_embeddings
    239 embeddings = self.LayerNorm(embeddings)
    240 embeddings = self.dropout(embeddings)

RuntimeError: The size of tensor a (513) must match the size of tensor b (512) at non-singleton dimension 1
```
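With model_max_length recorded in tokenizer_config.json, callers can opt into truncation and stay within the model's 512 position embeddings. As a sketch of the user-side workaround in the meantime (this uses the feature-extraction pipeline's tokenize_kwargs option, and passes max_length=512 explicitly as an assumption in case an older revision of the config is cached):

```python
from transformers import pipeline

# Sketch: truncate long inputs to the model's 512-token limit up front.
pipe = pipeline(
    'feature-extraction',
    'thenlper/gte-small',
    tokenize_kwargs={'truncation': True, 'max_length': 512},
)

long_text = "word " * 2000   # comfortably more than 512 tokens
output = pipe(long_text)     # truncated instead of raising the size mismatch
```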
See https://github.com/xenova/transformers.js/issues/355 for more information. This modification was also made to the Transformers.js-compatible version of the model: https://huggingface.co/Xenova/gte-small/commit/7ca943b8ff118ce9eb87aa3a5669f26e3d633fd7
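A quick way to sanity-check the updated config locally (a sketch, assuming the revision with this change is the one that gets downloaded):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('thenlper/gte-small')

# With model_max_length=512 in tokenizer_config.json, the tokenizer reports
# the model's real limit instead of a very large default.
print(tokenizer.model_max_length)  # expected: 512

# Truncation then caps any input at that limit.
ids = tokenizer("word " * 2000, truncation=True)['input_ids']
assert len(ids) <= 512
```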