Fix incorrect model_max_length: set it to 1024 tokens
Currently, the field `model_max_length` is set to 1000000000000000019884624838656 tokens (the placeholder `int(1e30)` that transformers uses when no maximum length is configured), which is incorrect. As a result, when this model is used in a pipeline, automatic truncation cannot be enabled once the input length is exceeded, and an error is thrown such as `RuntimeError: The expanded size of the tensor (<SOME NUMBER LARGER THAN 1024>) must match the existing size (1024) at non-singleton dimension 1. Target sizes: [1, <SOME NUMBER LARGER THAN 1024>]. Tensor sizes: [1, 1024]`. In addition, the `stride` option cannot be used, since it also relies on a correct `model_max_length` being provided.
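
For reference, a minimal sketch of how the broken value surfaces and what the fix changes. The model id `some-org/some-model` is a placeholder for this repository, not the actual name:

```python
from transformers import AutoTokenizer

# "some-org/some-model" is a placeholder for this repository's model id.
tokenizer = AutoTokenizer.from_pretrained("some-org/some-model")

# Before the fix: model_max_length is the transformers placeholder int(1e30),
# so truncation effectively never kicks in.
print(tokenizer.model_max_length)  # 1000000000000000019884624838656

# With the corrected config, truncation caps inputs at the model's
# actual context size of 1024 tokens.
tokenizer.model_max_length = 1024  # the value this PR sets in tokenizer_config.json
encoded = tokenizer("some very long text " * 2000, truncation=True)
print(len(encoded["input_ids"]))  # <= 1024
```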
Description of the `stride` option in a token classification pipeline: "If `stride` is provided, the pipeline is applied on all the text. The text is split into chunks of size `model_max_length`. Works only with fast tokenizers and `aggregation_strategy` different from `NONE`. The value of this argument defines the number of overlapping tokens between chunks. In other words, the model will shift forward by `tokenizer.model_max_length - stride` tokens each step."
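
As an illustration, a sketch of the `stride` usage that this fix unblocks. Again, the model id is a placeholder, and `stride=128` is an arbitrary example value:

```python
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="some-org/some-model",       # placeholder model id
    aggregation_strategy="simple",     # stride requires a strategy other than NONE
    stride=128,                        # needs a finite model_max_length to chunk correctly
)

# With model_max_length fixed to 1024, long inputs are split into
# 1024-token chunks that overlap by 128 tokens.
results = ner("some very long text " * 2000)
print(results[:3])
```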