model_max_length and max_seq_length
Hi!
First of all, great job on this sBERT model!
Secondly, it looks like something weird is going on with the `model_max_length` and `max_seq_length` attributes when instantiating the model via `AutoModel`/`AutoTokenizer` and `SentenceTransformer`, respectively.
The sentence-transformers implementation gives a max length of 75:
```python
from sentence_transformers import SentenceTransformer

model_st = SentenceTransformer('NbAiLab/nb-sbert-base')
model_st.max_seq_length
# 75
```
Loading the tokenizer through HF's `AutoTokenizer`, however, gives a very different max length:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('NbAiLab/nb-sbert-base')
tokenizer.model_max_length
# 1000000000000000019884624838656
```
The second one is clearly incorrect, but is 75 really the correct max sequence length for this model? If I remember correctly, BERT models have a sequence length of 512, or was that changed when finetuning this model?
This also means that for inputs longer than 75 tokens, the two implementations will produce different embeddings, which may be worth mentioning.
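For context, here is a rough sketch of how one could check the truncation difference; the `tokenize()` call and the fields it returns are my assumption about how sentence-transformers handles this internally:

```python
# Rough sketch: compare how many tokens each path actually feeds the model.
# Assumes SentenceTransformer.tokenize() truncates to max_seq_length internally,
# while a plain tokenizer call with no truncation argument keeps everything.
from sentence_transformers import SentenceTransformer
from transformers import AutoTokenizer

model_st = SentenceTransformer('NbAiLab/nb-sbert-base')
tokenizer = AutoTokenizer.from_pretrained('NbAiLab/nb-sbert-base')

long_text = "dette er en veldig lang setning " * 50  # well past 75 tokens

st_len = model_st.tokenize([long_text])['input_ids'].shape[1]  # truncated
hf_len = len(tokenizer(long_text)['input_ids'])                # not truncated

print(st_len, hf_len)  # expected: 75 vs. several hundred
```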
Hi.
The sequence length of 75 comes from the training script we use.
The other value seems to come from the tokenizer having no max length set; the nb-bert-base model has the same value.
The correct value is 75, but I wouldn't be surprised if you could raise the max length and feed in sequences of up to 512 tokens with good results.
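If you want to try that, something like the sketch below should work; note that `max_seq_length` being writable on the loaded model and the 512-position limit of the underlying BERT backbone are assumptions on my part:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('NbAiLab/nb-sbert-base')
print(model.max_seq_length)  # 75, the value baked in by the training script

# Raise the limit to BERT's positional-embedding maximum of 512 tokens
model.max_seq_length = 512

# Inputs are now truncated at 512 tokens instead of 75
embedding = model.encode("en lang norsk tekst " * 60)
print(embedding.shape)  # (768,) for a BERT-base backbone
```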
Thanks for the quick reply, good to know! I'll experiment with 512-token input sequences to see how they compare with the 75-token ones.