Text classification of documents with high variance in length
by hanshupe
I want to categorize text documents that range from 25 to 3000 words in length. Language models like BERT only support 512 tokens, and it looks like sometimes all 3000 words are needed semantically.
If I use a model like Longformer instead, is such a high variance in document lengths a problem? Or would it be better to train two separate classifiers for different length ranges?
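For context, here is roughly the setup I have in mind. This is just a minimal sketch assuming the `allenai/longformer-base-4096` checkpoint from `transformers`; the `num_labels=5` head and the example documents are placeholders, not my real data:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Longformer supports up to 4096 tokens, which should cover 3000-word documents.
checkpoint = "allenai/longformer-base-4096"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint, num_labels=5  # placeholder number of classes
)

docs = [
    "A short document of around 25 words ...",
    "A much longer document that runs up to 3000 words ...",
]

# padding="longest" pads each batch only to its longest member, so short and
# long documents can share one model; the attention mask tells the model to
# ignore the padding tokens.
inputs = tokenizer(
    docs, padding="longest", truncation=True, max_length=4096, return_tensors="pt"
)

with torch.no_grad():
    logits = model(**inputs).logits
predictions = logits.argmax(dim=-1)
```

My understanding is that the attention mask is what handles the length variance here, but I am not sure whether mixing very short and very long documents in training hurts the classifier in practice.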