Text classification of documents with high variance in length
by hanshupe
I want to categorize text documents that range from 25 to 3000 words in length. Language models like BERT only support 512 tokens, and it looks like sometimes all 3000 words are needed semantically.
If I use a model like Longformer instead, is such a high variance in document lengths a problem? Or would it be better to train two separate classifiers for different length ranges?
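For context, here is roughly the setup I have in mind. This is just a minimal sketch assuming the `allenai/longformer-base-4096` checkpoint from `transformers`; the `num_labels=5` head and the example documents are placeholders, not my real data:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Longformer supports up to 4096 tokens, which should cover 3000-word documents.
checkpoint = "allenai/longformer-base-4096"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint, num_labels=5  # placeholder number of classes
)

docs = [
    "A short document of around 25 words ...",
    "A much longer document that runs up to 3000 words ...",
]

# padding="longest" pads each batch only to its longest member, so short and
# long documents can share one model; the attention mask tells the model to
# ignore the padding tokens.
inputs = tokenizer(
    docs, padding="longest", truncation=True, max_length=4096, return_tensors="pt"
)

with torch.no_grad():
    logits = model(**inputs).logits
predictions = logits.argmax(dim=-1)
```

My understanding is that the attention mask is what handles the length variance here, but I am not sure whether mixing very short and very long documents in training hurts the classifier in practice.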