NewsBERTje

A Domain-Adapted Dutch BERT Model

Model description

NewsBERTje is a domain-adapted Dutch BERT-based model for processing texts in the news domain. We compiled a domain-specific corpus of 20 million tokens of news articles from the online versions of Dutch (and Flemish) newspapers such as NOS, De Morgen, Het Nieuwsblad, Het Laatste Nieuws, De Standaard and Het Belang van Limburg, as well as articles published on VRT News, the news website of the Flemish public broadcaster. The base model, BERTje, was then trained further on this supplemental corpus for a total of 4 epochs, keeping the same hyperparameter configuration that was used in its original pre-training.
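As an illustration of the procedure described above, the following is a minimal sketch of continued masked-language-model training with the Hugging Face transformers and datasets libraries. It is not the authors' training script. The block size of 128, the masking probability of 0.15, the output directory name and the helper names chunk_token_ids and continue_pretraining are all assumptions made for the example; only the BERTje starting point and the 4 epochs come from the description.

```python
from typing import List


def chunk_token_ids(token_ids: List[int], block_size: int) -> List[List[int]]:
    """Split one long token-id stream into full blocks of `block_size`.

    A trailing remainder shorter than `block_size` is dropped.
    """
    n_full = len(token_ids) // block_size
    return [token_ids[i * block_size:(i + 1) * block_size] for i in range(n_full)]


def continue_pretraining(texts: List[str], output_dir: str = "newsbertje-sketch") -> None:
    """Continue MLM training of BERTje on `texts` (hypothetical helper)."""
    # Imported lazily so chunk_token_ids can be used without these packages.
    from datasets import Dataset
    from transformers import (
        AutoModelForMaskedLM,
        AutoTokenizer,
        DataCollatorForLanguageModeling,
        Trainer,
        TrainingArguments,
    )

    base = "GroNLP/bert-base-dutch-cased"  # BERTje checkpoint on the Hugging Face Hub
    tokenizer = AutoTokenizer.from_pretrained(base)
    model = AutoModelForMaskedLM.from_pretrained(base)

    # Tokenize all articles into one stream of ids without special tokens.
    ids: List[int] = []
    for text in texts:
        ids.extend(tokenizer(text, add_special_tokens=False)["input_ids"])

    # Assumed block size: 126 content tokens plus [CLS] and [SEP] = 128.
    blocks = [
        [tokenizer.cls_token_id] + block + [tokenizer.sep_token_id]
        for block in chunk_token_ids(ids, block_size=126)
    ]
    dataset = Dataset.from_dict({"input_ids": blocks})

    # Assumed masking rate; the collator masks tokens on the fly per batch.
    collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)
    # 4 epochs as stated in the description; all other arguments are library defaults.
    args = TrainingArguments(output_dir=output_dir, num_train_epochs=4)
    Trainer(model=model, args=args, train_dataset=dataset, data_collator=collator).train()
```

Calling continue_pretraining(list_of_article_strings) would download BERTje and start training, so it needs the transformers, datasets and torch packages and, realistically, a GPU. To reproduce the described setup, the library defaults used here would have to be replaced by the BERTje pre-training hyperparameters, which are not listed in this card.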

Performance

The model was benchmarked on a range of tasks within the Dutch news domain, such as news sentiment classification, news event prominence classification, sarcasm detection, partisanship detection, and both coarse- and fine-grained news topic classification. NewsBERTje outperforms other Dutch BERT models such as BERTje, RobBERTje and RobBERT, as well as a series of generative LLMs (evaluated in zero-shot settings), on all of these tasks except prominence classification, where BERTje scores slightly higher (74.4 vs. 73.8).

Model	Topic Classification (fine)	Topic Classification (coarse)	Sarcasm Detection	Sentiment Classification	Prominence Classification	Partisanship Detection
BERTje	92.9	82.2	92.9	43.8	74.4	57.8
RobBERT-2023	91.5	80.1	91.6	40.7	69.7	41.7
Llama 3.1-8B-Instruct	81.5	64.7	78.4	33.4	68.7	49.7
GPT 3.5 Turbo	84.6	49.5	51.1	35.8	64.3	51.6
GPT 4o	92.6	64.8	82.6	42.6	70.8	58.7
NewsBERTje	93.8	83.6	93.8	49.2	73.8	61.0
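
The scores in the table above can be compared programmatically. The following small sketch copies the table values into a dictionary and computes, for each task, the strongest other model and NewsBERTje's margin over it. The names TASKS, SCORES and margins are introduced here for illustration only.

```python
TASKS = [
    "Topic Classification (fine)",
    "Topic Classification (coarse)",
    "Sarcasm Detection",
    "Sentiment Classification",
    "Prominence Classification",
    "Partisanship Detection",
]

# Scores copied from the table, in the same column order as TASKS.
SCORES = {
    "BERTje": [92.9, 82.2, 92.9, 43.8, 74.4, 57.8],
    "RobBERT-2023": [91.5, 80.1, 91.6, 40.7, 69.7, 41.7],
    "Llama 3.1-8B-Instruct": [81.5, 64.7, 78.4, 33.4, 68.7, 49.7],
    "GPT 3.5 Turbo": [84.6, 49.5, 51.1, 35.8, 64.3, 51.6],
    "GPT 4o": [92.6, 64.8, 82.6, 42.6, 70.8, 58.7],
    "NewsBERTje": [93.8, 83.6, 93.8, 49.2, 73.8, 61.0],
}


def margins(target="NewsBERTje"):
    """Return {task: (best other model, target score minus that model's score)}."""
    result = {}
    for i, task in enumerate(TASKS):
        others = [m for m in SCORES if m != target]
        best = max(others, key=lambda m: SCORES[m][i])
        result[task] = (best, round(SCORES[target][i] - SCORES[best][i], 1))
    return result


if __name__ == "__main__":
    for task, (model, diff) in margins().items():
        print(f"{task}: {diff:+.1f} vs. {model}")
```

A positive margin means NewsBERTje scores higher than the best competing model on that task; a negative margin means another model is ahead.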

Citation

If you use this model, please cite our work as:

@inproceedings{de2024enhancing,
  title={Enhancing Unrestricted Cross-Document Event Coreference with Graph Reconstruction Networks},
  author={De Langhe, Loic and De Clercq, Orph{\'e}e and Hoste, Veronique},
  booktitle={Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)},
  pages={6122--6133},
  year={2024}
}