NewsBERTje
A Domain-Adapted Dutch BERT Model

Model description
NewsBERTje is a domain-adapted Dutch BERT-based model aimed at processing texts in the news domain. We set up a domain-specific corpus of 20 million tokens of news articles from the online versions of Dutch and Flemish news outlets such as NOS, De Morgen, Het Nieuwsblad, Het Laatste Nieuws, De Standaard and Het Belang van Limburg, as well as articles published on the news website of the Flemish public broadcaster VRT News. The original BERTje model was then further pre-trained on this supplemental corpus for a total of 4 epochs, keeping the same hyperparameter configuration that was used in its original pre-training.
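The snippet below is a minimal sketch of such a continued (masked-language-modelling) pre-training setup with the Hugging Face Trainer. The corpus file name `news_corpus.txt`, batch size and sequence length are illustrative assumptions, not the exact training configuration; only the 4 epochs and the BERTje starting checkpoint come from the description above.

```python
# Sketch of domain-adaptive pre-training (continued MLM) on a news corpus.
# Assumes a plain-text file "news_corpus.txt" with one article per line (hypothetical);
# batch size and max length are illustrative, not the reported configuration.
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset

base_checkpoint = "GroNLP/bert-base-dutch-cased"  # original BERTje checkpoint
tokenizer = AutoTokenizer.from_pretrained(base_checkpoint)
model = AutoModelForMaskedLM.from_pretrained(base_checkpoint)

# Load and tokenize the domain-specific news corpus.
dataset = load_dataset("text", data_files={"train": "news_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

# Standard masked-language-modelling collator (15% token masking, as in BERT).
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="newsbertje",
    num_train_epochs=4,              # 4 epochs over the supplemental corpus
    per_device_train_batch_size=16,  # illustrative value
    save_strategy="epoch",
)

trainer = Trainer(model=model, args=args, train_dataset=tokenized, data_collator=collator)
trainer.train()
```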
Performance
The model's performance was benchmarked on a range of tasks within the Dutch news domain: news sentiment classification, news event prominence classification, sarcasm detection, partisanship detection, and both coarse- and fine-grained news topic classification. NewsBERTje outperforms other Dutch BERT models (BERTje, RobBERTje and RobBERT) as well as a series of generative LLMs evaluated in a zero-shot setting on nearly all of these tasks (BERTje scores marginally higher on prominence classification). A minimal usage sketch follows the results table below.
| Model | Topic Classification (fine) | Topic Classification (coarse) | Sarcasm Detection | Sentiment Classification | Prominence Classification | Partisanship Detection |
|---|---|---|---|---|---|---|
| BERTje | 92.9 | 82.2 | 92.9 | 43.8 | 74.4 | 57.8 |
| RobBERT-2023 | 91.5 | 80.1 | 91.6 | 40.7 | 69.7 | 41.7 |
| Llama 3.1-8B-Instruct | 81.5 | 64.7 | 78.4 | 33.4 | 68.7 | 49.7 |
| GPT-3.5 Turbo | 84.6 | 49.5 | 51.1 | 35.8 | 64.3 | 51.6 |
| GPT-4o | 92.6 | 64.8 | 82.6 | 42.6 | 70.8 | 58.7 |
| NewsBERTje | 93.8 | 83.6 | 93.8 | 49.2 | 73.8 | 61.0 |
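As a quick illustration of downstream use, the sketch below loads the checkpoint for a news classification task (e.g. sentiment or topic classification). The hub id `"NewsBERTje"` is a placeholder for the model's actual repository name, and `num_labels` must match your own label set; the model card does not prescribe this fine-tuning recipe.

```python
# Illustrative downstream use for news text classification.
# "NewsBERTje" is a placeholder hub id; replace it with the actual repository name.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "NewsBERTje"  # placeholder, not the verified hub id
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)

text = "De regering kondigt nieuwe maatregelen aan voor de energiesector."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.softmax(dim=-1))  # class probabilities (head still needs fine-tuning)
```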
Citation
If you use this model, please cite our work as:
@inproceedings{de2024enhancing,
title={Enhancing Unrestricted Cross-Document Event Coreference with Graph Reconstruction Networks},
author={De Langhe, Loic and De Clercq, Orph{\'e}e and Hoste, Veronique},
booktitle={Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)},
pages={6122--6133},
year={2024}
}