Disease mention recognizer for Spanish Social Media texts 🦠💬

This resource derives from the participation of the SINAI team in the Mining Social Media Content for Disease Mention (SocialDisNER) shared task. The task focused on recognising disease mentions in tweets written in Spanish, with the aim of using Twitter as a proxy to better understand the societal perception of disease. It brought together a community effort to develop named entity recognition (NER) approaches that detect all kinds of disease mentions in social media text.
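As a rough illustration of what recognising disease mentions involves, the sketch below turns token-level BIO tags into character spans in the tweet text. It is a minimal example under stated assumptions, not the SINAI post-processing code: the function name `bio_to_spans`, the `ENFERMEDAD` label, the example tweet and the token offsets are all illustrative and are not taken from this repository.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start_char, end_char, label)


def bio_to_spans(tags: List[str], offsets: List[Tuple[int, int]]) -> List[Span]:
    """Merge token-level BIO tags into character-level entity spans.

    `tags` and `offsets` are parallel lists: one tag and one
    (start_char, end_char) pair per token. An "I-" tag that does not
    continue an open span with the same label is ignored.
    """
    spans: List[Span] = []
    current = None  # open span as [start, end, label], or None
    for tag, (start, end) in zip(tags, offsets):
        if tag.startswith("B-"):
            # A new mention starts: close any open span first.
            if current is not None:
                spans.append(tuple(current))
            current = [start, end, tag[2:]]
        elif tag.startswith("I-") and current is not None and current[2] == tag[2:]:
            # Extend the open mention to the end of this token.
            current[1] = end
        else:
            # "O" tag or stray "I-": close any open span.
            if current is not None:
                spans.append(tuple(current))
            current = None
    if current is not None:
        spans.append(tuple(current))
    return spans


# Hypothetical tweet and hand-written tags (assumed label name: ENFERMEDAD).
text = "Tengo diabetes tipo 2 desde 2019"
offsets = [(0, 5), (6, 14), (15, 19), (20, 21), (22, 27), (28, 32)]
tags = ["O", "B-ENFERMEDAD", "I-ENFERMEDAD", "I-ENFERMEDAD", "O", "O"]

spans = bio_to_spans(tags, offsets)
print(spans)  # [(6, 21, 'ENFERMEDAD')]
print([text[s:e] for s, e, _ in spans])  # ['diabetes tipo 2']
```

In practice the tags and offsets would come from the tokenizer and model rather than being written by hand.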

Our approach is based on a model pre-trained on general-domain text. To leverage the large-scale additional Silver Standard data, whose labels were generated automatically and provided by the task organisers, we designed a two-stage fine-tuning framework.
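The two-stage idea can be sketched as a simple training schedule: a first pass over the automatically labelled Silver Standard data, followed by a second pass over the manually annotated data. The snippet below is only an outline under assumptions. The names `Stage`, `two_stage_fine_tune` and `train_one_epoch`, as well as the epoch counts and learning rates, are hypothetical placeholders and not the settings used by the SINAI system.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence, Tuple


@dataclass
class Stage:
    name: str
    examples: Sequence[Any]
    epochs: int           # assumed value, for illustration only
    learning_rate: float  # assumed value, for illustration only


def two_stage_fine_tune(
    model: Any,
    silver: Sequence[Any],
    gold: Sequence[Any],
    train_one_epoch: Callable[[Any, Sequence[Any], float], Any],
) -> Tuple[Any, List[Tuple[str, int]]]:
    """Run weakly supervised training on silver data, then train on gold data.

    `train_one_epoch` is a caller-supplied function that takes the model,
    the examples and a learning rate, and returns the updated model.
    """
    stages = [
        # Stage 1: noisy, automatically labelled data.
        Stage("silver", silver, epochs=1, learning_rate=5e-5),
        # Stage 2: manually annotated data, starting from the stage 1 weights.
        Stage("gold", gold, epochs=3, learning_rate=2e-5),
    ]
    history: List[Tuple[str, int]] = []
    for stage in stages:
        for epoch in range(stage.epochs):
            model = train_one_epoch(model, stage.examples, stage.learning_rate)
            history.append((stage.name, epoch))
    return model, history
```

The key design point is that the second stage continues from the weights produced by the first, so the cleaner annotations have the final say over the model's behaviour.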

Results

The model contained in this repository forms the foundation of the NER system presented by the SINAI team at SocialDisNER. Combined with pysentimiento data pre-processing and rule-based post-processing of the submission, it obtained encouraging results in the official evaluation, which are summarised in the table below.

Precision	Recall	F1-score
0.756	0.795	0.770

System description paper and citation

The system description paper was published at the Social Media Mining for Health Applications (#SMM4H) workshop, held at COLING 2022 in October 2022.

@inproceedings{chizhikova-etal-2022-sinai,
    title = "{SINAI}@{SMM}4{H}{'}22: Transformers for biomedical social media text mining in {S}panish",
    author = "Chizhikova, Mariia  and
      L{\'o}pez-{\'U}beda, Pilar  and
      D{\'\i}az-Galiano, Manuel C.  and
      Ure{\~n}a-L{\'o}pez, L. Alfonso  and
      Mart{\'\i}n-Valdivia, M. Teresa",
    booktitle = "Proceedings of The Seventh Workshop on Social Media Mining for Health Applications, Workshop {\&} Shared Task",
    month = oct,
    year = "2022",
    address = "Gyeongju, Republic of Korea",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.smm4h-1.8",
    pages = "27--30",
    abstract = "This paper covers participation of the SINAI team in Tasks 5 and 10 of the Social Media Mining for Health ({\#}SSM4H) workshop at COLING-2022. These tasks focus on leveraging Twitter posts written in Spanish for healthcare research. The objective of Task 5 was to classify tweets reporting COVID-19 symptoms, while Task 10 required identifying disease mentions in Twitter posts. The presented systems explore large RoBERTa language models pre-trained on Twitter data in the case of tweet classification task and general-domain data for the disease recognition task. We also present a text pre-processing methodology implemented in both systems and describe an initial weakly-supervised fine-tuning phase alongside with a submission post-processing procedure designed for Task 10. The systems obtained 0.84 F1-score on the Task 5 and 0.77 F1-score on Task 10.",
}