SIRIS-Lab
/

affilgood-affilroberta

+---
+license: apache-2.0
+language:
+- en
+---
+# AffilGood-AffilRoBERTa
+For the first two tasks (affiliation span identification and NER), we fine-tuned two models, [RoBERTa](https://huggingface.co/docs/transformers/en/model_doc/roberta) and [XLM-RoBERTa](https://huggingface.co/docs/transformers/en/model_doc/xlm-roberta),
+for (predominantly) English and multilingual datasets, respectively. [Gururangan *et al.* (2020)](https://aclanthology.org/2020.acl-main.740.pdf) show that
+continuing to pre-train language models on task-relevant unlabeled data can improve the performance of the final fine-tuned task-specific
+models, particularly in low-resource settings. Because the *grammar* of affiliation strings has its own structure,
+which differs from what one would expect in free natural language, we explore whether our affiliation span identification and
+NER models benefit from being fine-tuned from models that have been *further pre-trained* on raw affiliation strings with the masked token prediction task.
+We adapt RoBERTa-base to 10 million random raw affiliation strings from OpenAlex, reporting perplexity on 50k randomly held-out affiliation strings.
+In what follows, we refer to our adapted models as AffilRoBERTa (adapted RoBERTa model) and AffilXLM (adapted XLM-RoBERTa).
+Specific details of the adaptive pre-training procedure can be found in [Duran-Silva *et al.* (2024)](https://aclanthology.org/2024.sdp-1.13.pdf).
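+
+Since the adapted model is trained on masked token prediction, one way to try it is through a fill-mask query. The following is a minimal sketch, not an official usage recipe: it assumes the Hub id `SIRIS-Lab/affilgood-affilroberta` (taken from the repository path), that the checkpoint loads with the `transformers` `fill-mask` pipeline, and that the mask token is RoBERTa's `<mask>`. The helper names and the example affiliation string are hypothetical.
+
```python
# Assumed Hub id, taken from the repository path (not confirmed by the card text).
MODEL_ID = "SIRIS-Lab/affilgood-affilroberta"


def mask_word(text: str, word: str, mask_token: str = "<mask>") -> str:
    """Replace the first occurrence of `word` in `text` with the mask token."""
    if word not in text:
        raise ValueError(f"{word!r} does not occur in the input text")
    return text.replace(word, mask_token, 1)


def top_mask_predictions(masked_text: str, model_id: str = MODEL_ID, top_k: int = 5):
    """Return (token, score) pairs proposed for the masked position.

    Requires the `transformers` package and network access on first use.
    """
    # Imported lazily so that mask_word() works without transformers installed.
    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model=model_id)
    predictions = fill_mask(masked_text, top_k=top_k)
    return [(p["token_str"].strip(), p["score"]) for p in predictions]


# Hypothetical affiliation string, made up for illustration.
example = "Department of Physics, University of Barcelona, Spain"
masked = mask_word(example, "University")
print(masked)
```
+
+Calling `top_mask_predictions(masked)` would then download the checkpoint and return the model's candidates for the masked word.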
+## Evaluation
+We report masked language modeling loss as perplexity (PPL) on 50k randomly sampled held-out raw affiliation strings.
+
+| **Model**       | PPL<sub>base</sub> | PPL<sub>adapt</sub> |
+|-----------------|--------------------|----------------------|
+| RoBERTa         | 1.972             | 1.106               |
+| XLM-RoBERTa     | 1.997             | 1.101               |
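+
+To make the relation between loss and PPL concrete, the sketch below uses the standard definition of perplexity as the exponential of the mean cross-entropy per masked token. It assumes the loss is measured in nats and averaged with token weighting; the card does not state these details, and the function names are illustrative.
+
```python
import math


def perplexity(batch_losses, masked_token_counts):
    """Perplexity from per-batch mean MLM losses (natural-log cross-entropy).

    Each batch loss is weighted by the number of masked tokens it was
    averaged over, so batches of different sizes contribute proportionally.
    """
    total_tokens = sum(masked_token_counts)
    if total_tokens == 0:
        raise ValueError("no masked tokens to evaluate")
    total_nll = sum(loss * n for loss, n in zip(batch_losses, masked_token_counts))
    return math.exp(total_nll / total_tokens)


def loss_from_perplexity(ppl):
    """Invert the relation: mean loss in nats per masked token."""
    return math.log(ppl)


# RoBERTa PPL values from the table above, converted back to a mean loss.
for name, ppl in [("base", 1.972), ("adapted", 1.106)]:
    print(f"{name}: PPL={ppl} -> loss={loss_from_perplexity(ppl):.3f} nats")
```
+
+Under these assumptions, the drop from 1.972 to 1.106 corresponds to the mean loss falling from roughly 0.68 to roughly 0.10 nats per masked token.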
+
+Compared to the base models, AffilGood-AffilRoBERTa achieves competitive performance on two affiliation string processing tasks:
+
+| Task| RoBERTa | XLM | **AffilRoBERTa (this model)** | AffilXLM |
+|-----|------|------|------|----------|
+| AffilGood-NER | .910 | .915 | .920 | **.925** |
+| AffilGood-SPAN | .929 | .931 | **.938** | .927 |
+### Citation
+```bibtex
+@inproceedings{duran-silva-etal-2024-affilgood,
+    title = "{A}ffil{G}ood: Building reliable institution name disambiguation tools to improve scientific literature analysis",
+    author = "Duran-Silva, Nicolau  and
+      Accuosto, Pablo  and
+      Przyby{\l}a, Piotr  and
+      Saggion, Horacio",
+    editor = "Ghosal, Tirthankar  and
+      Singh, Amanpreet  and
+      Waard, Anita  and
+      Mayr, Philipp  and
+      Naik, Aakanksha  and
+      Weller, Orion  and
+      Lee, Yoonjoo  and
+      Shen, Shannon  and
+      Qin, Yanxia",
+    booktitle = "Proceedings of the Fourth Workshop on Scholarly Document Processing (SDP 2024)",
+    month = aug,
+    year = "2024",
+    address = "Bangkok, Thailand",
+    publisher = "Association for Computational Linguistics",
+    url = "https://aclanthology.org/2024.sdp-1.13",
+    pages = "135--144",
+}
+```
+### Disclaimer
+<details>
+<summary>Click to expand</summary>
+The model published in this repository is intended for general-purpose use
+and is made available to third parties under an Apache v2.0 License.
+Please keep in mind that the model may have bias and/or any other undesirable distortions.
+When third parties deploy or provide systems and/or services to other parties using this model
+(or a system based on it) or become users of the model itself, they should note that it is
+their responsibility to mitigate the risks arising from its use and, in any event, to comply with
+applicable regulations, including regulations regarding the use of Artificial Intelligence.
+In no event shall the owners and creators of the model be liable for any results arising from the use made by third parties.
+</details>