Safetensors
xlm-roberta
nicolauduran45 committed on
Commit 80945c6
1 Parent(s): 9bb9f48

Update README.md

Files changed (1)
  1. README.md +173 -3
README.md CHANGED
---
license: apache-2.0
language:
- af
- am
- ar
- as
- az
- be
- bg
- bn
- br
- bs
- ca
- cs
- cy
- da
- de
- el
- en
- eo
- es
- et
- eu
- fa
- fi
- fr
- fy
- ga
- gd
- gl
- gu
- ha
- he
- hi
- hr
- hu
- hy
- id
- is
- it
- ja
- jv
- ka
- kk
- km
- kn
- ko
- ku
- ky
- la
- lo
- lt
- lv
- mg
- mk
- ml
- mn
- mr
- ms
- my
- ne
- nl
- 'no'
- om
- or
- pa
- pl
- ps
- pt
- ro
- ru
- sa
- sd
- si
- sk
- sl
- so
- sq
- sr
- su
- sv
- sw
- ta
- te
- th
- tl
- tr
- ug
- uk
- ur
- uz
- vi
- xh
- yi
- zh
---

# AffilGood-AffilXLM

For the first two AffilGood tasks (affiliation span identification and NER), we fine-tuned two models, [RoBERTa](https://huggingface.co/docs/transformers/en/model_doc/roberta) and [XLM-RoBERTa](https://huggingface.co/docs/transformers/en/model_doc/xlm-roberta),
for (predominantly) English and multilingual datasets, respectively. [Gururangan *et al.* (2020)](https://aclanthology.org/2020.acl-main.740.pdf) show that
continuing to pre-train language models on task-relevant unlabeled data can improve the performance of the final fine-tuned task-specific
models, particularly in low-resource settings. Given that the *grammar* of affiliation strings has its own structure,
different from what would be expected in free natural language, we explore whether our affiliation span identification and
NER models benefit from being fine-tuned from models that have been *further pre-trained* on raw affiliation strings with the masked token prediction objective.

We adapt both models on 10 million randomly sampled raw affiliation strings from OpenAlex, reporting perplexity on 50k randomly held-out affiliation strings.
In what follows, we refer to our adapted models as AffilRoBERTa (the adapted RoBERTa model) and AffilXLM (the adapted XLM-RoBERTa model).

Specific details of the adaptive pre-training procedure can be found in [Duran-Silva *et al.* (2024)](https://aclanthology.org/2024.sdp-1.13.pdf).

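To sanity-check what the adapted model has learned, you can query it directly with the masked token prediction task. Below is a minimal sketch; the model id is an assumption, so substitute this repository's actual Hub id:

```python
from transformers import pipeline

# Hypothetical Hub id; replace with the actual id of this repository.
MODEL_ID = "SIRIS-Lab/AffilGood-AffilXLM"

# XLM-RoBERTa checkpoints use <mask> as the mask token.
fill_mask = pipeline("fill-mask", model=MODEL_ID)

affiliation = "Department of Computer Science, <mask> of Oxford, Oxford, United Kingdom"
for prediction in fill_mask(affiliation, top_k=3):
    print(f"{prediction['token_str']!r:>15}  score={prediction['score']:.3f}")
```
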
## Evaluation

We report the masked language modeling loss, expressed as perplexity (PPL), on 50k randomly sampled held-out raw affiliation strings.

| **Model**   | PPL<sub>base</sub> | PPL<sub>adapt</sub> |
|-------------|--------------------|---------------------|
| RoBERTa     | 1.972              | 1.106               |
| XLM-RoBERTa | 1.997              | 1.101               |
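
As a rough illustration of how a perplexity of this kind can be obtained, the sketch below computes a pseudo-perplexity (masking one token at a time) over a few held-out strings. The model id and example strings are assumptions, and the paper's masking setup may differ, so the numbers will not exactly match the table above.

```python
import math

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

MODEL_ID = "SIRIS-Lab/AffilGood-AffilXLM"  # hypothetical id; substitute the actual one

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForMaskedLM.from_pretrained(MODEL_ID).eval()

# Toy stand-ins for the 50k held-out affiliation strings.
held_out = [
    "Institute for Advanced Study, Princeton, NJ, USA",
    "Universitat Pompeu Fabra, Barcelona, Spain",
]

token_losses = []
with torch.no_grad():
    for text in held_out:
        input_ids = tokenizer(text, return_tensors="pt")["input_ids"][0]
        # Mask each non-special token in turn and score the original token.
        for position in range(1, len(input_ids) - 1):
            masked = input_ids.clone()
            masked[position] = tokenizer.mask_token_id
            logits = model(masked.unsqueeze(0)).logits[0, position]
            log_probs = torch.log_softmax(logits, dim=-1)
            token_losses.append(-log_probs[input_ids[position]].item())

# Pseudo-perplexity: exponential of the mean per-token loss.
print("pseudo-PPL ≈", math.exp(sum(token_losses) / len(token_losses)))
```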

The adapted models achieve competitive performance on the two downstream tasks for processing affiliation strings (AffilGood-NER and AffilGood-SPAN), compared to the base models:

| Task           | RoBERTa | XLM-RoBERTa | AffilRoBERTa | **AffilXLM (this model)** |
|----------------|---------|-------------|--------------|---------------------------|
| AffilGood-NER  | .910    | .915        | .920         | **.925**                  |
| AffilGood-SPAN | .929    | .931        | **.938**     | .927                      |

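This checkpoint is intended as a base for that kind of fine-tuning. A minimal loading sketch for token classification follows; the model id and BIO label set are illustrative assumptions, not the actual AffilGood-NER configuration.

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer

MODEL_ID = "SIRIS-Lab/AffilGood-AffilXLM"  # hypothetical id; substitute the actual one

# Illustrative label set only; the real AffilGood-NER tag set is defined with the dataset.
labels = ["O", "B-ORG", "I-ORG", "B-SUB", "I-SUB", "B-CITY", "I-CITY", "B-COUNTRY", "I-COUNTRY"]

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForTokenClassification.from_pretrained(
    MODEL_ID,
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)
# From here, fine-tune with the Trainer API (or a custom loop) on AffilGood-NER style data.
```
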
### Citation

```bibtex
@inproceedings{duran-silva-etal-2024-affilgood,
    title = "{A}ffil{G}ood: Building reliable institution name disambiguation tools to improve scientific literature analysis",
    author = "Duran-Silva, Nicolau and
      Accuosto, Pablo and
      Przyby{\l}a, Piotr and
      Saggion, Horacio",
    editor = "Ghosal, Tirthankar and
      Singh, Amanpreet and
      Waard, Anita and
      Mayr, Philipp and
      Naik, Aakanksha and
      Weller, Orion and
      Lee, Yoonjoo and
      Shen, Shannon and
      Qin, Yanxia",
    booktitle = "Proceedings of the Fourth Workshop on Scholarly Document Processing (SDP 2024)",
    month = aug,
    year = "2024",
    address = "Bangkok, Thailand",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.sdp-1.13",
    pages = "135--144",
}
```

### Disclaimer

<details>
<summary>Click to expand</summary>

The model published in this repository is intended for a generalist purpose
and is made available to third parties under an Apache v2.0 License.

Please keep in mind that the model may have biases and/or other undesirable distortions.
When third parties deploy or provide systems and/or services to other parties using this model
(or a system based on it), or become users of the model itself, they should note that it is
their responsibility to mitigate the risks arising from its use and, in any event, to comply with
applicable regulations, including regulations regarding the use of Artificial Intelligence.

In no event shall the owners and creators of the model be liable for any results arising from the use made by third parties.
</details>