---
license: apache-2.0
language:
- af
- am
- ar
- as
- az
- be
- bg
- bn
- br
- bs
- ca
- cs
- cy
- da
- de
- el
- en
- eo
- es
- et
- eu
- fa
- fi
- fr
- fy
- ga
- gd
- gl
- gu
- ha
- he
- hi
- hr
- hu
- hy
- id
- is
- it
- ja
- jv
- ka
- kk
- km
- kn
- ko
- ku
- ky
- la
- lo
- lt
- lv
- mg
- mk
- ml
- mn
- mr
- ms
- my
- ne
- nl
- 'no'
- om
- or
- pa
- pl
- ps
- pt
- ro
- ru
- sa
- sd
- si
- sk
- sl
- so
- sq
- sr
- su
- sv
- sw
- ta
- te
- th
- tl
- tr
- ug
- uk
- ur
- uz
- vi
- xh
- yi
- zh
---
# AffilGood-AffilXLM
For the first two tasks (affiliation span identification and NER), we fine-tuned [RoBERTa](https://huggingface.co/docs/transformers/en/model_doc/roberta) and [XLM-RoBERTa](https://huggingface.co/docs/transformers/en/model_doc/xlm-roberta)
models on (predominantly) English and multilingual datasets, respectively. [Gururangan *et al.* (2020)](https://aclanthology.org/2020.acl-main.740.pdf) show that
continuing to pre-train language models on task-relevant unlabeled data can improve the performance of the final fine-tuned task-specific
models, particularly in low-resource settings. Since affiliation strings follow a *grammar* of their own,
different from what is found in free natural language, we explore whether our affiliation span identification and
NER models benefit from being fine-tuned from models that have been *further pre-trained* on raw affiliation strings with the masked token prediction objective.
We adapt the models on 10 million random raw affiliation strings from OpenAlex and report perplexity on 50k randomly held-out affiliation strings.
In what follows, we refer to our adapted models as AffilRoBERTa (adapted RoBERTa) and AffilXLM (adapted XLM-RoBERTa, this model).
Specific details of the adaptive pre-training procedure can be found in [Duran-Silva *et al.* (2024)](https://aclanthology.org/2024.sdp-1.13.pdf).
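For reference, the following is a minimal sketch of this kind of adaptive (continued) pre-training with the Hugging Face Transformers `Trainer`. The file names, sequence length, and hyperparameters are illustrative assumptions, not the exact settings used to train AffilXLM.

```python
# Sketch: continued masked-LM pre-training of XLM-RoBERTa on raw affiliation strings.
# Paths and hyperparameters are placeholders, not the AffilXLM training configuration.
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForMaskedLM.from_pretrained("xlm-roberta-base")

# One affiliation string per line in plain-text files (hypothetical paths).
dataset = load_dataset(
    "text",
    data_files={"train": "affiliations_train.txt", "eval": "affiliations_heldout.txt"},
)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# Dynamic masking for the masked token prediction objective.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="affilxlm-adapted",
    per_device_train_batch_size=32,
    num_train_epochs=1,
    save_steps=10_000,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["eval"],
    data_collator=collator,
)
trainer.train()
```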
## Evaluation
We report perplexity (PPL), computed from the masked language modeling loss, on 50k randomly sampled held-out raw affiliation strings.
| **Model** | PPL<sub>base</sub> | PPL<sub>adapt</sub> |
|-----------------|--------------------|----------------------|
| RoBERTa | 1.972 | 1.106 |
| XLM-RoBERTa | 1.997 | 1.101 |
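PPL of this kind can be obtained by exponentiating the mean masked language modeling loss on the held-out set. A short sketch, reusing the `trainer` and the tokenized held-out split from the sketch above:

```python
# Sketch: perplexity from the mean masked-LM loss on held-out affiliation strings.
import math

eval_metrics = trainer.evaluate(eval_dataset=tokenized["eval"])
perplexity = math.exp(eval_metrics["eval_loss"])
print(f"PPL on held-out affiliation strings: {perplexity:.3f}")
```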
Compared to the base models, the adapted models achieve competitive or better performance on the two downstream affiliation-processing tasks:
| Task | RoBERTa | XLM-R | AffilRoBERTa | **AffilXLM (this model)** |
|-----|------|------|------|----------|
| AffilGood-NER | .910 | .915 | .920 | **.925** |
| AffilGood-SPAN | .929 | .931 | **.938** | .927 |
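As a usage sketch, the adapted model can serve as the backbone for a token-classification fine-tune such as AffilGood-NER. The Hub identifier and label set below are placeholders (not the actual AffilGood configuration); substitute this repository's model ID and your task's labels.

```python
# Sketch: loading the adapted model as a backbone for NER fine-tuning.
# Model ID and labels are illustrative placeholders.
from transformers import AutoModelForTokenClassification, AutoTokenizer

model_id = "SIRIS-Lab/affilgood-affilxlm"  # placeholder: use this repository's ID
labels = ["O", "B-ORG", "I-ORG", "B-CITY", "I-CITY"]  # illustrative label set

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(
    model_id,
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)
```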
### Citation
```bibtex
@inproceedings{duran-silva-etal-2024-affilgood,
title = "{A}ffil{G}ood: Building reliable institution name disambiguation tools to improve scientific literature analysis",
author = "Duran-Silva, Nicolau and
Accuosto, Pablo and
Przyby{\l}a, Piotr and
Saggion, Horacio",
editor = "Ghosal, Tirthankar and
Singh, Amanpreet and
Waard, Anita and
Mayr, Philipp and
Naik, Aakanksha and
Weller, Orion and
Lee, Yoonjoo and
Shen, Shannon and
Qin, Yanxia",
booktitle = "Proceedings of the Fourth Workshop on Scholarly Document Processing (SDP 2024)",
month = aug,
year = "2024",
address = "Bangkok, Thailand",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.sdp-1.13",
pages = "135--144",
}
```
### Disclaimer
<details>
<summary>Click to expand</summary>
The model published in this repository is intended for a generalist purpose
and is made available to third parties under an Apache v2.0 License.
Please keep in mind that the model may have biases and/or other undesirable distortions.
When third parties deploy or provide systems and/or services to other parties using this model
(or a system based on it), or become users of the model itself, they should note that it is
their responsibility to mitigate the risks arising from its use and, in any event, to comply with
applicable regulations, including regulations regarding the use of Artificial Intelligence.
In no event shall the owners and creators of the model be liable for any results arising from the use made by third parties.
</details>