Safetensors
xlm-roberta
nicolauduran45 committed on
Commit 80945c6
1 Parent(s): 9bb9f48

Update README.md

Files changed (1)
  1. README.md +173 -3
README.md CHANGED
---
license: apache-2.0
language:
- af
- am
- ar
- as
- az
- be
- bg
- bn
- br
- bs
- ca
- cs
- cy
- da
- de
- el
- en
- eo
- es
- et
- eu
- fa
- fi
- fr
- fy
- ga
- gd
- gl
- gu
- ha
- he
- hi
- hr
- hu
- hy
- id
- is
- it
- ja
- jv
- ka
- kk
- km
- kn
- ko
- ku
- ky
- la
- lo
- lt
- lv
- mg
- mk
- ml
- mn
- mr
- ms
- my
- ne
- nl
- 'no'
- om
- or
- pa
- pl
- ps
- pt
- ro
- ru
- sa
- sd
- si
- sk
- sl
- so
- sq
- sr
- su
- sv
- sw
- ta
- te
- th
- tl
- tr
- ug
- uk
- ur
- uz
- vi
- xh
- yi
- zh
---

# AffilGood-AffilXLM

For the first two AffilGood tasks (affiliation span identification and NER), we fine-tuned two models, [RoBERTa](https://huggingface.co/docs/transformers/en/model_doc/roberta) and [XLM-RoBERTa](https://huggingface.co/docs/transformers/en/model_doc/xlm-roberta),
for (predominantly) English and multilingual datasets, respectively. [Gururangan *et al.* (2020)](https://aclanthology.org/2020.acl-main.740.pdf) show that
continuing to pre-train language models on task-relevant unlabeled data can improve the performance of the final fine-tuned task-specific
models, particularly in low-resource settings. Given that the *grammar* of affiliation strings has its own structure,
different from what would be expected in free natural language, we explore whether our affiliation span identification and
NER models benefit from being fine-tuned from models that have been *further pre-trained* on raw affiliation strings with the masked token prediction objective.

We adapt both models on 10 million randomly sampled raw affiliation strings from OpenAlex, reporting perplexity on 50k randomly held-out affiliation strings.
In what follows, we refer to our adapted models as AffilRoBERTa (the adapted RoBERTa model) and AffilXLM (the adapted XLM-RoBERTa model).

Specific details of the adaptive pre-training procedure can be found in [Duran-Silva *et al.* (2024)](https://aclanthology.org/2024.sdp-1.13.pdf).

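To sanity-check what the adapted model has learned, you can query it directly with the masked token prediction task. Below is a minimal sketch; the model id is an assumption, so substitute this repository's actual Hub id:

```python
from transformers import pipeline

# Hypothetical Hub id; replace with the actual id of this repository.
MODEL_ID = "SIRIS-Lab/AffilGood-AffilXLM"

# XLM-RoBERTa checkpoints use <mask> as the mask token.
fill_mask = pipeline("fill-mask", model=MODEL_ID)

affiliation = "Department of Computer Science, <mask> of Oxford, Oxford, United Kingdom"
for prediction in fill_mask(affiliation, top_k=3):
    print(f"{prediction['token_str']!r:>15}  score={prediction['score']:.3f}")
```
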
## Evaluation

We report the masked language modeling loss, expressed as perplexity (PPL), on 50k randomly sampled held-out raw affiliation strings.

| **Model**   | PPL<sub>base</sub> | PPL<sub>adapt</sub> |
|-------------|--------------------|---------------------|
| RoBERTa     | 1.972              | 1.106               |
| XLM-RoBERTa | 1.997              | 1.101               |
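
As a rough illustration of how a perplexity of this kind can be obtained, the sketch below computes a pseudo-perplexity (masking one token at a time) over a few held-out strings. The model id and example strings are assumptions, and the paper's masking setup may differ, so the numbers will not exactly match the table above.

```python
import math

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

MODEL_ID = "SIRIS-Lab/AffilGood-AffilXLM"  # hypothetical id; substitute the actual one

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForMaskedLM.from_pretrained(MODEL_ID).eval()

# Toy stand-ins for the 50k held-out affiliation strings.
held_out = [
    "Institute for Advanced Study, Princeton, NJ, USA",
    "Universitat Pompeu Fabra, Barcelona, Spain",
]

token_losses = []
with torch.no_grad():
    for text in held_out:
        input_ids = tokenizer(text, return_tensors="pt")["input_ids"][0]
        # Mask each non-special token in turn and score the original token.
        for position in range(1, len(input_ids) - 1):
            masked = input_ids.clone()
            masked[position] = tokenizer.mask_token_id
            logits = model(masked.unsqueeze(0)).logits[0, position]
            log_probs = torch.log_softmax(logits, dim=-1)
            token_losses.append(-log_probs[input_ids[position]].item())

# Pseudo-perplexity: exponential of the mean per-token loss.
print("pseudo-PPL ≈", math.exp(sum(token_losses) / len(token_losses)))
```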

The adapted models achieve competitive performance on the two downstream tasks for processing affiliation strings (AffilGood-NER and AffilGood-SPAN), compared to the base models:

| Task           | RoBERTa | XLM-RoBERTa | AffilRoBERTa | **AffilXLM (this model)** |
|----------------|---------|-------------|--------------|---------------------------|
| AffilGood-NER  | .910    | .915        | .920         | **.925**                  |
| AffilGood-SPAN | .929    | .931        | **.938**     | .927                      |

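This checkpoint is intended as a base for that kind of fine-tuning. A minimal loading sketch for token classification follows; the model id and BIO label set are illustrative assumptions, not the actual AffilGood-NER configuration.

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer

MODEL_ID = "SIRIS-Lab/AffilGood-AffilXLM"  # hypothetical id; substitute the actual one

# Illustrative label set only; the real AffilGood-NER tag set is defined with the dataset.
labels = ["O", "B-ORG", "I-ORG", "B-SUB", "I-SUB", "B-CITY", "I-CITY", "B-COUNTRY", "I-COUNTRY"]

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForTokenClassification.from_pretrained(
    MODEL_ID,
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)
# From here, fine-tune with the Trainer API (or a custom loop) on AffilGood-NER style data.
```
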
### Citation

```bibtex
@inproceedings{duran-silva-etal-2024-affilgood,
    title = "{A}ffil{G}ood: Building reliable institution name disambiguation tools to improve scientific literature analysis",
    author = "Duran-Silva, Nicolau and
      Accuosto, Pablo and
      Przyby{\l}a, Piotr and
      Saggion, Horacio",
    editor = "Ghosal, Tirthankar and
      Singh, Amanpreet and
      Waard, Anita and
      Mayr, Philipp and
      Naik, Aakanksha and
      Weller, Orion and
      Lee, Yoonjoo and
      Shen, Shannon and
      Qin, Yanxia",
    booktitle = "Proceedings of the Fourth Workshop on Scholarly Document Processing (SDP 2024)",
    month = aug,
    year = "2024",
    address = "Bangkok, Thailand",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.sdp-1.13",
    pages = "135--144",
}
```

### Disclaimer

<details>
<summary>Click to expand</summary>

The model published in this repository is intended for a generalist purpose
and is made available to third parties under an Apache v2.0 License.

Please keep in mind that the model may have biases and/or other undesirable distortions.
When third parties deploy or provide systems and/or services to other parties using this model
(or a system based on it), or become users of the model itself, they should note that it is
their responsibility to mitigate the risks arising from its use and, in any event, to comply with
applicable regulations, including regulations regarding the use of Artificial Intelligence.

In no event shall the owners and creators of the model be liable for any results arising from the use made by third parties.
</details>