add model card
README.md
ADDED
@@ -0,0 +1,229 @@
---
language: bn
tags:
- collaborative
- bengali
- albert
- bangla
license: apache-2.0
datasets:
- Wikipedia
- Oscar
widget:
- text: "ধন্যবাদ। আপনার সাথে কথা [MASK] ভালো লাগলো"
---

<!-- TODO: change widget text -->

# sahajBERT

A model collaboratively pre-trained on the Bengali language using masked language modeling (MLM) and sentence order prediction (SOP) objectives.

## Model description

<!-- You can embed local or remote images using `![](...)` -->

sahajBERT is composed of 1) a tokenizer specially designed for Bengali and 2) an [ALBERT](https://arxiv.org/abs/1909.11942) architecture collaboratively pre-trained on a dump of Wikipedia in Bengali and the Bengali part of OSCAR.

<!-- Add more information about the collaborative training when we have time / preprint available -->

## Intended uses & limitations

You can use the raw model for either masked language modeling or sentence order prediction, but it is mostly intended to be fine-tuned on a downstream task that uses the whole sentence (potentially masked) to make decisions, such as sequence classification, token classification, or question answering.

We fine-tuned our model on 2 of these downstream tasks: [sequence classification](https://huggingface.co/neuropark/sahajBERT-NCC) and [token classification](https://huggingface.co/neuropark/sahajBERT-NER).
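
As a quick illustration, here is a minimal sketch of loading the fine-tuned token classification checkpoint linked above, mirroring the style of the examples below; it assumes the checkpoint loads with the same fast tokenizer class as the base model:

```python
from transformers import AlbertForTokenClassification, PreTrainedTokenizerFast, TokenClassificationPipeline

# Sketch only: load the fine-tuned NER checkpoint linked above
tokenizer = PreTrainedTokenizerFast.from_pretrained("neuropark/sahajBERT-NER")
model = AlbertForTokenClassification.from_pretrained("neuropark/sahajBERT-NER")

ner = TokenClassificationPipeline(model=model, tokenizer=tokenizer)
ner("ধন্যবাদ। আপনার সাথে কথা বলে ভালো লাগলো")  # Change me
```
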
#### How to use

You can use this model directly with a pipeline for masked language modeling:

```python
from transformers import AlbertForMaskedLM, FillMaskPipeline, PreTrainedTokenizerFast

# Initialize tokenizer
tokenizer = PreTrainedTokenizerFast.from_pretrained("neuropark/sahajBERT")

# Initialize model
model = AlbertForMaskedLM.from_pretrained("neuropark/sahajBERT")

# Initialize pipeline
pipeline = FillMaskPipeline(tokenizer=tokenizer, model=model)

raw_text = "ধন্যবাদ। আপনার সাথে কথা [MASK] ভালো লাগলো"  # Change me
pipeline(raw_text)
```
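
The pipeline returns a list of candidate completions for the `[MASK]` position; field names below follow the standard `transformers` fill-mask output:

```python
# Inspect the top predictions for the masked position
for candidate in pipeline(raw_text):
    print(candidate["token_str"], round(candidate["score"], 4))
```
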
Here is how to use this model to get the features of a given text in PyTorch:

```python
from transformers import AlbertModel, PreTrainedTokenizerFast

# Initialize tokenizer
tokenizer = PreTrainedTokenizerFast.from_pretrained("neuropark/sahajBERT")

# Initialize model
model = AlbertModel.from_pretrained("neuropark/sahajBERT")

text = "ধন্যবাদ। আপনার সাথে কথা বলে ভালো লাগলো"  # Change me
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
```
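
`output.last_hidden_state` contains one vector per token. If you need a single fixed-size vector for the whole text, one common option (a sketch, not something prescribed by this model) is mask-aware mean pooling:

```python
import torch

# Average the last hidden state over real (non-padding) tokens only
with torch.no_grad():
    output = model(**encoded_input)

mask = encoded_input["attention_mask"].unsqueeze(-1)      # (batch, seq_len, 1)
summed = (output.last_hidden_state * mask).sum(dim=1)     # (batch, hidden_size)
sentence_embedding = summed / mask.sum(dim=1)              # (batch, hidden_size)
```
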
#### Limitations and bias

<!-- Provide examples of latent issues and potential remediations. -->

WIP

## Training data

The tokenizer was trained on the Bengali part of OSCAR, and the model on a [dump of Wikipedia in Bengali](https://huggingface.co/datasets/lhoestq/wikipedia_bn) and the Bengali part of [OSCAR](https://huggingface.co/datasets/oscar).
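
Both corpora are available through the `datasets` library; a minimal loading sketch (the OSCAR configuration name below is our assumption for the deduplicated Bengali subset):

```python
from datasets import load_dataset

# Bengali Wikipedia dump linked above
wikipedia_bn = load_dataset("lhoestq/wikipedia_bn", split="train")

# Bengali part of OSCAR; the config name is an assumption
oscar_bn = load_dataset("oscar", "unshuffled_deduplicated_bn", split="train")

print(wikipedia_bn)
print(oscar_bn)
```
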
## Training procedure

This model was trained in a collaborative manner by volunteer participants.

<!-- Add more information about the collaborative training when we have time / preprint available + Preprocessing, hardware used, hyperparameters... (maybe use figures)-->

### Contributors leaderboard

| Rank | Username | Total contributed runtime |
|:-------------:|:-------------:|-------------:|
| 1|[khalidsaifullaah](https://huggingface.co/khalidsaifullaah)|11 days 21:02:08|
| 2|[ishanbagchi](https://huggingface.co/ishanbagchi)|9 days 20:37:00|
| 3|[tanmoyio](https://huggingface.co/tanmoyio)|9 days 18:08:34|
| 4|[debajit](https://huggingface.co/debajit)|8 days 14:15:10|
| 5|[skylord](https://huggingface.co/skylord)|6 days 16:35:29|
| 6|[ibraheemmoosa](https://huggingface.co/ibraheemmoosa)|5 days 01:05:57|
| 7|[SaulLu](https://huggingface.co/SaulLu)|5 days 00:46:36|
| 8|[lhoestq](https://huggingface.co/lhoestq)|4 days 20:11:16|
| 9|[nilavya](https://huggingface.co/nilavya)|4 days 08:51:51|
|10|[Priyadarshan](https://huggingface.co/Priyadarshan)|4 days 02:28:55|
|11|[anuragshas](https://huggingface.co/anuragshas)|3 days 05:00:55|
|12|[sujitpal](https://huggingface.co/sujitpal)|2 days 20:52:33|
|13|[manandey](https://huggingface.co/manandey)|2 days 16:17:13|
|14|[albertvillanova](https://huggingface.co/albertvillanova)|2 days 14:14:31|
|15|[justheuristic](https://huggingface.co/justheuristic)|2 days 13:20:52|
|16|[w0lfw1tz](https://huggingface.co/w0lfw1tz)|2 days 07:22:48|
|17|[smoker](https://huggingface.co/smoker)|2 days 02:52:03|
|18|[Soumi](https://huggingface.co/Soumi)|1 days 20:42:02|
|19|[Anjali](https://huggingface.co/Anjali)|1 days 16:28:00|
|20|[OptimusPrime](https://huggingface.co/OptimusPrime)|1 days 09:16:57|
|21|[theainerd](https://huggingface.co/theainerd)|1 days 04:48:57|
|22|[yhn112](https://huggingface.co/yhn112)|0 days 20:57:02|
|23|[kolk](https://huggingface.co/kolk)|0 days 17:57:37|
|24|[arnab](https://huggingface.co/arnab)|0 days 17:54:12|
|25|[imavijit](https://huggingface.co/imavijit)|0 days 16:07:26|
|26|[osanseviero](https://huggingface.co/osanseviero)|0 days 14:16:45|
|27|[subhranilsarkar](https://huggingface.co/subhranilsarkar)|0 days 13:04:46|
|28|[sagnik1511](https://huggingface.co/sagnik1511)|0 days 12:24:57|
|29|[anindabitm](https://huggingface.co/anindabitm)|0 days 08:56:44|
|30|[borzunov](https://huggingface.co/borzunov)|0 days 04:07:35|
|31|[thomwolf](https://huggingface.co/thomwolf)|0 days 03:53:15|
|32|[priyadarshan](https://huggingface.co/priyadarshan)|0 days 03:40:11|
|33|[ali007](https://huggingface.co/ali007)|0 days 03:34:37|
|34|[sbrandeis](https://huggingface.co/sbrandeis)|0 days 03:18:16|
|35|[Preetha](https://huggingface.co/Preetha)|0 days 03:13:47|
|36|[Mrinal](https://huggingface.co/Mrinal)|0 days 03:01:43|
|37|[laxya007](https://huggingface.co/laxya007)|0 days 02:18:34|
|38|[lewtun](https://huggingface.co/lewtun)|0 days 00:34:43|
|39|[Rounak](https://huggingface.co/Rounak)|0 days 00:26:10|
|40|[kshmax](https://huggingface.co/kshmax)|0 days 00:06:38|

## Eval results

We evaluated sahajBERT against 2 other baselines ([XLM-R-large](https://huggingface.co/xlm-roberta-large) and [IndicBert](https://huggingface.co/ai4bharat/indic-bert)) by fine-tuning each pre-trained model 3 times on two downstream tasks in Bengali:

- **NER**: a named entity recognition task on the Bengali split of the [WikiANN](https://huggingface.co/datasets/wikiann) dataset
- **NCC**: a multi-class news classification task on the Soham News Category Classification dataset from IndicGLUE

| Base pretrained Model | NER - F1 (mean ± std) | NCC - Accuracy (mean ± std) |
|:-------------:|:-------------:|:-------------:|
|sahajBERT | 95.45 ± 0.53| 91.97 ± 0.47|
|[XLM-R-large](https://huggingface.co/xlm-roberta-large) | 96.48 ± 0.22| 90.05 ± 0.38|
|[IndicBert](https://huggingface.co/ai4bharat/indic-bert) | 92.52 ± 0.45| 74.46 ± 1.91|
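
For reference, a minimal sketch of how the NER setup above can be reproduced from the base checkpoint; the dataset configuration name and label handling are assumptions on our part, and the exact hyperparameters behind the table are not reproduced here:

```python
from datasets import load_dataset
from transformers import AlbertForTokenClassification, PreTrainedTokenizerFast

# Bengali split of WikiANN (config name assumed to be "bn")
wikiann_bn = load_dataset("wikiann", "bn")
num_labels = wikiann_bn["train"].features["ner_tags"].feature.num_classes

tokenizer = PreTrainedTokenizerFast.from_pretrained("neuropark/sahajBERT")
model = AlbertForTokenClassification.from_pretrained(
    "neuropark/sahajBERT", num_labels=num_labels
)
# Tokenize the texts, align NER labels with word pieces, then fine-tune
# with the Trainer API or a custom training loop.
```
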
### BibTeX entry and citation info

Coming soon!

<!-- ```bibtex
@inproceedings{...,
year={2020}
}
``` -->