simran-kh committed on
Commit
dbe35ff
1 Parent(s): b802783

Create README.md

Files changed (1)
  1. README.md +173 -0
README.md ADDED
@@ -0,0 +1,173 @@
+ # MuRIL Large
+ Multilingual Representations for Indian Languages: a BERT Large (24L) model pre-trained on 17 Indian languages and their transliterated counterparts.
+
+ ## Overview
+
+ This model uses a BERT Large architecture [1], pre-trained from scratch on the
+ Wikipedia [2], Common Crawl [3], PMINDIA [4] and Dakshina [5] corpora for 17 [6]
+ Indian languages.
+
+ We use a training paradigm similar to multilingual BERT, with a few
+ modifications:
+
+ * We also include translation and transliteration segment pairs in
+   training.
+ * We use an exponent value of 0.3 rather than 0.7 for upsampling, which has
+   been shown to enhance low-resource performance [7].
+
+ See the Training section for more details.
+
+ ## Training
+
+ The MuRIL model is pre-trained on monolingual segments as well as parallel
+ segments, as detailed below:
+
+ * Monolingual Data: We make use of publicly available corpora from Wikipedia
+   and Common Crawl for 17 Indian languages.
+ * Parallel Data: We have two types of parallel data:
+     * Translated Data: We obtain translations of the above monolingual
+       corpora using the Google NMT pipeline. We feed translated segment pairs
+       as input. We also make use of the publicly available PMINDIA corpus.
+     * Transliterated Data: We obtain transliterations of Wikipedia using the
+       IndicTrans [8] library. We feed transliterated segment pairs as input.
+       We also make use of the publicly available Dakshina dataset.
+
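As a rough illustration of how a parallel pair might be presented to the model, the sketch below packs two segments in the standard BERT two-segment layout (`[CLS] A [SEP] B [SEP]`) with segment ids 0 and 1. The layout, the whitespace tokenisation, the example sentence, and the helper name `pack_segment_pair` are all assumptions made for illustration; the README does not spell out the exact input format.

```python
from typing import List, Tuple


def pack_segment_pair(segment_a: str, segment_b: str) -> Tuple[List[str], List[int]]:
    """Pack two segments into one BERT-style input (hypothetical helper).

    Whitespace splitting stands in for the real WordPiece tokenizer.
    """
    tokens_a = segment_a.split()
    tokens_b = segment_b.split()
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Segment id 0 covers [CLS], segment A and its [SEP]; id 1 covers the rest.
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segment_ids


# A Hindi sentence paired with a romanised (transliterated) form.
tokens, segment_ids = pack_segment_pair("मैं घर जा रहा हूँ", "main ghar ja raha hoon")
```

Under this layout a translated pair and a transliterated pair look the same to the model; only the content of the second segment differs.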
+ We use an exponent value of 0.3 to calculate duplication multiplier values for
+ upsampling lower-resourced languages, and set dupe factors accordingly. Note
+ that we limit transliterated pairs to Wikipedia only.
+
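To make the effect of the exponent concrete, the sketch below applies the exponentiated smoothing described by Conneau et al. [7]: each language's share of the data is raised to the power `alpha` and the result is renormalised. The segment counts and the helper `smoothed_sampling_probs` are made up for illustration and are not MuRIL's actual statistics. The dupe-factor computation is only one plausible way of turning smoothed shares into duplication multipliers, not necessarily the one used in training.

```python
from typing import Dict


def smoothed_sampling_probs(counts: Dict[str, int], alpha: float) -> Dict[str, float]:
    """Return q_i proportional to p_i ** alpha, where p_i = n_i / sum(n)."""
    total = sum(counts.values())
    weights = {lang: (n / total) ** alpha for lang, n in counts.items()}
    norm = sum(weights.values())
    return {lang: w / norm for lang, w in weights.items()}


# Hypothetical segment counts, NOT the real corpus sizes.
counts = {"hi": 1_000_000, "ta": 100_000, "ks": 1_000}

probs_03 = smoothed_sampling_probs(counts, alpha=0.3)
probs_07 = smoothed_sampling_probs(counts, alpha=0.7)

# Assumed definition of a dupe factor: how much a language's raw share must be
# scaled to reach its smoothed share, relative to the largest language.
total = sum(counts.values())
raw = {lang: n / total for lang, n in counts.items()}
largest = max(counts, key=counts.get)
dupe_factors = {
    lang: (probs_03[lang] / raw[lang]) / (probs_03[largest] / raw[largest])
    for lang in counts
}
```

A smaller exponent flattens the distribution, so with 0.3 the smallest language receives a larger share of the samples than it would with 0.7.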
+ The model was trained using a self-supervised masked language modeling task. We
+ do whole word masking with a maximum of 80 predictions. The model was trained
+ for 1500K steps, with a batch size of 8192, and a max sequence length of 512.
+
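A minimal sketch of whole word masking with a cap on the number of predictions, assuming WordPiece-style tokens in which continuation pieces start with `##`. The 15% default masking rate is borrowed from the original BERT recipe and is not stated here, and the helper name `whole_word_mask` is hypothetical. The sketch always substitutes `[MASK]` and omits the random-token and keep-original replacements that BERT also applies.

```python
import random
from typing import List, Tuple


def whole_word_mask(
    tokens: List[str],
    max_predictions: int = 80,
    masked_lm_prob: float = 0.15,  # assumed rate, not given in this README
    seed: int = 0,
) -> Tuple[List[str], List[int]]:
    """Mask whole words: all pieces of a chosen word are masked together."""
    rng = random.Random(seed)

    # Group token indices into words; a "##" piece continues the previous word.
    words: List[List[int]] = []
    for i, tok in enumerate(tokens):
        if tok in ("[CLS]", "[SEP]"):
            continue
        if tok.startswith("##") and words:
            words[-1].append(i)
        else:
            words.append([i])

    # Never predict more than max_predictions tokens per sequence.
    budget = min(max_predictions, max(1, round(len(tokens) * masked_lm_prob)))
    rng.shuffle(words)

    masked = list(tokens)
    positions: List[int] = []
    for word in words:
        # Skip a word if masking all of its pieces would exceed the budget.
        if len(positions) + len(word) > budget:
            continue
        for i in word:
            masked[i] = "[MASK]"
            positions.append(i)
    return masked, sorted(positions)


# Tiny example; the high masking rate is only there so that something gets masked.
tokens = ["[CLS]", "trans", "##liter", "##ation", "is", "common", "[SEP]"]
masked, positions = whole_word_mask(tokens, masked_lm_prob=0.5)
```

Because words are masked as units, the three pieces of "transliteration" are either all replaced or all left intact.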
+ ### Trainable parameters
+
+ All parameters in the module are trainable, and fine-tuning all parameters is
+ the recommended practice.
+
+ ## Uses & Limitations
+
+ This model is intended to be used for a variety of downstream NLP tasks for
+ Indian languages. It is also trained on transliterated data, since
+ transliteration is commonly observed in the Indian context. The model is not
+ expected to perform well on languages other than the ones used in pre-training,
+ i.e. the 17 Indian languages listed in [6].
+
+ ## Evaluation
+
+ We provide the results of fine-tuning this model on a set of downstream tasks.<br/>
+ We choose these tasks from the XTREME benchmark, with evaluation done on Indian language test sets.<br/>
+ All results are computed in a zero-shot setting, with English being the high-resource training set language.<br/>
+ The results for XLM-R (Large) are taken from the XTREME paper [9].
+
+ * Shown below are results on datasets from the XTREME benchmark (in %)
+ <br/>
+
+ PANX (F1) | bn | en | hi | ml | mr | ta | te | ur | Average
+ :------------ | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ------:
+ XLM-R (large) | 78.8 | 84.7 | 73.0 | 67.8 | 68.1 | 59.5 | 55.8 | 56.4 | 68.0
+ MuRIL (large) | 85.8 | 85.0 | 78.3 | 75.6 | 77.3 | 71.1 | 65.6 | 83.0 | 77.7
+
+ <br/>
+
+ UDPOS (F1) | en | hi | mr | ta | te | ur | Average
+ :------------ | ---: | ---: | ---: | ---: | ---: | ---: | ------:
+ XLM-R (large) | 96.1 | 76.4 | 80.8 | 65.2 | 86.6 | 70.3 | 79.2
+ MuRIL (large) | 95.7 | 71.3 | 85.7 | 62.6 | 85.8 | 62.8 | 77.3
+
+ <br/>
+
+ XNLI (Accuracy) | en | hi | ur | Average
+ :-------------- | ---: | ---: | ---: | ------:
+ XLM-R (large) | 88.7 | 75.6 | 71.7 | 78.7
+ MuRIL (large) | 88.4 | 75.8 | 71.7 | 78.6
+
+ <br/>
+
+ XQuAD (F1/EM) | en | hi | Average
+ :------------ | --------: | --------: | --------:
+ XLM-R (large) | 86.5/75.7 | 76.7/59.7 | 81.6/67.7
+ MuRIL (large) | 88.2/77.8 | 78.4/62.4 | 83.3/70.1
+
+ <br/>
+
+ MLQA (F1/EM) | en | hi | Average
+ :------------ | --------: | --------: | --------:
+ XLM-R (large) | 83.5/70.6 | 70.6/53.1 | 77.1/61.9
+ MuRIL (large) | 84.4/71.7 | 72.2/54.1 | 78.3/62.9
+
+ <br/>
+
+ TyDiQA (F1/EM) | en | bn | te | Average
+ :------------- | --------: | --------: | --------: | --------:
+ XLM-R (large) | 71.5/56.8 | 64.0/47.8 | 70.1/43.6 | 68.5/49.4
+ MuRIL (large) | 75.9/66.8 | 67.1/53.1 | 71.5/49.8 | 71.5/56.6
+
+ <br/>
+
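The Average column appears to be the unweighted mean of the per-language scores. As a quick check under that assumption, the sketch below recomputes the PANX averages from the per-language F1 values in the PANX table.

```python
from statistics import mean

# Per-language PANX F1 scores copied from the table (bn, en, hi, ml, mr, ta, te, ur).
panx_f1 = {
    "XLM-R (large)": [78.8, 84.7, 73.0, 67.8, 68.1, 59.5, 55.8, 56.4],
    "MuRIL (large)": [85.8, 85.0, 78.3, 75.6, 77.3, 71.1, 65.6, 83.0],
}

# Unweighted (macro) average over languages, rounded to one decimal place.
panx_avg = {model: round(mean(scores), 1) for model, scores in panx_f1.items()}
```

The recomputed values agree with the reported 68.0 and 77.7, which supports reading the column as a simple mean over languages, English included.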
+ The fine-tuning hyperparameters are as follows:
+
+ Task | Batch Size | Learning Rate | Epochs | Warm-up Ratio
+ :----- | ---------: | ------------: | -----: | ------------:
+ PANX | 32 | 2e-5 | 10 | 0.1
+ UDPOS | 64 | 5e-6 | 10 | 0.1
+ XNLI | 128 | 2e-5 | 5 | 0.1
+ XQuAD | 32 | 3e-5 | 2 | 0.1
+ MLQA | 32 | 3e-5 | 2 | 0.1
+ TyDiQA | 32 | 3e-5 | 3 | 0.1
+
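The hyperparameters can be captured as a small configuration mapping. The sketch below also derives a warm-up step count from the warm-up ratio, assuming warm-up covers that fraction of all optimizer steps. The helper `warmup_steps` and the training-set size in the usage line are hypothetical and are not taken from the actual fine-tuning setup.

```python
import math
from typing import Dict, Union

Number = Union[int, float]

# Values copied from the hyperparameter table.
FINE_TUNE_CONFIGS: Dict[str, Dict[str, Number]] = {
    "PANX": {"batch_size": 32, "learning_rate": 2e-5, "epochs": 10, "warmup_ratio": 0.1},
    "UDPOS": {"batch_size": 64, "learning_rate": 5e-6, "epochs": 10, "warmup_ratio": 0.1},
    "XNLI": {"batch_size": 128, "learning_rate": 2e-5, "epochs": 5, "warmup_ratio": 0.1},
    "XQuAD": {"batch_size": 32, "learning_rate": 3e-5, "epochs": 2, "warmup_ratio": 0.1},
    "MLQA": {"batch_size": 32, "learning_rate": 3e-5, "epochs": 2, "warmup_ratio": 0.1},
    "TyDiQA": {"batch_size": 32, "learning_rate": 3e-5, "epochs": 3, "warmup_ratio": 0.1},
}


def warmup_steps(task: str, num_train_examples: int) -> int:
    """Warm-up steps, assuming warm-up spans warmup_ratio of all optimizer steps."""
    cfg = FINE_TUNE_CONFIGS[task]
    steps_per_epoch = math.ceil(num_train_examples / cfg["batch_size"])
    total_steps = steps_per_epoch * cfg["epochs"]
    return round(total_steps * cfg["warmup_ratio"])


# Hypothetical training-set size, for illustration only.
steps = warmup_steps("PANX", num_train_examples=20_000)
```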
+ ## References
+
+ \[1]: Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova. [BERT:
+ Pre-training of Deep Bidirectional Transformers for Language
+ Understanding](https://arxiv.org/abs/1810.04805). arXiv preprint
+ arXiv:1810.04805, 2018.
+
+ \[2]: [Wikipedia](https://www.tensorflow.org/datasets/catalog/wikipedia)
+
+ \[3]: [Common Crawl](http://commoncrawl.org/the-data/)
+
+ \[4]: [PMINDIA](http://lotus.kuee.kyoto-u.ac.jp/WAT/indic-multilingual/index.html)
+
+ \[5]: [Dakshina](https://github.com/google-research-datasets/dakshina)
+
+ \[6]: Assamese (as), Bengali (bn), English (en), Gujarati (gu), Hindi (hi),
+ Kannada (kn), Kashmiri (ks), Malayalam (ml), Marathi (mr), Nepali (ne), Oriya
+ (or), Punjabi (pa), Sanskrit (sa), Sindhi (sd), Tamil (ta), Telugu (te) and Urdu
+ (ur).
+
+ \[7]: Conneau, Alexis, et al.
+ [Unsupervised cross-lingual representation learning at scale](https://arxiv.org/pdf/1911.02116.pdf).
+ arXiv preprint arXiv:1911.02116 (2019).
+
+ \[8]: [IndicTrans](https://github.com/libindic/indic-trans)
+
+ \[9]: Hu, J., Ruder, S., Siddhant, A., Neubig, G., Firat, O., & Johnson, M.
+ (2020). [Xtreme: A massively multilingual multi-task benchmark for evaluating
+ cross-lingual generalization.](https://arxiv.org/pdf/2003.11080.pdf) arXiv
+ preprint arXiv:2003.11080.
+
+ \[10]: Fang, Y., Wang, S., Gan, Z., Sun, S., & Liu, J. (2020).
+ [FILTER: An Enhanced Fusion Method for Cross-lingual Language Understanding.](https://arxiv.org/pdf/2009.05166.pdf)
+ arXiv preprint arXiv:2009.05166.
+
+ ## Citation
+
+ If you find MuRIL useful in your applications, please cite the following paper:
+
+ ```
+ @misc{khanuja2021muril,
+ title={MuRIL: Multilingual Representations for Indian Languages},
+ author={Simran Khanuja and Diksha Bansal and Sarvesh Mehtani and Savya Khosla and Atreyee Dey and Balaji Gopalan and Dilip Kumar Margam and Pooja Aggarwal and Rajiv Teja Nagipogu and Shachi Dave and Shruti Gupta and Subhash Chandra Bose Gali and Vish Subramanian and Partha Talukdar},
+ year={2021},
+ eprint={2103.10730},
+ archivePrefix={arXiv},
+ primaryClass={cs.CL}
+ }
+ ```
+
+ ## Contact
+
+ Please mail your queries/feedback to [email protected].