---
language:
- multilingual
tags:
- hack
datasets:
- Wikipedia
---

## bert-base-multilingual-cased-segment1

This is a version of multilingual BERT (`bert-base-multilingual-cased`) in which the segment embedding of the 1's is copied into the 0's. Yes, that's all there is to it. We have found that this substantially improves performance in low-resource setups for word-level tasks (e.g. an average gain of 2.5 LAS on a variety of UD treebanks). More details are to be released in our LREC 2022 paper titled: Frustratingly Easy Performance Improvements for Cross-lingual Transfer: A Tale on BERT and Segment Embeddings.

These embeddings are generated by the following code:

```python
import torch
from transformers import AutoModel

baseEmbeddings = AutoModel.from_pretrained("bert-base-multilingual-cased")

# Copy the segment-1 embedding into the segment-0 slot. The in-place write to
# a Parameter must happen inside torch.no_grad() to avoid an autograd error.
tte = baseEmbeddings.embeddings.token_type_embeddings.weight.clone().detach()
with torch.no_grad():
    baseEmbeddings.embeddings.token_type_embeddings.weight[0, :] = tte[1, :]
```
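The trick itself is independent of BERT. As a minimal, self-contained sketch (the two-row embedding table and hidden size 8 are hypothetical stand-ins for BERT's `token_type_embeddings`), the copy can be demonstrated and sanity-checked with a plain `torch.nn.Embedding`:

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for BERT's segment-embedding table:
# 2 segment ids, hidden size 8.
token_type_embeddings = nn.Embedding(2, 8)

# Copy the segment-1 row into the segment-0 row.
tte = token_type_embeddings.weight.clone().detach()
with torch.no_grad():
    token_type_embeddings.weight[0, :] = tte[1, :]

# Both segment ids now map to the same embedding vector.
assert torch.equal(token_type_embeddings.weight[0], token_type_embeddings.weight[1])
```

After the copy, every token receives the segment-1 embedding regardless of which segment id the tokenizer assigns, which is the entire modification made to the released checkpoint.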

More details and other varieties can be found in the repo: https://bitbucket.org/robvanderg/segmentembeds/