update readme

README.txt (+46, -9)
---
language:
- fi
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- transformers
widget:
- text: "Minusta täällä on ihana asua!"
---

# Uncased Finnish Sentence BERT model

Finnish Sentence BERT trained from FinBERT. A demo on retrieving the most similar sentences from a dataset of 400 million sentences, using [the cased model](https://huggingface.co/TurkuNLP/sbert-cased-finnish-paraphrase), can be found [here](http://epsilon-it.utu.fi/sbert400m).

## Training

- Library: [sentence-transformers](https://www.sbert.net/)
- FinBERT model: TurkuNLP/bert-base-finnish-uncased-v1
- Data: the data provided [here](https://turkunlp.org/paraphrase.html), including the Finnish Paraphrase Corpus and the automatically collected paraphrase candidates (500K positive and 5M negative pairs)
- Pooling: mean pooling
- Task: binary prediction of whether two sentences are paraphrases. Note: labels 3 and 4 are considered paraphrases, and labels 1 and 2 non-paraphrases ([details on labels](https://aclanthology.org/2021.nodalida-main.29/)). A hedged sketch of this setup follows the list.

## Usage
|
28 |
|
29 |
+
The same as in [HuggingFace documentation](https://huggingface.co/sentence-transformers/bert-base-nli-mean-tokens). Either through `SentenceTransformer` or `HuggingFace Transformers`
|
30 |
|
31 |
### SentenceTransformer
|
32 |
|
33 |
+
```python
|
34 |
from sentence_transformers import SentenceTransformer
|
35 |
sentences = ["Tämä on esimerkkilause.", "Tämä on toinen lause."]
|
36 |
|
|
|
41 |
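For a sentence-similarity model, the embeddings are usually compared with cosine similarity. A short follow-up, not part of the original card and assuming a reasonably recent sentence-transformers version (which provides `util.cos_sim`), continuing from the snippet above:

```python
# Cosine similarity between the two example sentences (follow-up, not from the card)
from sentence_transformers import util

similarity = util.cos_sim(embeddings[0], embeddings[1])
print(similarity)  # shape (1, 1); values near 1 indicate near-paraphrases
```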

### HuggingFace Transformers

```python
from transformers import AutoTokenizer, AutoModel
import torch


# Mean pooling: take the attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # first element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


# Sentences we want sentence embeddings for
sentences = ["Tämä on esimerkkilause.", "Tämä on toinen lause."]

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('TurkuNLP/sbert-uncased-finnish-paraphrase')
model = AutoModel.from_pretrained('TurkuNLP/sbert-uncased-finnish-paraphrase')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform mean pooling
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)
```
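The pooled embeddings from the Transformers path can be compared the same way; a small follow-up using plain PyTorch, not part of the original card, continuing from the block above:

```python
# Pairwise cosine similarities of the pooled embeddings (follow-up, not from the card)
import torch.nn.functional as F

normalized = F.normalize(sentence_embeddings, p=2, dim=1)
print(normalized @ normalized.T)  # (2, 2) similarity matrix; diagonal entries are 1.0
```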

## Evaluation Results

A publication detailing the evaluation results is currently being drafted.

## Full Model Architecture

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 128, 'do_lower_case': True}) with Transformer model: BertModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)
```
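This summary is what `SentenceTransformer` itself prints for the loaded model; assuming the model id used in the usage snippets above, it can be reproduced with:

```python
# Reproduces the architecture summary above (model id as assumed in this card's snippets)
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('TurkuNLP/sbert-uncased-finnish-paraphrase')
print(model)
```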

## Citing & Authors

While the publication is being drafted, please cite [this page](https://turkunlp.org/paraphrase.html).

## References

- J. Kanerva, F. Ginter, LH. Chang, I. Rastas, V. Skantsi, J. Kilpeläinen, HM. Kupari, J. Saarni, M. Sevón, and O. Tarkka. Finnish Paraphrase Corpus. In *NoDaLiDa 2021*, 2021.
- N. Reimers and I. Gurevych. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In *EMNLP-IJCNLP*, pages 3982–3992, 2019.
- A. Virtanen, J. Kanerva, R. Ilo, J. Luoma, J. Luotolahti, T. Salakoski, F. Ginter, and S. Pyysalo. Multilingual is not enough: BERT for Finnish. *arXiv preprint arXiv:1912.07076*, 2019.