File size: 1,289 Bytes
ab1128c |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 |
This is finetune version of [SimCSE: Simple Contrastive Learning of Sentence Embeddings](https://arxiv.org/abs/2104.08821)
, train unsupervised on 570K stroke sentences from : stroke books, quora medical, quora's stroke and human annotates.
### Extract sentence representation
```
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("demdecuong/stroke_simcse")
model = AutoModel.from_pretrained("demdecuong/stroke_simcse")
text = "What are disease related to red stroke's causes?"
inputs = tokenizer(text, return_tensors='pt')
outputs = model(**inputs)[1]
```
### Build up embedding for database
```
database = [
'What is the daily checklist for stroke returning home',
'What are some tips for stroke adapt new life',
'What should I consider when using nursing-home care'
]
embedding = torch.zeros((len(database),768))
for i in range(len(database)):
inputs = tokenizer(database[i], return_tensors="pt")
outputs = model(**inputs)[1]
embedding[i] = outputs
print(embedding.shape)
```
### Result
On our Poc testset , which contains pairs of matching question related to stroke from human-generated.
| Model | Top-1 Accuracy |
| ------------- | ------------- |
| SimCSE (supervised) | 75.83 |
| SimCSE (ours) | 76.66 | |