Commit 6101dd6 by cointegrated (1 parent: d9aa21f)

Create README.md

README.md ADDED
@@ -0,0 +1,67 @@
---
language:
- ba
license: apache-2.0
tags:
- grammatical error correction
---

# Canine-c Bashkir Spelling Correction v1

This model is a version of [google/canine-c](https://huggingface.co/google/canine-c) fine-tuned to fix corrupted texts.
It was trained on a mixture of two parallel datasets in the Bashkir language:
- sentences post-edited by humans after OCR
- sentences corrupted artificially at random, paired with their original versions

For each character, the model predicts whether to replace it and whether to insert another character next to it.

In this way, the model can be used to fix spelling or OCR errors.
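
To make the per-character scheme concrete, here is a minimal sketch that applies hand-written labels to a short string without loading the model. The label names `KEEP`, `DELETE`, and `PASS` come from the usage code in this README; the helper `apply_edits` and the example labels are hypothetical and only illustrate the idea.

```python
def apply_edits(text, this_labels, next_labels):
    """Apply one THIS label and one NEXT label per character of `text`.

    this_labels[i]: 'KEEP', 'DELETE', or a replacement character for text[i].
    next_labels[i]: 'PASS', or a character to insert right after text[i].
    """
    result = []
    for char, this_label, next_label in zip(text, this_labels, next_labels):
        if this_label == 'KEEP':
            result.append(char)
        elif this_label != 'DELETE':
            result.append(this_label)  # replace the character
        if next_label != 'PASS':
            result.append(next_label)  # insert a character after it
    return ''.join(result)


# Hypothetical labels for the toy input 'c at': delete the stray space
# and insert 's' after the final 't'.
this_labels = ['KEEP', 'DELETE', 'KEEP', 'KEEP']
next_labels = ['PASS', 'PASS', 'PASS', 's']
print(apply_edits('c at', this_labels, next_labels))  # cats
```
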

On a held-out set, it reduces the number of required edits by 40%.
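
The README does not say how "required edits" are counted. One plausible reading, assumed here, is character-level Levenshtein distance to the reference text, in which case the reduction could be measured roughly as follows. The functions `levenshtein` and `edit_reduction` and the toy strings are illustrative and are not the author's evaluation code.

```python
def levenshtein(a, b):
    """Character-level edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute or match
            ))
        prev = curr
    return prev[-1]


def edit_reduction(corrupted, fixed, reference):
    """Share of edits removed by the model, summed over lists of sentences.

    Assumes the corrupted sentences differ from the references somewhere,
    otherwise the denominator is zero.
    """
    before = sum(levenshtein(c, r) for c, r in zip(corrupted, reference))
    after = sum(levenshtein(f, r) for f, r in zip(fixed, reference))
    return 1 - after / before


# Toy data, not taken from the real held-out set.
corrupted = ['he llo wrld']
fixed = ['hello wrld']
reference = ['hello world']
print(edit_reduction(corrupted, fixed, reference))  # 0.5
```
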

## How to use

To use the model, pass a sentence to the `fix_text` function defined in the following code:

```Python
import torch
from transformers import CanineTokenizer, CanineForTokenClassification

tokenizer = CanineTokenizer.from_pretrained('slone/canine-c-bashkir-gec-v1')
model = CanineForTokenClassification.from_pretrained('slone/canine-c-bashkir-gec-v1')
if torch.cuda.is_available():
    model.cuda()

# Labels are prefixed with THIS_ (edit the current character) or
# NEXT_ (insert a character after it); strip the 5-character prefix.
LABELS_THIS = [c[5:] for c in model.config.id2label.values() if c.startswith('THIS_')]
LABELS_NEXT = [c[5:] for c in model.config.id2label.values() if c.startswith('NEXT_')]

def fix_text(text, boost=0):
    """Apply the model to edit the text. `boost` is a parameter to control edit aggressiveness."""
    bx = tokenizer(text, return_tensors='pt', padding=True)
    # The logits are modified in place below, so stay in inference mode until argmax.
    with torch.inference_mode():
        out = model(**bx.to(model.device))
        n1, n2 = len(LABELS_THIS), len(LABELS_NEXT)
        # The first n1 logits score the THIS labels, the remaining n2 the NEXT labels.
        logits1 = out.logits[0, :, :n1].view(-1, n1)
        logits2 = out.logits[0, :, n1:].view(-1, n2)
        if boost:
            # Penalize the label at index 0 of each group (position 0 is skipped for THIS).
            logits1[1:, 0] -= boost
            logits2[:, 0] -= boost
        ids1, ids2 = logits1.argmax(-1).tolist(), logits2.argmax(-1).tolist()
    result = []
    # The leading space stands in for the special token at position 0.
    for c, id1, id2 in zip(' ' + text, ids1, ids2):
        l1, l2 = LABELS_THIS[id1], LABELS_NEXT[id2]
        if l1 == 'KEEP':
            result.append(c)
        elif l1 != 'DELETE':
            result.append(l1)  # replace the character
        if l2 != 'PASS':
            result.append(l2)  # insert a character after it
    return ''.join(result)

text = 'У йыл дан д ың йөҙө һoрөмлэнде.'
print(fix_text(text))  # Уйылдандың йөҙө һөрөмләнде.
```

The `boost` parameter controls how aggressively the model edits the text:
positive values make changes more likely, negative values make them less likely.
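
As a rough illustration of the mechanism, `fix_text` subtracts `boost` from the logit at index 0 of each label group before taking the argmax. The sketch below reproduces that step on made-up numbers using plain Python lists. It assumes, without confirmation from this README, that index 0 holds the "no change" label; the helper `pick_label`, the label order, and the logit values are all hypothetical.

```python
def pick_label(logits, labels, boost=0.0):
    """Subtract `boost` from the first logit, then return the argmax label."""
    adjusted = list(logits)
    adjusted[0] -= boost
    best = max(range(len(adjusted)), key=adjusted.__getitem__)
    return labels[best]


# Hypothetical label order and logits for a single character.
labels = ['KEEP', 'DELETE', 'ә']
logits = [2.0, 0.5, 1.5]

print(pick_label(logits, labels))            # KEEP
print(pick_label(logits, labels, boost=1))   # ә
print(pick_label(logits, labels, boost=-1))  # KEEP
```

With `boost=1` the "no change" score drops from 2.0 to 1.0, so the replacement label wins; a negative `boost` widens the margin in favor of leaving the character alone.
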