---
license: mit
language:
- ru
library_name: transformers
tags:
- text-generation-inference
---
# Model Card for maximxls/text-normalization-ru-terrible

Text normalization for Russian. I couldn't find any existing solutions (besides rule-based algorithms, which I don't like), so I made this one.

## Model Details

### Model Description

A tiny T5 trained from scratch for normalizing Russian text:
- translating numbers into words
- expanding abbreviations into phonetic letter combinations
- transliterating English text into Russian letters
- whatever else was in the dataset (see below)

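To make these tasks concrete, here are a few hand-written before/after pairs in the spirit of the dataset (illustrative examples only, not actual model outputs):

```python
# Hand-written illustrations of the normalization tasks above.
# These are NOT actual model outputs, just the kind of mapping intended.
examples = {
    "в 2022 году": "в две тысячи двадцать втором году",  # numbers into words
    "СССР": "эс эс эс эр",  # abbreviation into phonetic letter combination
    "Windows": "виндоус",  # English transliterated into Russian letters
}

for source, target in examples.items():
    print(f"{source!r} -> {target!r}")
```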
### Model Sources

- **Training code repository:** https://github.com/maximxlss/text_normalization
- **Main dataset:** https://www.kaggle.com/c/text-normalization-challenge-russian-language

## Uses

Useful for TTS, for example with Silero, to make it read numbers and English words (even if not perfectly, at least they aren't ignored).

### Quick Start

```python
from transformers import (
    T5ForConditionalGeneration,
    PreTrainedTokenizerFast,
)


model_path = "maximxls/text-normalization-ru-terrible"

tokenizer = PreTrainedTokenizerFast.from_pretrained(model_path)
model = T5ForConditionalGeneration.from_pretrained(model_path)


example_text = "Я ходил в McDonald's 10 июля 2022 года."

inp_ids = tokenizer(
    example_text,
    return_tensors="pt",
).input_ids
out_ids = model.generate(inp_ids, max_new_tokens=128)[0]
out = tokenizer.decode(out_ids, skip_special_tokens=True)

print(out)
```

`я ходил в макдоналд'эс десятого июля две тысячи двадцать второго года.`

## Bias, Risks, and Limitations

**Very unreliable:**
- For some reason it sometimes skips the first couple of tokens. Adding some extra padding might make it more stable; I wasn't able to solve this in training.
- It is sometimes unstable, repeating or missing words (especially when transliterating).

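If the dropped leading tokens are a problem in practice, one hypothetical workaround (not part of this model's code, and untested against it) is to prepend a throwaway prefix before normalization and strip it afterwards, so the instability lands on the prefix instead of real content. Here `normalize` stands for any `str -> str` wrapper around tokenization, `model.generate`, and decoding:

```python
import re


def normalize_stable(text, normalize, prefix="и вот, "):
    """Hypothetical mitigation: split into sentences, prepend a throwaway
    prefix so dropped leading tokens fall on the prefix, then strip it.
    `normalize` is any str -> str function (e.g. tokenize + generate + decode).
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    results = []
    for sentence in sentences:
        out = normalize(prefix + sentence)
        # Strip the prefix back off if the model preserved it verbatim.
        if out.startswith(prefix):
            out = out[len(prefix):]
        results.append(out)
    return " ".join(results)
```

With an identity `normalize`, `normalize_stable("Привет. Пока.", lambda s: s)` returns the text unchanged; with the real model, the idea is that the prefix absorbs the instability. Whether this actually helps is an open question.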
## Training Details

### Training Data

Data from [this Kaggle challenge](https://www.kaggle.com/c/text-normalization-challenge-russian-language), as well as a bit of extra data written by me.

### Training Procedure

#### Preprocessing

See [`preprocess.py`](https://github.com/maximxlss/text_normalization/blob/master/preprocess.py)

#### Training Hyperparameters

See [`train.py`](https://github.com/maximxlss/text_normalization/blob/master/train.py)

I manually reset the learning rate several times during training; see the metrics.

#### Details

See the [`README` on GitHub](https://github.com/maximxlss/text_normalization) for a step-by-step overview of the training procedure.

## Technical Specifications

### Hardware

A couple tens of hours of RTX 3090 Ti compute on my personal PC (21.65 epochs).