Migrate model card from transformers-repo

Read announcement at https://discuss.huggingface.co/t/announcement-all-model-cards-will-be-migrated-to-hf-co-model-repos/2755
Original file history: https://github.com/huggingface/transformers/commits/master/model_cards/flexudy/t5-base-multi-sentence-doctor/README.md
![avatar](sent-banner.png)

# Sentence-Doctor
Sentence Doctor is a T5 model that attempts to correct errors or mistakes found in sentences. The model works on English, German and French text.

## 1. Problem:
Many NLP models depend on tasks like *Text Extraction Libraries, OCR, Speech to Text libraries* and **Sentence Boundary Detection**.
As a consequence, errors caused by these tasks in your NLP pipeline can affect the quality of models in applications, especially since models are often trained on **clean** input.

## 2. Solution:
Here we provide a model that **attempts** to reconstruct sentences based on their context (surrounding text). The task is pretty straightforward:
* `Given an "erroneous" sentence, and its context, reconstruct the "intended" sentence`.

## 3. Use Cases:
* Attempt to repair noisy sentences that were extracted with OCR software or text extractors.
* Attempt to repair sentence boundaries.
  * Example (in German): **Input: "und ich bin im**",
  * Prefix_Context: "Hallo! Mein Name ist John", Postfix_Context: "Januar 1990 geboren."
  * Output: "John und ich bin im Jahr 1990 geboren"
* Possibly sentence-level spelling correction -- although this is not the intended use.
  * Input: "I went to church **las yesteday**" => Output: "I went to church last Sunday".

## 4. Disclaimer
Note how we always emphasise the word *attempt*. The current version of the model was only trained on **150K** sentences from the tatoeba dataset (https://tatoeba.org/eng), 50K per language (En, Fr, De).
Hence, we strongly encourage you to fine-tune the model on your dataset. We might release a version trained on more data.

## 5. Datasets
We generated synthetic data from the tatoeba dataset (https://tatoeba.org/eng) by randomly applying different transformations to words and characters based on some probabilities. The datasets are available in the data folder (where **sentence_doctor_dataset_300K** is a larger dataset with 100K sentences for each language).

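The card does not spell out the exact transformations used to corrupt the clean sentences. Purely as an illustration of this kind of character-level noising -- the `corrupt` function and its probabilities below are hypothetical, not the actual data-generation code -- a minimal sketch could look like:

```python
import random

def corrupt(sentence: str, p_drop: float = 0.05, p_swap: float = 0.05, seed=None) -> str:
    """Randomly perturb characters of a clean sentence to create noisy training input.

    Each character is dropped with probability p_drop, or swapped with its
    neighbour with probability p_swap; otherwise it is kept as-is.
    """
    rng = random.Random(seed)
    chars = list(sentence)
    out = []
    i = 0
    while i < len(chars):
        r = rng.random()
        if r < p_drop:
            i += 1  # drop this character
            continue
        if r < p_drop + p_swap and i + 1 < len(chars):
            out += [chars[i + 1], chars[i]]  # swap with the next character
            i += 2
            continue
        out.append(chars[i])  # keep the character unchanged
        i += 1
    return "".join(out)

noisy = corrupt("I am a medical doctor.", seed=42)
```

A pair like `(noisy, original)` would then serve as one synthetic training example.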
## 6. Usage

### 6.1 Preprocessing
* Let us assume we have the following text (note that there are no punctuation marks in the text):

```python
text = "That is my job I am a medical doctor I save lives"
```
* You decided to extract the sentences and, for some obscure reason, you obtained these sentences:

```python
sentences = ["That is my job I a", "m a medical doct", "or I save lives"]
```
* You now wish to correct the sentence **"m a medical doct"**.

Here is the single preprocessing step for the model:

```python
input_text = "repair_sentence: " + sentences[1] + " context: {" + sentences[0] + "}{" + sentences[2] + "} </s>"
```

**Explanation**:<br/>
* We are telling the model to repair the sentence with the prefix "repair_sentence: "
* Then we append the sentence we want to repair, **sentences[1]**, which is "m a medical doct"
* Next we give some context to the model. In this case, the context is some text that occurred before the sentence and some text that appeared after the sentence in the original text.
* To do that, we append the keyword "context: "
* Append **{sentences[0]}** "{That is my job I a}". (Note how it is surrounded by curly braces.)
* Append **{sentences[2]}** "{or I save lives}".
* At last we tell the model this is the end of the input with `</s>`.

```python
print(input_text) # repair_sentence: m a medical doct context: {That is my job I a}{or I save lives} </s>
```

<br/>

**The context is optional**, so the input could also be ```repair_sentence: m a medical doct context: {}{} </s>```

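Since the format above is easy to get wrong by hand, the steps can be wrapped in a small helper. Note that `build_input` is our own illustrative name, not part of the model's code:

```python
def build_input(sentence: str, prefix_context: str = "", postfix_context: str = "") -> str:
    """Assemble the model input in the expected 'repair_sentence' format."""
    return (
        "repair_sentence: " + sentence
        + " context: {" + prefix_context + "}{" + postfix_context + "} </s>"
    )

input_text = build_input("m a medical doct", "That is my job I a", "or I save lives")
# -> repair_sentence: m a medical doct context: {That is my job I a}{or I save lives} </s>
```

Calling it with only the sentence, `build_input("m a medical doct")`, produces the context-free form `repair_sentence: m a medical doct context: {}{} </s>`.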
### 6.2 Inference

```python
from transformers import AutoTokenizer, AutoModelWithLMHead

tokenizer = AutoTokenizer.from_pretrained("flexudy/t5-base-multi-sentence-doctor")
model = AutoModelWithLMHead.from_pretrained("flexudy/t5-base-multi-sentence-doctor")

input_text = "repair_sentence: m a medical doct context: {That is my job I a}{or I save lives} </s>"

input_ids = tokenizer.encode(input_text, return_tensors="pt")
outputs = model.generate(input_ids, max_length=32, num_beams=1)
sentence = tokenizer.decode(outputs[0], skip_special_tokens=True, clean_up_tokenization_spaces=True)

assert sentence == "I am a medical doctor."
```

## 7. Fine-tuning
We also provide a script `train_any_t5_task.py` that might help you fine-tune any Text2Text task with T5. We added #TODO comments all over to help you train with ease. For example:

```python
# TODO Set your training epochs
config.TRAIN_EPOCHS = 3
```
If you don't want to read the #TODO comments, just pass in your data like this:

```python
# TODO Where is your data ? Enter the path
trainer.start("data/sentence_doctor_dataset_300.csv")
```
and voilà! Please feel free to correct any mistakes in the code and make a pull request.

## 8. Attribution
* [Huggingface](https://huggingface.co/) transformers lib for making this possible
* Abhishek Kumar Mishra's transformer [tutorial](https://github.com/abhimishra91/transformers-tutorials/blob/master/transformers_summarization_wandb.ipynb) on text summarisation. Our training code is just a modified version of their code. So many thanks.
* We fine-tuned this model from WikinewsSum/t5-base-multi-combine-wiki-news on the Hugging Face hub. Thanks to the [authors](https://huggingface.co/WikinewsSum)
* We also read a lot of work from [Suraj Patil](https://github.com/patil-suraj)
* No one has been forgotten, hopefully :)