benjleite committed on
Commit
3c4f12d
1 Parent(s): 09c33dc

Update README.md

  1. README.md +174 -3
README.md CHANGED
@@ -1,3 +1,174 @@
1
- ---
2
- license: apache-2.0
3
- ---
1
+ ---
2
+ language:
3
+ - es
4
+ tags:
5
+ - t5s
6
+ - Spanish
7
+ - text-generation
8
+ - question-answering
9
+ datasets:
10
+ - GEM/FairytaleQA
11
+ - benjleite/FairytaleQA-translated-spanish
12
+ license: apache-2.0
13
+ pipeline_tag: text-generation
14
+ ---
15
+
16
+ # Model Card for t5s-spanish-qa
17
+
18
+ ## Model Description
19
+
20
+ **t5s-spanish-qa** is a T5-based model, fine-tuned from [T5S](https://huggingface.co/vgaraujov/t5-base-spanish) on the **Spanish** [machine-translated version](https://huggingface.co/datasets/benjleite/FairytaleQA-translated-spanish) of the [original English FairytaleQA dataset](https://huggingface.co/datasets/GEM/FairytaleQA).
21
+ The model is fine-tuned for Question Answering. For details, see our [paper](https://arxiv.org/abs/2406.04233), accepted at ECTEL 2024.
22
+
23
+ ## Training Data
24
+ **FairytaleQA** is an open-source dataset designed to enhance comprehension of narratives, aimed at students from kindergarten to eighth grade. The dataset is meticulously annotated by education experts following an evidence-based theoretical framework. It comprises 10,580 explicit and implicit questions derived from 278 child-friendly stories, covering seven types of narrative elements or relations.
25
+
26
+ ## Implementation Details
27
+
28
+ The model input is the concatenation of the question and the text, which the encoder reads, and the decoder generates the answer. We use special labels to differentiate the components. The maximum input length is 512 tokens and the maximum output length is 128 tokens. During training, the model is trained for at most 20 epochs with early stopping (patience of 2) and a batch size of 16. During inference, we use beam search with a beam width of 5.
29
+
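As a minimal sketch of the input format described above, the helper below joins a question and a story text using the `<pregunta>` and `<texto>` labels that appear in the inference example. The function name `build_qa_input` and the whitespace stripping are assumptions made for illustration, not code from the original repository.

```python
# Hypothetical helper (not part of the original repository): builds the
# encoder input by prefixing each component with its special label.
QUESTION_LABEL = "<pregunta>"
TEXT_LABEL = "<texto>"


def build_qa_input(question: str, text: str) -> str:
    """Concatenate the question and the story text, each preceded by its label."""
    # Stripping surrounding whitespace is an assumption of this sketch.
    return f"{QUESTION_LABEL}{question.strip()}{TEXT_LABEL}{text.strip()}"


if __name__ == "__main__":
    print(build_qa_input(
        "¿Quién era Oso?",
        "Érase una vez un oso al que le gustaba pasear por el bosque...",
    ))
```

The resulting string can be passed to the tokenizer in place of a hand-built `input_text`.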
30
+ ## Evaluation - Question Answering
31
+
32
+ | Model | ROUGE-L F1 |
33
+ | ---------------- | ---------- |
34
+ | T5 (original English dataset, baseline) | 0.551 |
35
+ | t5s-spanish-qa (Spanish machine-translated dataset) | 0.382 |
36
+
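To make the metric in the table concrete, here is a small sketch of ROUGE-L F1 computed from the longest common subsequence (LCS) of two token lists. It assumes lowercasing and whitespace tokenization, which may differ from the scoring implementation used in the paper, so values from this sketch are not guaranteed to match the reported numbers.

```python
def lcs_length(a: list, b: list) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(prediction: str, reference: str) -> float:
    """ROUGE-L F1 between a predicted and a reference answer."""
    # Assumption: lowercase + whitespace tokenization.
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    lcs = lcs_length(pred_tokens, ref_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(pred_tokens)
    recall = lcs / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(rouge_l_f1("un oso grande", "un oso"))
```

A dataset-level score would typically be the average of this value over all question-answer pairs.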
37
+ ## Load Model and Tokenizer
38
+
39
+ ```py
40
+ >>> from transformers import T5ForConditionalGeneration, T5Tokenizer
41
+ >>> model = T5ForConditionalGeneration.from_pretrained("benjleite/t5s-spanish-qa")
42
+ >>> tokenizer = T5Tokenizer.from_pretrained("vgaraujov/t5-base-spanish", model_max_length=512)
43
+ ```
44
+ **Important Note**: Special tokens need to be added to the tokenizer, and the model's token embeddings must be resized accordingly:
45
+
46
+ ```py
47
+ >>> tokenizer.add_tokens(['<nar>', '<atributo>', '<pregunta>', '<respuesta>', '<tiporespuesta>', '<texto>'], special_tokens=True)
48
+ >>> model.resize_token_embeddings(len(tokenizer))
49
+ ```
50
+
51
+ ## Inference Example (same parameters as used in paper experiments)
52
+
53
+ Note: See our [repository](https://github.com/bernardoleite/fairytaleqa-translated) for additional code details.
54
+
55
+ ```py
56
+ input_text = '<pregunta>' + '¿Quién era Oso?' + '<texto>' + 'Érase una vez un oso al que le gustaba pasear por el bosque...'
57
+
58
+ source_encoding = tokenizer(
59
+ input_text,
60
+ max_length=512,
61
+ padding='max_length',
62
+ truncation='only_second',
63
+ return_attention_mask=True,
64
+ add_special_tokens=True,
65
+ return_tensors='pt'
66
+ )
67
+
68
+ input_ids = source_encoding['input_ids']
69
+ attention_mask = source_encoding['attention_mask']
70
+
71
+ generated_ids = model.generate(
72
+ input_ids=input_ids,
73
+ attention_mask=attention_mask,
74
+ num_return_sequences=1,
75
+ num_beams=5,
76
+ max_length=512,
77
+ repetition_penalty=1.0,
78
+ length_penalty=1.0,
79
+ early_stopping=True,
80
+ use_cache=True
81
+ )
82
+
83
+ preds = [
84
+ tokenizer.decode(generated_id, skip_special_tokens=False, clean_up_tokenization_spaces=True)
85
+ for generated_id in generated_ids
86
+ ]
87
+
88
+ generated_str = ''.join(preds)
89
+
90
+ print(generated_str)
91
+ ```
92
+
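Because the example decodes with `skip_special_tokens=False`, the printed string may still contain marker tokens. The sketch below shows one possible way to strip them. It assumes the standard T5 `<pad>` and `</s>` markers plus the labels added to the tokenizer earlier in this card; the helper name `clean_generated_text` is hypothetical and is not part of the original repository.

```python
import re

# Assumed marker list: T5's usual <pad> and </s> tokens plus the custom labels
# added to the tokenizer in this model card.
SPECIAL_MARKERS = [
    "<pad>", "</s>",
    "<nar>", "<atributo>", "<pregunta>", "<respuesta>", "<tiporespuesta>", "<texto>",
]
_MARKER_PATTERN = re.compile("|".join(re.escape(marker) for marker in SPECIAL_MARKERS))


def clean_generated_text(generated: str) -> str:
    """Remove marker tokens and collapse the remaining whitespace."""
    without_markers = _MARKER_PATTERN.sub(" ", generated)
    return " ".join(without_markers.split())


if __name__ == "__main__":
    print(clean_generated_text("<pad> Un oso del bosque.</s>"))
```

Applying the helper to `generated_str` leaves only the answer text for display or scoring.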
93
+ ## Licensing Information
94
+
95
+ This fine-tuned model is released under the [Apache-2.0 License](http://www.apache.org/licenses/LICENSE-2.0).
96
+
97
+ ## Citation Information
98
+
99
+ Our paper (preprint - accepted for publication at ECTEL 2024):
100
+
101
+ ```
102
+ @article{leite_fairytaleqa_translated_2024,
103
+ title={FairytaleQA Translated: Enabling Educational Question and Answer Generation in Less-Resourced Languages},
104
+ author={Bernardo Leite and Tomás Freitas Osório and Henrique Lopes Cardoso},
105
+ year={2024},
106
+ eprint={2406.04233},
107
+ archivePrefix={arXiv},
108
+ primaryClass={cs.CL}
109
+ }
110
+ ```
111
+
112
+ Original FairytaleQA paper:
113
+
114
+ ```
115
+ @inproceedings{xu-etal-2022-fantastic,
116
+ title = "Fantastic Questions and Where to Find Them: {F}airytale{QA} {--} An Authentic Dataset for Narrative Comprehension",
117
+ author = "Xu, Ying and
118
+ Wang, Dakuo and
119
+ Yu, Mo and
120
+ Ritchie, Daniel and
121
+ Yao, Bingsheng and
122
+ Wu, Tongshuang and
123
+ Zhang, Zheng and
124
+ Li, Toby and
125
+ Bradford, Nora and
126
+ Sun, Branda and
127
+ Hoang, Tran and
128
+ Sang, Yisi and
129
+ Hou, Yufang and
130
+ Ma, Xiaojuan and
131
+ Yang, Diyi and
132
+ Peng, Nanyun and
133
+ Yu, Zhou and
134
+ Warschauer, Mark",
135
+ editor = "Muresan, Smaranda and
136
+ Nakov, Preslav and
137
+ Villavicencio, Aline",
138
+ booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
139
+ month = may,
140
+ year = "2022",
141
+ address = "Dublin, Ireland",
142
+ publisher = "Association for Computational Linguistics",
143
+ url = "https://aclanthology.org/2022.acl-long.34",
144
+ doi = "10.18653/v1/2022.acl-long.34",
145
+ pages = "447--460",
146
+ abstract = "Question answering (QA) is a fundamental means to facilitate assessment and training of narrative comprehension skills for both machines and young children, yet there is scarcity of high-quality QA datasets carefully designed to serve this purpose. In particular, existing datasets rarely distinguish fine-grained reading skills, such as the understanding of varying narrative elements. Drawing on the reading education research, we introduce FairytaleQA, a dataset focusing on narrative comprehension of kindergarten to eighth-grade students. Generated by educational experts based on an evidence-based theoretical framework, FairytaleQA consists of 10,580 explicit and implicit questions derived from 278 children-friendly stories, covering seven types of narrative elements or relations. Our dataset is valuable in two folds: First, we ran existing QA models on our dataset and confirmed that this annotation helps assess models{'} fine-grained learning skills. Second, the dataset supports question generation (QG) task in the education domain. Through benchmarking with QG models, we show that the QG model trained on FairytaleQA is capable of asking high-quality and more diverse questions.",
147
+ }
148
+ ```
149
+
150
+ T5S model:
151
+
152
+ ```
153
+ @inproceedings{araujo-etal-2024-sequence-sequence,
154
+ title = "Sequence-to-Sequence {S}panish Pre-trained Language Models",
155
+ author = "Araujo, Vladimir and
156
+ Trusca, Maria Mihaela and
157
+ Tufi{\~n}o, Rodrigo and
158
+ Moens, Marie-Francine",
159
+ editor = "Calzolari, Nicoletta and
160
+ Kan, Min-Yen and
161
+ Hoste, Veronique and
162
+ Lenci, Alessandro and
163
+ Sakti, Sakriani and
164
+ Xue, Nianwen",
165
+ booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
166
+ month = may,
167
+ year = "2024",
168
+ address = "Torino, Italia",
169
+ publisher = "ELRA and ICCL",
170
+ url = "https://aclanthology.org/2024.lrec-main.1283",
171
+ pages = "14729--14743",
172
+ abstract = "In recent years, significant advancements in pre-trained language models have driven the creation of numerous non-English language variants, with a particular emphasis on encoder-only and decoder-only architectures. While Spanish language models based on BERT and GPT have demonstrated proficiency in natural language understanding and generation, there remains a noticeable scarcity of encoder-decoder models explicitly designed for sequence-to-sequence tasks, which aim to map input sequences to generate output sequences conditionally. This paper breaks new ground by introducing the implementation and evaluation of renowned encoder-decoder architectures exclusively pre-trained on Spanish corpora. Specifically, we present Spanish versions of BART, T5, and BERT2BERT-style models and subject them to a comprehensive assessment across various sequence-to-sequence tasks, including summarization, question answering, split-and-rephrase, dialogue, and translation. Our findings underscore the competitive performance of all models, with the BART- and T5-based models emerging as top performers across all tasks. We have made all models publicly available to the research community to foster future explorations and advancements in Spanish NLP: https://github.com/vgaraujov/Seq2Seq-Spanish-PLMs.",
173
+ }
174
+ ```