---
license: apache-2.0
language: en
library_name: transformers
tags:
- distilbert
datasets:
- Short Question Answer Assessment Dataset
---

# DistilBERT base uncased model for Short Question Answer Assessment

## Model description

DistilBERT is a transformers model, smaller and faster than BERT, which was pretrained on the same corpus in a
self-supervised fashion, using the BERT base model as a teacher. This means it was pretrained on the raw texts only,
with no humans labelling them in any way (which is why it can use lots of publicly available data), with an automatic
process to generate inputs and labels from those texts using the BERT base model.

This is a classification model that solves the Short Question Answer Assessment task, obtained by fine-tuning the
[pretrained DistilBERT model](https://huggingface.co/distilbert-base-uncased) on the
[Short Question Answer Assessment dataset](#).

## Intended uses & limitations

This model can only be used for questions and answers that are similar to the ones in the dataset of [Banjade et al.](https://aclanthology.org/W16-0520.pdf).

### How to use

You can use this model directly with a text-classification pipeline:

```python
>>> from transformers import pipeline
>>> classifier = pipeline("text-classification", model="Giyaseddin/distilbert-base-uncased-finetuned-short-answer-assessment", return_all_scores=True)
>>> context = "To rescue a child who has fallen down a well, rescue workers fasten him to a rope, the other end of which is then reeled in by a machine. The rope pulls the child straight upward at steady speed."
>>> question = "How does the amount of tension in the rope compare to the downward force of gravity acting on the child?"
>>> ref_answer = "Since the child is being raised straight upward at a constant speed, the net force on the child is zero and all the forces balance. That means that the tension in the rope balances the downward force of gravity."
>>> student_answer = "The tension force is higher than the force of gravity."
>>>
>>> # Join the four parts with the [SEP] separator the model was trained with
>>> body = " [SEP] ".join([context, question, ref_answer, student_answer])
>>> raw_results = classifier([body])
>>> raw_results
[[{'label': 'LABEL_0', 'score': 0.0004029414849355817},
  {'label': 'LABEL_1', 'score': 0.0005476847873069346},
  {'label': 'LABEL_2', 'score': 0.998059093952179},
  {'label': 'LABEL_3', 'score': 0.0009902542224153876}]]
>>> # Map the generic LABEL_i names back to the four assessment classes
>>> _LABELS_ID2NAME = {0: "correct", 1: "correct_but_incomplete", 2: "contradictory", 3: "incorrect"}
>>> results = []
>>> for result in raw_results:
...     for score in result:
...         results.append([
...             {_LABELS_ID2NAME[int(score["label"][-1:])]: "%.2f" % score["score"]}
...         ])
...
>>> results
[[{'correct': '0.00'}],
 [{'correct_but_incomplete': '0.00'}],
 [{'contradictory': '1.00'}],
 [{'incorrect': '0.00'}]]
```

### Limitations and bias

Even if the training data used for this model could be characterized as fairly neutral, this model can have biased
predictions. It also inherits some of
[the bias of its teacher model](https://huggingface.co/bert-base-uncased#limitations-and-bias).

This bias will also affect all fine-tuned versions of this model.

Another limitation of this model is input length: longer sequences can lead to wrong predictions, because after the
input parts are concatenated in the pre-processing phase, the student answer, which matters most, may be truncated
away.
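
One way to guard against this is to tokenize the concatenated input yourself and compare its length against the
model's limit before classifying. A minimal sketch, reusing the variables from the usage example above:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "Giyaseddin/distilbert-base-uncased-finetuned-short-answer-assessment"
)

body = " [SEP] ".join([context, question, ref_answer, student_answer])

# Tokenize without truncation to measure the true sequence length
n_tokens = len(tokenizer(body)["input_ids"])
if n_tokens > tokenizer.model_max_length:
    print(f"Input is {n_tokens} tokens; everything past "
          f"{tokenizer.model_max_length} will be truncated.")
```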

## Pre-training data

DistilBERT was pretrained on the same data as BERT, which is [BookCorpus](https://yknzhu.wixsite.com/mbweb), a dataset
consisting of 11,038 unpublished books, and [English Wikipedia](https://en.wikipedia.org/wiki/English_Wikipedia)
(excluding lists, tables and headers).

## Fine-tuning data

The annotated dataset consists of 900 students’ short constructed answers and their correctness in the given context.
Four qualitative levels of correctness are defined: correct, correct-but-incomplete, contradictory, and incorrect.

## Training procedure

### Preprocessing

In the preprocessing phase, the _question context_, the _question_, the _reference answer_, and the _student answer_
are concatenated using the separator `[SEP]`. This makes the full input:

```
[CLS] Context Sentence [SEP] Question Sentence [SEP] Reference Answer Sentence [SEP] Student Answer Sentence [SEP]
```

The data are split according to the following ratio:
- Training set 80%.
- Test set 20%.

Labels are mapped as: `{0: "correct", 1: "correct_but_incomplete", 2: "contradictory", 3: "incorrect"}`
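
A minimal sketch of this preprocessing and split, assuming the raw examples are plain Python dicts (the field names
here are illustrative, not the dataset's actual column names):

```python
from sklearn.model_selection import train_test_split

LABELS_NAME2ID = {"correct": 0, "correct_but_incomplete": 1, "contradictory": 2, "incorrect": 3}

def build_input(example):
    # The tokenizer later adds [CLS] at the start and [SEP] at the end,
    # producing the full format shown above.
    return " [SEP] ".join([
        example["context"],
        example["question"],
        example["reference_answer"],
        example["student_answer"],
    ])

texts = [build_input(ex) for ex in examples]               # `examples`: list of annotated dicts
labels = [LABELS_NAME2ID[ex["label"]] for ex in examples]

# 80/20 train/test split, as described above
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.2
)
```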

### Fine-tuning

The model was fine-tuned on a GeForce GTX 960M for 20 minutes. The parameters are:

| Parameter           | Value |
|:-------------------:|:-----:|
| Learning rate       | 5e-5  |
| Weight decay        | 0.01  |
| Training batch size | 8     |
| Epochs              | 4     |
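
A minimal sketch of a fine-tuning setup with these hyperparameters, using the `transformers` `Trainer` (the tokenized
dataset objects are assumed to be prepared beforehand; this is an illustration, not the exact training script):

```python
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=4
)

# Hyperparameters from the table above
training_args = TrainingArguments(
    output_dir="distilbert-short-answer-assessment",
    learning_rate=5e-5,
    weight_decay=0.01,
    per_device_train_batch_size=8,
    num_train_epochs=4,
    evaluation_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,  # tokenized train split, assumed prepared
    eval_dataset=eval_dataset,    # tokenized test split, assumed prepared
)
trainer.train()
```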

Here are the scores during training:

| Epoch | Training Loss | Validation Loss | Accuracy | F1       | Precision | Recall   |
|:-----:|:-------------:|:---------------:|:--------:|:--------:|:---------:|:--------:|
| 1     | No log        | 0.665765        | 0.755330 | 0.743574 | 0.781210  | 0.755330 |
| 2     | 0.932100      | 0.362124        | 0.890355 | 0.889875 | 0.891407  | 0.890355 |
| 3     | 0.364900      | 0.226225        | 0.942132 | 0.941802 | 0.942458  | 0.942132 |
| 4     | 0.176900      | 0.193660        | 0.954315 | 0.954175 | 0.954985  | 0.954315 |
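
The per-epoch metrics above look like weighted averages (weighted recall equals accuracy, as it does in every row
here). Under that assumption, a sketch of a `compute_metrics` function that would report them, for use with the
`Trainer` above:

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(eval_pred):
    # eval_pred is the (logits, labels) pair the Trainer passes in
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, predictions, average="weighted"
    )
    return {
        "accuracy": accuracy_score(labels, predictions),
        "f1": f1,
        "precision": precision,
        "recall": recall,
    }
```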

## Evaluation results

On the downstream Short Question Answer Assessment task, this model achieved the following results
(scores are rounded to 3 decimal places):

|                          | precision | recall | f1-score | support |
|:------------------------:|:---------:|:------:|:--------:|:-------:|
| _correct_                | 0.938     | 0.989  | 0.963    | 366     |
| _correct_but_incomplete_ | 0.975     | 0.922  | 0.948    | 257     |
| _contradictory_          | 0.946     | 0.938  | 0.942    | 113     |
| _incorrect_              | 0.963     | 0.944  | 0.953    | 249     |
| accuracy                 | -         | -      | 0.954    | 985     |
| macro avg                | 0.956     | 0.948  | 0.952    | 985     |
| weighted avg             | 0.955     | 0.954  | 0.954    | 985     |

Confusion matrix:

| Actual \ Predicted       | _correct_ | _correct_but_incomplete_ | _contradictory_ | _incorrect_ |
|:------------------------:|:---------:|:------------------------:|:---------------:|:-----------:|
| _correct_                | 362       | 4                        | 0               | 0           |
| _correct_but_incomplete_ | 13        | 237                      | 0               | 7           |
| _contradictory_          | 4         | 1                        | 106             | 2           |
| _incorrect_              | 7         | 1                        | 6               | 235         |

The AUC scores are: micro-average **0.9695** and macro-average **0.9659**.
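
These scores can be reproduced from the model's predicted class probabilities with scikit-learn, roughly as follows
(a sketch, assuming `y_true` holds the integer test labels and `y_prob` the per-class softmax probabilities):

```python
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import label_binarize

# y_true: shape (n_samples,) integer labels in {0, 1, 2, 3}
# y_prob: shape (n_samples, 4) softmax probabilities from the model
y_true_bin = label_binarize(y_true, classes=[0, 1, 2, 3])

# Micro-average pools all class/sample pairs into one binary problem;
# macro-average computes a one-vs-rest AUC per class, then averages.
micro_auc = roc_auc_score(y_true_bin, y_prob, average="micro")
macro_auc = roc_auc_score(y_true_bin, y_prob, average="macro")
```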