---
language: en
license: cc-by-4.0
tags:
- question-answering
datasets:
- squad_v2
metrics:
- f1
- exact
widget:
- context: >-
    DeBERTa improves the BERT and RoBERTa models using disentangled attention
    and an enhanced mask decoder. With those two improvements, DeBERTa
    outperforms RoBERTa on a majority of NLU tasks with 80GB of training data.
    In DeBERTa V3, we further improved the efficiency of DeBERTa using
    ELECTRA-style pre-training with Gradient Disentangled Embedding Sharing.
    Compared to DeBERTa, our V3 version significantly improves the model
    performance on downstream tasks. You can find more technical details about
    the new model in our paper. Please check the official repository for more
    implementation details and updates.
  example_title: DeBERTa v3 Q1
  text: How is DeBERTa version 3 different than previous ones?
- context: >-
    DeBERTa improves the BERT and RoBERTa models using disentangled attention
    and an enhanced mask decoder. With those two improvements, DeBERTa
    outperforms RoBERTa on a majority of NLU tasks with 80GB of training data.
    In DeBERTa V3, we further improved the efficiency of DeBERTa using
    ELECTRA-style pre-training with Gradient Disentangled Embedding Sharing.
    Compared to DeBERTa, our V3 version significantly improves the model
    performance on downstream tasks. You can find more technical details about
    the new model in our paper. Please check the official repository for more
    implementation details and updates.
  example_title: DeBERTa v3 Q2
  text: Where do I go to see new info about DeBERTa?
model-index:
- name: DeBERTa v3 xsmall squad2
  results:
  - task:
      type: question-answering
      name: Question Answering
    dataset:
      name: SQuAD2.0
      type: squad_v2
    metrics:
    - type: f1
      value: 81.5
      name: F1
    - type: exact
      value: 78.3
      name: Exact Match
  - task:
      type: question-answering
      name: Question Answering
    dataset:
      name: squad_v2
      type: squad_v2
      config: squad_v2
      split: validation
    metrics:
    - type: exact_match
      value: 78.5341
      name: Exact Match
      verified: true
      verifyToken: >-
        eyJhbGciOiJFZERTQSIsInR5cCI6IkpXVCJ9.eyJoYXNoIjoiZTk0ZGQ1YjU1YmQ5NTc2M2RmNjg2OGViYjcyODZkOTc1MDBkNmI5MDc0MzEyMzZmNDg3Yzc4ZTA3ZjAwM2M5ZiIsInZlcnNpb24iOjF9.ewKF-UetUoxKDeXgnM6vqy8nBC9c3qh7dLZhdQlgSxPut3LjAhpCh2fJGir-OVcfzWzxsPhcZQEpdnxR8oZnAA
    - type: f1
      value: 81.6408
      name: F1
      verified: true
      verifyToken: >-
        eyJhbGciOiJFZERTQSIsInR5cCI6IkpXVCJ9.eyJoYXNoIjoiOTQwZDdjY2ZlOGVhM2E5NGM3OGNkNTk2NWFkYTg1Y2Q0YWFlYWJmMGIyZWM5ZjMyYTYyODUzMDA0NWU0ZGVkZCIsInZlcnNpb24iOjF9.BHJNhS1YisUIkjcpIMdwXurTewak9dkkpGXC2vHvUB4qUEuk_p3V-orhmeFyTxzLaWRwrZVGVz-NSfqFr4n1Ag
    - type: total
      value: 11870
      name: total
      verified: true
      verifyToken: >-
        eyJhbGciOiJFZERTQSIsInR5cCI6IkpXVCJ9.eyJoYXNoIjoiNzNiZDQ3MDAyNzljMDI4NTRlYzZiZjE4ODJhZDhmZWE2ZjcwNjg2ZWJmNjUyMTUzZDk4ODNjNDExYTk1YWNlOCIsInZlcnNpb24iOjF9.3BlfmMvbV86Ua39ToqnMmgpGS0ZTew0UFFYWGyTkS3u7jaAXCfYkFkNJXw806f2uFFkKr1hqlzzKfivV0wUjCg
  - task:
      type: question-answering
      name: Question Answering
    dataset:
      name: squad
      type: squad
      config: plain_text
      split: validation
    metrics:
    - type: exact_match
      value: 84.1741
      name: Exact Match
      verified: true
      verifyToken: >-
        eyJhbGciOiJFZERTQSIsInR5cCI6IkpXVCJ9.eyJoYXNoIjoiYTA0MDVlYWI5NzdiNjllM2NmZTYwYmQ5YzE0ODgwOTA3MWZjZDkxNDFmZDM1OTQzMzgwNWI4NDc5NThhM2VhZSIsInZlcnNpb24iOjF9.lc2nUBxSu2_0_a5lyVsV51UAmkE8WHDTwGHvt3n9zvCbcJ1ylOg2xovF0_j0hZS16lv1DEw5XV8EW_ZS7mfvBg
    - type: f1
      value: 91.0771
      name: F1
      verified: true
      verifyToken: >-
        eyJhbGciOiJFZERTQSIsInR5cCI6IkpXVCJ9.eyJoYXNoIjoiODQxMjkxOWJlZTc2MmE5YzVmMjNhOTkwNDdiMDBhNWUwMDU3MDI1MmJiNDY4MjczYjIwM2U1NDhlYmZlZWQwMSIsInZlcnNpb24iOjF9.x_axHiBX5d3UIi1UbJT3kVbdX4kX9XFLQSg-l16-AAK9tiyutT-yaYJOi8LSb2lR4677tJpf3itu4eriJRU2Cg
---

# DeBERTa v3 xsmall SQuAD 2.0

Microsoft reports that this model can reach 84.8/82.0 F1/EM on the SQuAD 2.0 dev set.

I got 81.5/78.3, but I only did a single run and did not use the official SQuAD 2.0 evaluation script. I will do more runs and report results from the official script soon.
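
To make the intended usage concrete, here is a minimal sketch of extractive question answering with this checkpoint via the `transformers` pipeline. The repository ID below is a placeholder (the exact Hub ID is not stated in this card); replace it with this model's actual ID.

```python
from transformers import pipeline

# Placeholder Hub ID -- replace with this model's actual repository ID.
model_id = "your-username/deberta-v3-xsmall-squad2"

qa = pipeline("question-answering", model=model_id, tokenizer=model_id)

context = (
    "DeBERTa improves the BERT and RoBERTa models using disentangled attention "
    "and an enhanced mask decoder. In DeBERTa V3, we further improved the "
    "efficiency of DeBERTa using ELECTRA-style pre-training with Gradient "
    "Disentangled Embedding Sharing."
)

result = qa(
    question="How is DeBERTa version 3 different than previous ones?",
    context=context,
    handle_impossible_answer=True,  # SQuAD 2.0 includes unanswerable questions
)
print(result["answer"], round(result["score"], 3))
```

Because the self-reported numbers above were not produced with the official SQuAD 2.0 scoring code, here is a sketch of how EM/F1 could be computed with the `evaluate` library's `squad_v2` metric, which follows the official evaluation logic. The single prediction/reference pair is a toy example for illustration only.

```python
import evaluate

squad_v2_metric = evaluate.load("squad_v2")

# Toy example: one prediction paired with its gold answers.
predictions = [
    {
        "id": "example-0",
        "prediction_text": "disentangled attention",
        "no_answer_probability": 0.0,  # required field for the squad_v2 metric
    }
]
references = [
    {
        "id": "example-0",
        "answers": {"text": ["disentangled attention"], "answer_start": [52]},
    }
]

scores = squad_v2_metric.compute(predictions=predictions, references=references)
print(scores["exact"], scores["f1"])
```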