---
tags:
- question-answering
datasets:
- squad_v2
metrics:
- f1
- exact
widget:
- context: >-
    While deep and large pre-trained models are the state-of-the-art for
    various natural language processing tasks, their huge size poses
    significant challenges for practical uses in resource constrained
    settings. Recent works in knowledge distillation propose task-agnostic as
    well as task-specific methods to compress these models, with task-specific
    ones often yielding higher compression rate. In this work, we develop a
    new task-agnostic distillation framework XtremeDistilTransformers that
    leverages the advantage of task-specific methods for learning a small
    universal model that can be applied to arbitrary tasks and languages. To
    this end, we study the transferability of several source tasks,
    augmentation resources and model architecture for distillation. We
    evaluate our model performance on multiple tasks, including the General
    Language Understanding Evaluation (GLUE) benchmark, SQuAD question
    answering dataset and a massive multi-lingual NER dataset with 41
    languages.
  example_title: xtremedistil q1
  text: What is XtremeDistil?
- context: >-
    While deep and large pre-trained models are the state-of-the-art for
    various natural language processing tasks, their huge size poses
    significant challenges for practical uses in resource constrained
    settings. Recent works in knowledge distillation propose task-agnostic as
    well as task-specific methods to compress these models, with task-specific
    ones often yielding higher compression rate. In this work, we develop a
    new task-agnostic distillation framework XtremeDistilTransformers that
    leverages the advantage of task-specific methods for learning a small
    universal model that can be applied to arbitrary tasks and languages. To
    this end, we study the transferability of several source tasks,
    augmentation resources and model architecture for distillation. We
    evaluate our model performance on multiple tasks, including the General
    Language Understanding Evaluation (GLUE) benchmark, SQuAD question
    answering dataset and a massive multi-lingual NER dataset with 41
    languages.
  example_title: xtremedistil q2
  text: On what is the model validated?
model-index:
- name: nbroad/xdistil-l12-h384-squad2
  results:
  - task:
      type: question-answering
      name: Question Answering
    dataset:
      name: squad_v2
      type: squad_v2
      config: squad_v2
      split: validation
    metrics:
    - type: exact_match
      value: 75.4591
      name: Exact Match
      verified: true
      verifyToken: >-
        eyJhbGciOiJFZERTQSIsInR5cCI6IkpXVCJ9.eyJoYXNoIjoiM2QzODE0YTE5ZjMyMWY3NzdjNjcwZDJjY2YyMjBkMWJjMTg3ZDAwYmUwNzU3ZTlkODhmM2VhMWFkY2I2ZjgzMyIsInZlcnNpb24iOjF9.IEjMS4U3uuSP6PfRcD87VFHBIdhoDsIfXkAYV7sz_bveSqhTE16VKJzHaDilCkUCBHYGTjoZ7pDqlYDcF6NKCQ
    - type: f1
      value: 79.3321
      name: F1
      verified: true
      verifyToken: >-
        eyJhbGciOiJFZERTQSIsInR5cCI6IkpXVCJ9.eyJoYXNoIjoiMjAxMDdkNzcxNjAzNzQ4N2MwN2Y3ZDZhOGM5MmU0MzYyOGFjNDM3NjJkNGUzYTkyYmY3MDk1ZGIxYzQ0ZDllMyIsInZlcnNpb24iOjF9.N0jPenoMpxbTzKeJciDfoXiLronfXx3uM-A9NEJCMQ9tiApF-EyNmh4F-G9GBXdbVsq1IZ3MbPto0mn0P9hADQ
  - task:
      type: question-answering
      name: Question Answering
    dataset:
      name: squad
      type: squad
      config: plain_text
      split: validation
    metrics:
    - type: exact_match
      value: 81.8604
      name: Exact Match
      verified: true
      verifyToken: >-
        eyJhbGciOiJFZERTQSIsInR5cCI6IkpXVCJ9.eyJoYXNoIjoiMzRiYjBkYTU0MGRjZDZhNzY2MDZhMGYzZDY2NDU2MTMyMjk0M2YwNTcxZjkyMDNkYTE0YTA5ODVlY2EwOWIxYyIsInZlcnNpb24iOjF9.3jco8t0D7YkHtWHWRttV3y3L0ylQZj3y534HtIW7NuUX34nvVSGMzHVJ32BgaFDomOtnJkaSQFXmumO10FL2BA
    - type: f1
      value: 89.6654
      name: F1
      verified: true
      verifyToken: >-
        eyJhbGciOiJFZERTQSIsInR5cCI6IkpXVCJ9.eyJoYXNoIjoiZjg5YzNmODRlMTM1ZWQ1MjYwYzVkZmJhMzAwMDMzZGQyYzE1MzFlZGFlYmI4Y2JlMTQyNTBkZDRhMWQxYWQ2MCIsInZlcnNpb24iOjF9.Ld2IHVoqmZ-YFx71FgpuoVDEmAAboxRvhke1DhJYLbdIefM-AH60-58OlZcfZGxgUv6fywGjoPCE9g7CxbSzAQ
---
This is xtremedistil-l12-h384 trained on SQuAD 2.0.
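A minimal usage sketch with the Transformers question-answering pipeline; the model id comes from the model-index above, and the question/context pair is borrowed (abridged) from the first widget example. Exact output values will vary with your transformers version:

```python
from transformers import pipeline

# Question-answering pipeline with this model (id taken from the model-index above)
qa = pipeline("question-answering", model="nbroad/xdistil-l12-h384-squad2")

# Context and question borrowed from the first widget example (abridged)
context = (
    "In this work, we develop a new task-agnostic distillation framework "
    "XtremeDistilTransformers that leverages the advantage of task-specific "
    "methods for learning a small universal model that can be applied to "
    "arbitrary tasks and languages."
)
result = qa(question="What is XtremeDistil?", context=context)
print(result)  # {'score': ..., 'start': ..., 'end': ..., 'answer': ...}

# SQuAD 2.0 includes unanswerable questions; this flag lets the pipeline
# return an empty answer when nothing in the context fits
result = qa(
    question="What is XtremeDistil?",
    context=context,
    handle_impossible_answer=True,
)
```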
"eval_exact": 75.45691906005221
"eval_f1": 79.32502968532793