Update README.md
README.md
CHANGED
@@ -41,64 +41,6 @@ It achieves the following results on the evaluation set:
Please check the [official repository](https://github.com/microsoft/DeBERTa) for more details and updates.
In [DeBERTa V3](https://arxiv.org/abs/2111.09543), we replaced the MLM objective with the RTD (Replaced Token Detection) objective introduced by ELECTRA for pre-training, along with some innovations to be described in our upcoming paper. Compared to DeBERTa-V2, our V3 version significantly improves model performance on downstream tasks. You can find a brief introduction to the model in Appendix A11 of our original [paper](https://arxiv.org/abs/2006.03654); we will provide more details in a separate write-up.
The DeBERTa V3 small model comes with 6 layers and a hidden size of 768. Its total parameter count is 143M, since we use a vocabulary of 128K tokens, which contributes 98M parameters to the embedding layer. This model was trained with the same 160GB data as DeBERTa V2.
-#### Fine-tuning on NLU tasks
-We present the dev results on SQuAD 1.1/2.0 and MNLI tasks.
-| Model | SQuAD 1.1 | SQuAD 2.0 | MNLI-m |
-|-------------------|-----------|-----------|--------|
-| RoBERTa-base | 91.5/84.6 | 83.7/80.5 | 87.6 |
-| XLNet-base | -/- | -/80.2 | 86.8 |
-| DeBERTa-base | 93.1/87.2 | 86.2/83.1 | 88.8 |
-| **DeBERTa-v3-small** | -/- | -/- | 88.2 |
-| DeBERTa-v3-small+SiFT | -/- | -/- | 88.8 |
-#### Fine-tuning with HF transformers
-```bash
-#!/bin/bash
-cd transformers/examples/pytorch/text-classification/
-pip install datasets
-export TASK_NAME=mnli
-output_dir="ds_results"
-num_gpus=8
-batch_size=8
-python -m torch.distributed.launch --nproc_per_node=${num_gpus} \
-  run_glue.py \
-  --model_name_or_path microsoft/deberta-v3-small \
-  --task_name $TASK_NAME \
-  --do_train \
-  --do_eval \
-  --evaluation_strategy steps \
-  --max_seq_length 256 \
-  --warmup_steps 1000 \
-  --per_device_train_batch_size ${batch_size} \
-  --learning_rate 3e-5 \
-  --num_train_epochs 3 \
-  --output_dir $output_dir \
-  --overwrite_output_dir \
-  --logging_steps 1000 \
-  --logging_dir $output_dir
-```
-### Citation
-If you find DeBERTa useful for your work, please cite the following paper:
-``` latex
-@misc{he2021debertav3,
-  title={DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing},
-  author={Pengcheng He and Jianfeng Gao and Weizhu Chen},
-  year={2021},
-  eprint={2111.09543},
-  archivePrefix={arXiv},
-  primaryClass={cs.CL}
-}
-```
-``` latex
-@inproceedings{
-  he2021deberta,
-  title={DEBERTA: DECODING-ENHANCED BERT WITH DISENTANGLED ATTENTION},
-  author={Pengcheng He and Xiaodong Liu and Jianfeng Gao and Weizhu Chen},
-  booktitle={International Conference on Learning Representations},
-  year={2021},
-  url={https://openreview.net/forum?id=XPZIaotutsD}
-}
-```
-

## Intended uses & limitations

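The model description retained above states that the 128K-token vocabulary accounts for 98M of the 143M parameters (roughly 128K × 768 ≈ 98M). As a minimal sketch (assuming `transformers`, `torch`, and `sentencepiece` are installed), the checkpoint can be loaded and this parameter split checked as follows:

```python
# Minimal sketch (illustration, not part of the original card): load
# microsoft/deberta-v3-small and check the parameter split described above.
# Assumes `transformers`, `torch`, and `sentencepiece` are installed.
from transformers import AutoModel, AutoTokenizer

model_name = "microsoft/deberta-v3-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

total_params = sum(p.numel() for p in model.parameters())
embedding_params = model.get_input_embeddings().weight.numel()

print(f"vocabulary size:      {len(tokenizer):,}")    # ~128K tokens
print(f"embedding parameters: {embedding_params:,}")  # ~98M (vocab size x hidden size 768)
print(f"total parameters:     {total_params:,}")      # roughly 143M in total
```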