bart-base-spelling-nl
This model is a Dutch fine-tuned version of facebook/bart-base.
It achieves the following results on an external evaluation set of human-corrected spelling errors in Dutch snippets of internet text (errors and corrections, run spell.py):
- CER - 0.024
- WER - 0.088
- BLEU - 0.840
- METEOR - 0.932
Note that it is very hard for any spelling corrector to fix more genuine spelling errors than it introduces. In other words, most spelling correctors cannot be run fully automatically and must be used interactively.
These are the baseline scores obtained when correcting nothing, which a useful corrector has to beat. In other words, this is the actual distance between the errors and their corrections in the evaluation set:
- CER - 0.010
- WER - 0.053
- BLEU - 0.900
- METEOR - 0.954
We are not there yet, clearly.
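To make the comparison concrete, the following is a minimal sketch of how CER and WER can be computed from edit distance, and how a corrector's output compares against the "correcting nothing" baseline. It is not the evaluation script itself (spell.py may compute the metrics differently), and the example sentences are hypothetical, not taken from the evaluation set.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: character edits divided by reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word error rate: word edits divided by number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


# Hypothetical example, not from the evaluation set.
source = "ik wordt morgen verwacht"        # text containing a spelling error
reference = "ik word morgen verwacht"      # human correction
model_output = "ik wordt morgen verwach"   # misses the error, adds a new one

baseline_cer, baseline_wer = cer(reference, source), wer(reference, source)
model_cer, model_wer = cer(reference, model_output), wer(reference, model_output)

print(f"baseline: CER={baseline_cer:.3f} WER={baseline_wer:.3f}")
print(f"model:    CER={model_cer:.3f} WER={model_wer:.3f}")
```

In this made-up case the corrector scores worse than leaving the text untouched, which is the situation the two score lists above describe at corpus level.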
Model description
This is a version of facebook/bart-base fine-tuned for spelling correction. It builds on the excellent work by Oliver Guhr (github, huggingface). Training was performed on a single GPU of an AWS EC2 instance (g5.xlarge) and took about two days.
Intended uses & limitations
This model is intended to be used as a component of the Valkuil.net context-sensitive spelling checker.
Training and evaluation data
The model was trained on a Dutch dataset composed of 12,351,203 lines of text, containing a total of 123,131,153 words, from three public Dutch sources, downloaded from the Opus corpus:
- nl-europarlv7.txt (2,387,000 lines)
- nl-opensubtitles2016.9m.txt (9,000,000 lines)
- nl-wikipedia.txt (964,203 lines)
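As a quick sanity check on the figures above, this short sketch sums the per-source line counts and derives each source's share and the average line length. All numbers are copied from the list above; nothing is read from the actual files.

```python
# Line counts per source, as listed above.
corpus_lines = {
    "nl-europarlv7.txt": 2_387_000,
    "nl-opensubtitles2016.9m.txt": 9_000_000,
    "nl-wikipedia.txt": 964_203,
}
total_words = 123_131_153  # total word count stated above

total_lines = sum(corpus_lines.values())
shares = {name: n / total_lines for name, n in corpus_lines.items()}
avg_words_per_line = total_words / total_lines

for name, share in shares.items():
    print(f"{name}: {share:.1%} of lines")
print(f"total lines: {total_lines:,}")
print(f"average words per line: {avg_words_per_line:.2f}")
```

The per-source counts add up to the stated total, and subtitles account for about 73% of all lines, so the training data is dominated by subtitle text.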
Training procedure
Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 0.0003
- train_batch_size: 2
- eval_batch_size: 4
- seed: 42
- gradient_accumulation_steps: 16
- total_train_batch_size: 32
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 2.0
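The relation between the batch-size settings and the shape of the linear learning-rate schedule can be sketched as follows. This is an illustration in plain Python, not the training script: the zero-warmup setting and the `total_steps` value are assumptions (neither is stated above), and `linear_lr` is a hypothetical helper that mirrors the usual linear-decay-to-zero rule.

```python
train_batch_size = 2
gradient_accumulation_steps = 16
num_devices = 1        # single GPU, as stated in the model description
learning_rate = 3e-4

# Effective batch size per optimizer step.
total_train_batch_size = (
    train_batch_size * gradient_accumulation_steps * num_devices
)


def linear_lr(step, total_steps, base_lr=learning_rate, warmup_steps=0):
    """Linear warmup (optional) followed by linear decay to zero.

    warmup_steps=0 is an assumption; the card does not state a warmup.
    """
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


total_steps = 1000  # hypothetical; the real number of optimizer steps is not given
print(f"effective batch size: {total_train_batch_size}")
for step in (0, total_steps // 2, total_steps):
    print(f"step {step}: lr = {linear_lr(step, total_steps):.6f}")
```

With a per-device batch of 2 and 16 accumulation steps on one GPU, each optimizer update sees 32 examples, which matches the reported total_train_batch_size.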
Framework versions
- Transformers 4.27.3
- PyTorch 2.0.0+cu117
- Datasets 2.10.1
- Tokenizers 0.13.2