# Better Pre-trained wav2vec2 Models for Welsh Speech Recognition
At the moment, the best Welsh speech recognition wav2vec2 models are obtained by fine-tuning the [XLSR-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53) and [XLS-R-1b](https://huggingface.co/facebook/wav2vec2-xls-r-1b) pre-trained models by Facebook/Meta AI.
This model is an experiment investigating whether pre-trained models exposed to more Welsh-language speech could in turn lower WER scores even further in subsequent fine-tuned models. It is of very limited use for fine-tuning on any useful downstream task such as speech recognition.
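For reference, fine-tuning such a checkpoint for speech recognition typically follows the standard Hugging Face CTC recipe. The sketch below is illustrative only, not the exact recipe used for our models; the `vocab.json` character vocabulary is a placeholder you would build from your own Welsh transcriptions.

```python
from transformers import (
    Wav2Vec2CTCTokenizer,
    Wav2Vec2FeatureExtractor,
    Wav2Vec2ForCTC,
    Wav2Vec2Processor,
)

# Character-level CTC tokenizer; "vocab.json" is a placeholder built from
# the target-language (Welsh) transcriptions, not a file in this repository.
tokenizer = Wav2Vec2CTCTokenizer(
    "vocab.json", unk_token="[UNK]", pad_token="[PAD]", word_delimiter_token="|"
)
feature_extractor = Wav2Vec2FeatureExtractor(
    feature_size=1, sampling_rate=16_000, padding_value=0.0,
    do_normalize=True, return_attention_mask=True,
)
processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)

# Start from Meta AI's multilingual pre-trained encoder and add a fresh CTC head.
model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-xls-r-1b",
    ctc_loss_reduction="mean",
    pad_token_id=processor.tokenizer.pad_token_id,
    vocab_size=len(processor.tokenizer),
)

# The convolutional feature encoder is usually kept frozen during fine-tuning.
model.freeze_feature_encoder()
```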
## First Attempts with Self-Supervised Learning
Previous attempts drew heavily on the resources and documentation from the Hugging Face examples for creating pre-trained wav2vec2 models from scratch:
https://github.com/huggingface/transformers/tree/main/examples/pytorch/speech-pretraining
We used only 4,000 hours of Welsh and English speech audio collected from various channels on YouTube. The training set contained a balance of approximately 25% Welsh speech and 75% English speech. The English-language data, however, contains examples of Welsh-accented English speech and was therefore retained for pre-training.
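As a rough illustration of the self-supervised objective that the example script above trains, the following minimal sketch (adapted from the `Wav2Vec2ForPreTraining` documentation, with random audio standing in for the YouTube corpus and a default rather than XLS-R-sized configuration) computes one contrastive loss step:

```python
import torch
from transformers import Wav2Vec2Config, Wav2Vec2FeatureExtractor, Wav2Vec2ForPreTraining
from transformers.models.wav2vec2.modeling_wav2vec2 import (
    _compute_mask_indices,
    _sample_negative_indices,
)

config = Wav2Vec2Config()  # in practice, an XLS-R-sized configuration
model = Wav2Vec2ForPreTraining(config)
feature_extractor = Wav2Vec2FeatureExtractor(feature_size=1, sampling_rate=16_000)

# A batch of raw 16 kHz audio; random tensors stand in for real speech here.
raw_audio = [torch.randn(16_000 * 5).numpy() for _ in range(2)]
inputs = feature_extractor(raw_audio, sampling_rate=16_000, return_tensors="pt", padding=True)

batch_size, raw_length = inputs.input_values.shape
seq_length = model._get_feat_extract_output_lengths(raw_length).item()

# Mask a subset of the latent frames and sample negatives for the contrastive loss.
mask_time_indices = _compute_mask_indices(
    shape=(batch_size, seq_length), mask_prob=0.65, mask_length=10
)
sampled_negative_indices = _sample_negative_indices(
    features_shape=(batch_size, seq_length),
    num_negatives=model.config.num_negatives,
    mask_time_indices=mask_time_indices,
)

outputs = model(
    inputs.input_values,
    mask_time_indices=torch.tensor(mask_time_indices, dtype=torch.bool),
    sampled_negative_indices=torch.tensor(sampled_negative_indices, dtype=torch.long),
)
print(outputs.loss)  # contrastive + diversity loss minimised during pre-training
```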
The results of our self-supervised attempts can be accessed from revisions 22.10 and 24.03 of this model repository.
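A specific revision can be loaded for inspection with the `revision` argument to `from_pretrained`; the repository id below is a hypothetical placeholder for this repository's actual model id.

```python
from transformers import Wav2Vec2ForPreTraining

# Hypothetical placeholder — replace with this repository's actual model id.
repo_id = "techiaith/this-model"

model_2210 = Wav2Vec2ForPreTraining.from_pretrained(repo_id, revision="22.10")
model_2403 = Wav2Vec2ForPreTraining.from_pretrained(repo_id, revision="24.03")
```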
## Attempting to Fine-tune Meta AI Models with a Very Weak Dataset
The latest attempt investigates reverting to fine-tuning Meta AI's pre-trained models (xls-r-1b) with the YouTube speech data, which has been transcribed automatically with the best Whisper-based ASR models for Welsh and English: https://huggingface.co/techiaith/whisper-large-v3-ft-cv-cy-en
The transcriptions are of course not entirely correct, which is why we have termed it a very weak dataset. But since it is a much larger collection of speech than any other dataset for Welsh, we nevertheless wanted to experiment with what impact (if any) the speech audio may still have on the wav2vec2 encoders.
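A minimal sketch of how such weak transcriptions could be produced with the Whisper model linked above; the audio file path is a hypothetical example:

```python
from transformers import pipeline

# The bilingual Welsh/English Whisper model used to generate the weak labels.
asr = pipeline(
    "automatic-speech-recognition",
    model="techiaith/whisper-large-v3-ft-cv-cy-en",
    chunk_length_s=30,  # transcribe long YouTube audio in 30-second chunks
)

result = asr("clip_from_youtube.wav")  # hypothetical file path
print(result["text"])
```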
## Conclusion
As mentioned above, this model is not useful for any purpose until we have collected many more hours of speech. In the meantime, we have identified issues and limitations in our YouTube data, such as the quality of the speech audio and of the automatic transcriptions. Further work is required to correct these issues and/or to determine whether this is a feasible dataset.