# Wav2vec2-large
## Model description
This model is a pre-trained instance of the Wav2vec 2.0 architecture, specifically focused on processing and understanding four major African languages: Fongbe, Swahili, Amharic, and Wolof. The model leverages unlabelled audio data in these languages to learn rich, language-specific representations before any fine-tuning on downstream tasks.
## Training data
The model was pre-trained using a diverse set of audio recordings from the ALFFA dataset, covering the following languages:
- Fongbe: A Gbe language, primarily spoken in Benin and parts of Nigeria and Togo.
- Swahili: A Bantu language, widely spoken across East Africa including Tanzania, Kenya, Uganda, Rwanda, and Burundi.
- Amharic: The official language of Ethiopia, belonging to the Semitic branch of the Afroasiatic language family.
- Wolof: Predominantly spoken in Senegal, The Gambia, and Mauritania.
## Model architecture
This model uses the large variant of the wav2vec 2.0 architecture developed by Facebook AI. A multi-layer convolutional feature encoder processes the raw audio signal into latent speech representations, which a Transformer context network then turns into contextual representations. During self-supervised pre-training, spans of the latent representations are masked and the model learns to identify the correct quantized latent for each masked position via a contrastive objective, so no transcriptions are required.
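As a minimal sketch of the two stages, the snippet below uses the public facebook/wav2vec2-large configuration as a stand-in (an assumption: this card does not publish the exact hyperparameters of this checkpoint) and inspects output shapes with randomly initialised weights:

```python
import torch
from transformers import Wav2Vec2Config, Wav2Vec2Model

# Assumption: the reference "large" configuration from Facebook AI.
config = Wav2Vec2Config.from_pretrained("facebook/wav2vec2-large")
model = Wav2Vec2Model(config)  # randomly initialised, for shape inspection only
model.eval()

waveform = torch.randn(1, 16000)  # one second of 16 kHz audio

with torch.no_grad():
    outputs = model(waveform)

# The convolutional encoder downsamples raw audio by a factor of ~320,
# so one second yields ~49 frames of 1024-dimensional representations.
print(outputs.last_hidden_state.shape)  # torch.Size([1, 49, 1024])
```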
## Usage
This model is intended for use in Automatic Speech Recognition (ASR), audio classification, and other audio-related tasks in Fongbe, Swahili, Amharic, and Wolof. To fine-tune it on a specific task, load it via the Hugging Face Transformers library:
```python
from transformers import Wav2Vec2Processor, Wav2Vec2Model

processor = Wav2Vec2Processor.from_pretrained("your-username/wav2vec2-african-languages")
model = Wav2Vec2Model.from_pretrained("your-username/wav2vec2-african-languages")
```
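Continuing from the snippet above, a sketch of feature extraction (the repository name above is a placeholder, the audio file name below is hypothetical, and we assume 16 kHz mono input, the sampling rate wav2vec 2.0 expects):

```python
import torch
import torchaudio  # assumption: torchaudio is used here only to load and resample audio

# Hypothetical input file; any mono recording in one of the four languages.
waveform, sample_rate = torchaudio.load("example_fongbe.wav")
waveform = torchaudio.functional.resample(waveform, sample_rate, 16000)

inputs = processor(waveform.squeeze(0).numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    features = model(inputs.input_values).last_hidden_state  # (batch, frames, hidden_size)
```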
## Performance
The model's performance was evaluated using a held-out validation set of audio recordings. The effectiveness of the pre-trained representations was measured by how well they transfer when fine-tuned to downstream tasks such as ASR. Detailed performance figures will therefore depend on the fine-tuning setup and the quality of the labeled data used.
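For example, a common fine-tuning pattern for ASR attaches a CTC head on top of the pre-trained encoder. This is a sketch only: the vocabulary size and pad token id below are task-specific assumptions tied to your own tokenizer, not properties of this checkpoint.

```python
from transformers import Wav2Vec2ForCTC

# A randomly initialised CTC head is added on top of the pre-trained encoder.
model = Wav2Vec2ForCTC.from_pretrained(
    "your-username/wav2vec2-african-languages",  # placeholder name from the Usage section
    ctc_loss_reduction="mean",
    pad_token_id=0,   # assumption: pad token id of your task tokenizer
    vocab_size=32,    # assumption: size of your character vocabulary
)
model.freeze_feature_encoder()  # the conv encoder is commonly kept frozen during fine-tuning
```

From here, the model can be trained on (audio, transcription) pairs with the standard Transformers Trainer.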
## Limitations
The model may perform unevenly across the four languages, since the amount of pre-training audio differs per language. Performance may also degrade on audio that differs markedly from the training recordings (e.g., telephone-quality audio or noisy environments).