---
language:
- fa
license: apache-2.0
base_model: openai/whisper-large-v3
tags:
- generated_from_trainer
datasets:
- mozilla-foundation-common-voice-17-0
metrics:
- wer
model-index:
- name: Whisper LargeV3 Persian - Persian ASR
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: common-voice-17-0
      type: mozilla-foundation-common-voice-17-0
      config: default
      split: test[:10%]
      args: 'config: Persian, split: train[:10%]+validation[:10%]'
    metrics:
    - name: Wer
      type: wer
      value: 38.94514767932489
---

# Whisper LargeV3 Persian - Persian ASR

This model is a fine-tuned version of [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) on the Persian portion of the Common Voice 17.0 dataset.
It was trained for automatic speech recognition (ASR) and converts spoken Persian into text.
The sections below describe its performance, intended uses, training data, and training procedure.

It achieves the following results on the evaluation set:
- Loss: 0.4072
- Wer: 38.9451

## Model description

This model builds on the Whisper architecture, which is widely used for multilingual ASR.
Whisper models are trained on a large corpus of multilingual, multitask supervised data, which helps them generalize across languages, including lower-resource ones such as Persian.
This checkpoint is fine-tuned specifically on Persian speech, with the aim of improving accuracy on Persian speech recognition tasks.

## Intended uses & limitations

This model is designed for speech-to-text tasks in Persian, such as transcribing audio files or providing the recognition step in voice-controlled systems.
Accuracy may degrade on noisy audio, on accents that are underrepresented in the training data, and on highly technical vocabulary that the training data does not cover.
If your use case involves specialized language or contexts, further fine-tuning is recommended.

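
For transcribing an audio file, a minimal sketch with the Transformers `pipeline` API could look like the following. The function name `transcribe_persian` is hypothetical, and the card does not state this model's Hub repository id, so `model_id` is left as a parameter the caller must supply. The 30-second chunking and the explicit `language` and `task` settings are assumptions about a reasonable default, not settings confirmed by the card.

```python
def transcribe_persian(audio_path: str, model_id: str) -> str:
    """Transcribe one Persian audio file with a fine-tuned Whisper checkpoint.

    `model_id` is the Hub repository id (or local path) of the checkpoint.
    This card does not state it, so the caller has to provide it.
    """
    # Imported lazily so this module can be loaded without transformers installed.
    from transformers import pipeline

    asr = pipeline(
        "automatic-speech-recognition",
        model=model_id,
        chunk_length_s=30,  # assumption: split long recordings into 30-second chunks
    )
    # Force Persian transcription instead of relying on language auto-detection.
    result = asr(
        audio_path,
        generate_kwargs={"language": "persian", "task": "transcribe"},
    )
    return result["text"].strip()
```

Calling `transcribe_persian("sample_fa.wav", model_id)` with a real audio path and the actual repository id returns the transcript as a string. Decoding a file path this way relies on `ffmpeg` being available on the system.
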
## Training and evaluation data

The model was fine-tuned on the Persian portion of Common Voice 17.0, a crowd-sourced dataset containing diverse Persian voices.
The dataset provides training, validation, and test splits. According to the card metadata, training used `train[:10%]+validation[:10%]` and evaluation used `test[:10%]`, that is, the first 10% of each split rather than 10% of the total data.
The training data includes a variety of speakers, ages, and accents, which exposes the model to some dialectal variation, although an evaluation subset of this size gives only a rough estimate of the model's performance.

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 1e-05
- train_batch_size: 4
- eval_batch_size: 4
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08 (the usual default values)
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 500
- num_epochs: 1, meaning the model saw the training data once
- mixed_precision_training: Native AMP, which speeds up training by running parts of the computation in lower precision, typically without a significant loss in accuracy

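
To make the scheduler settings concrete, here is a small sketch of a linear schedule with warmup in plain Python. The function name `linear_schedule_lr` is hypothetical. The total of 987 steps is taken from the training results table, and the formula follows the common "linear warmup, then linear decay to zero" shape, which is assumed to match what the Trainer applied.

```python
def linear_schedule_lr(
    step: int,
    peak_lr: float = 1e-5,    # learning_rate from the hyperparameter list
    warmup_steps: int = 500,  # lr_scheduler_warmup_steps
    total_steps: int = 987,   # final step reported in the results table
) -> float:
    """Learning rate at a given optimizer step for linear warmup plus linear decay."""
    if step < warmup_steps:
        # Warmup phase: ramp linearly from 0 up to peak_lr.
        return peak_lr * step / max(1, warmup_steps)
    # Decay phase: fall linearly from peak_lr to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


# Print the rate at a few points of the run.
for s in (0, 250, 500, 750, 987):
    print(s, linear_schedule_lr(s))
```

Under these assumptions the warmup covers 500 of the 987 steps, so the learning rate is still ramping up for roughly the first half of the single epoch and sits near its peak only briefly.
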
### Training results

At the end of training, the model reached the following results:

- Training Loss: 0.2083 at the end of 1 epoch (987 steps).
- Validation Loss: 0.4072, measured on data the model was not trained on.
- Word Error Rate (WER): 38.9451%, the number of word-level substitutions, deletions, and insertions divided by the number of reference words, measured on the evaluation set.

| Training Loss | Epoch | Step | Validation Loss | Wer     |
|:-------------:|:-----:|:----:|:---------------:|:-------:|
| 0.2083        | 1.0   | 987  | 0.4072          | 38.9451 |

These results show that the model learns from the given dataset within a single epoch, though a WER near 39% leaves room for further optimization and fine-tuning.

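
As an illustration of how a WER figure like the one above is defined, the sketch below computes it from a word-level edit distance. The function name `word_error_rate` is hypothetical, and the card does not say which implementation or text normalization produced the reported value, so this shows the metric's definition and not the exact evaluation code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER in percent: word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return 100.0 * prev[-1] / len(ref)


# One substitution and one deletion against four reference words.
print(word_error_rate("a b c d", "a x c"))  # 50.0
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference, so it is not strictly the share of reference words that were wrong.
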
### Framework versions

The model was trained with the following library versions:

- Transformers 4.44.0, which provides the tools and APIs for fine-tuning transformer models such as Whisper.
- Pytorch 2.4.0+cu121, the deep learning framework used to build and train the model.
- Datasets 2.21.0, used to load and preprocess the Common Voice dataset.
- Tokenizers 0.19.1, used for the text tokenization the model requires.

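
When reproducing the setup, it can help to compare a local environment against the versions listed above. The following is a small sketch using only the standard library. The names `EXPECTED` and `compare_versions` are hypothetical, and the distribution names (for example `torch` for Pytorch) are the usual PyPI names, assumed here and not stated in the card.

```python
from importlib import metadata

# Versions listed in this card, keyed by the assumed PyPI distribution name.
EXPECTED = {
    "transformers": "4.44.0",
    "torch": "2.4.0+cu121",
    "datasets": "2.21.0",
    "tokenizers": "0.19.1",
}


def compare_versions(expected=EXPECTED):
    """Return {package: (expected, installed_or_None, matches)} for each package."""
    report = {}
    for package, wanted in expected.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed in this environment
        report[package] = (wanted, installed, installed == wanted)
    return report


for name, (wanted, installed, ok) in compare_versions().items():
    status = "ok" if ok else "differs"
    print(f"{name}: expected {wanted}, found {installed} ({status})")
```

A mismatch does not necessarily mean the model will fail to load. It only flags that the environment differs from the one used for training.
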