language:
  - fa
license: apache-2.0
base_model: openai/whisper-large-v3
tags:
  - generated_from_trainer
datasets:
  - mozilla-foundation-common-voice-17-0
metrics:
  - wer
model-index:
  - name: Whisper LargeV3 Persian - Persian ASR
    results:
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: common-voice-17-0
          type: mozilla-foundation-common-voice-17-0
          config: default
          split: test[:10%]
          args: 'config: Persian, split: train[:10%]+validation[:10%]'
        metrics:
          - name: Wer
            type: wer
            value: 38.94514767932489

Whisper LargeV3 Persian - Persian ASR

This model is a fine-tuned version of openai/whisper-large-v3 on the Persian portion of the Common Voice 17.0 dataset. It is trained for Automatic Speech Recognition (ASR), converting spoken Persian into text. The sections below describe its performance, intended uses, training data, and training procedure. It achieves the following results on the evaluation set:

  • Loss: 0.4072
  • Wer: 38.9451

Model description

This model leverages the Whisper architecture, known for its effectiveness in multilingual ASR tasks. Whisper models are trained on a large corpus of multilingual and multitask supervised data, enabling them to generalize well across different languages, including low-resource languages like Persian. This fine-tuned model is specifically adapted for Persian, improving its accuracy on Persian speech recognition tasks.

Intended uses & limitations

This model is designed for speech-to-text tasks in the Persian language. It can be used for applications like transcription of audio files, voice-controlled systems, and any task requiring accurate conversion of spoken Persian into text. However, the model may have limitations when dealing with noisy audio environments, diverse accents, or highly technical vocabulary not present in the training data. It's recommended to fine-tune the model further if your use case involves specialized language or contexts.
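
The transcription use case above can be sketched with the Transformers ASR pipeline. This is a minimal sketch, not the author's own inference code: the card does not state this model's Hub repository id, so `model_id` is a parameter you must supply, and `generation_options` and `transcribe` are hypothetical helper names. The 30-second chunking is an assumption chosen for long recordings.

```python
from typing import Any, Dict


def generation_options() -> Dict[str, Any]:
    """Decoding options that pin Whisper to Persian transcription.

    Without them, Whisper tries to detect the language on its own.
    """
    return {"language": "persian", "task": "transcribe"}


def transcribe(audio_path: str, model_id: str) -> str:
    """Transcribe one audio file with a fine-tuned Whisper checkpoint.

    model_id is the Hub repository id of this model; it is not given in
    the card, so the caller has to provide it.
    """
    # Imported lazily so the helper above is usable without transformers.
    from transformers import pipeline

    asr = pipeline(
        "automatic-speech-recognition",
        model=model_id,
        chunk_length_s=30,  # assumption: split long audio into 30 s windows
    )
    result = asr(audio_path, generate_kwargs=generation_options())
    return result["text"]
```

Calling `transcribe("sample.wav", model_id)` downloads the checkpoint on first use; a GPU is advisable for a model of this size.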

Training and evaluation data

The model was fine-tuned on the Persian portion of Common Voice 17.0, a crowd-sourced dataset with recordings from speakers of varied ages and accents. The card metadata lists test[:10%] as the evaluation split and train[:10%]+validation[:10%] in its arguments, which suggests that only the first 10% of each split was used rather than the full data. Because the evaluation subset is small, the reported metrics are best read as an estimate, and robustness across Persian dialects has not been measured separately.
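
A minimal sketch of how such slices could be loaded with the Datasets library follows. The split expressions come from the card metadata; the Hub dataset id `mozilla-foundation/common_voice_17_0`, the config name `fa`, and the helper name `load_persian_common_voice` are assumptions, not taken from the card. The audio column is resampled to 16 kHz because Whisper's feature extractor expects that rate.

```python
from typing import Dict

# Split expressions as listed in the card metadata.
SPLITS: Dict[str, str] = {
    "train": "train[:10%]+validation[:10%]",
    "eval": "test[:10%]",
}


def load_persian_common_voice(
    dataset_id: str = "mozilla-foundation/common_voice_17_0",  # assumed Hub id
    config: str = "fa",  # assumed config name for Persian
) -> dict:
    """Load the 10% slices and resample the audio to 16 kHz."""
    # Imported lazily so SPLITS can be inspected without datasets installed.
    from datasets import Audio, load_dataset

    loaded = {}
    for name, expression in SPLITS.items():
        dataset = load_dataset(dataset_id, config, split=expression)
        # Whisper's feature extractor expects 16 kHz input.
        loaded[name] = dataset.cast_column("audio", Audio(sampling_rate=16_000))
    return loaded
```

Depending on the Datasets version and your Hub account, loading may also require authentication or `trust_remote_code=True`.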

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 1e-05
  • train_batch_size: 4
  • eval_batch_size: 4
  • seed: 42
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08, which helps maintain stability during training.
  • lr_scheduler_type: linear
  • lr_scheduler_warmup_steps: 500
  • num_epochs: 1, meaning the model was trained over the training data once.
  • mixed_precision_training: Native AMP, which allows for faster training by using lower precision without significant loss in accuracy.
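
As a rough sketch, the hyperparameters above map onto `Seq2SeqTrainingArguments` keyword arguments as shown below. This is a reconstruction, not the author's training script: it assumes the listed batch sizes are per-device values, that Native AMP corresponds to `fp16=True`, and `build_training_arguments` and `linear_warmup_lr` are hypothetical helper names. The schedule function mirrors the usual linear warmup and decay rule; the total of 987 steps is taken from the training results table.

```python
from typing import Any, Dict

# Hyperparameters from the card, expressed as Seq2SeqTrainingArguments kwargs.
TRAINING_KWARGS: Dict[str, Any] = {
    "learning_rate": 1e-5,
    "per_device_train_batch_size": 4,  # assumption: card value is per device
    "per_device_eval_batch_size": 4,
    "seed": 42,
    "adam_beta1": 0.9,
    "adam_beta2": 0.999,
    "adam_epsilon": 1e-8,
    "lr_scheduler_type": "linear",
    "warmup_steps": 500,
    "num_train_epochs": 1,
    "fp16": True,  # assumption: "Native AMP" means fp16 mixed precision
}


def build_training_arguments(output_dir: str):
    """Create the arguments object (fp16 requires a CUDA device)."""
    from transformers import Seq2SeqTrainingArguments  # lazy import

    return Seq2SeqTrainingArguments(output_dir=output_dir, **TRAINING_KWARGS)


def linear_warmup_lr(
    step: int,
    base_lr: float = 1e-5,
    warmup_steps: int = 500,
    total_steps: int = 987,
) -> float:
    """Learning rate at a given step under linear warmup, then linear decay."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)
```

Under these assumptions the learning rate peaks at step 500 and decays to zero at step 987, so roughly half of the single epoch is spent in warmup.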

Training results

During training, the model achieved the following results:

  • Training Loss: 0.2083 at the end of 1 epoch.
  • Validation Loss: 0.4072, measured on held-out data.
  • Word Error Rate (WER): 38.9451, the number of word-level substitutions, deletions, and insertions as a percentage of the words in the reference transcripts of the evaluation set.

| Training Loss | Epoch | Step | Validation Loss | Wer     |
|---------------|-------|------|-----------------|---------|
| 0.2083        | 1.0   | 987  | 0.4072          | 38.9451 |
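
To make the WER figure concrete, here is a small self-contained sketch of the standard definition: word-level edit distance divided by the number of reference words, times 100. It is illustrative only; the card does not say which implementation or text normalization produced the reported value, and `word_error_rate` is a hypothetical helper name.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER in percent: (substitutions + deletions + insertions) / reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping only the previous row.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = previous[j] + 1      # reference word missing in hypothesis
            insertion = current[j - 1] + 1  # extra word in hypothesis
            current.append(min(substitution, deletion, insertion))
        previous = current

    return 100.0 * previous[-1] / len(ref_words)
```

For example, `word_error_rate("a b c d", "a x c")` counts one substitution and one deletion over four reference words. Because insertions count as errors, the value can exceed 100 when the hypothesis is much longer than the reference.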

A WER of about 38.9% after a single epoch leaves clear room for improvement through further optimization and fine-tuning.

Framework versions

The model was trained using the following versions of libraries:

  • Transformers: 4.44.0, which provides the tools and APIs to fine-tune transformer models like Whisper.
  • PyTorch: 2.4.0+cu121, the deep learning framework used to build and train the model.
  • Datasets: 2.21.0, used to load and preprocess the Common Voice dataset.
  • Tokenizers: 0.19.1, used for the text tokenization required by the model.
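
If you want to compare a local environment against these versions, a small check along the following lines may help. It is a sketch using only the standard library; the PyPI package names in `EXPECTED` (in particular `torch` for PyTorch) and the helper names are assumptions, and an exact string match is stricter than most setups need.

```python
from importlib import metadata
from typing import Callable, Dict, Optional, Tuple

# Versions listed in the card, keyed by assumed PyPI package name.
EXPECTED: Dict[str, str] = {
    "transformers": "4.44.0",
    "torch": "2.4.0+cu121",
    "datasets": "2.21.0",
    "tokenizers": "0.19.1",
}


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def version_mismatches(
    expected: Dict[str, str] = EXPECTED,
    lookup: Callable[[str], Optional[str]] = installed_version,
) -> Dict[str, Tuple[str, Optional[str]]]:
    """Map each package whose version differs to (wanted, found)."""
    mismatches = {}
    for package, wanted in expected.items():
        found = lookup(package)
        if found != wanted:
            mismatches[package] = (wanted, found)
    return mismatches
```

An empty result from `version_mismatches()` means the environment matches exactly. A PyTorch build without the `+cu121` suffix is reported as a mismatch even if the base version is the same.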