metadata
title: Tunisian Speech Recognition
emoji: 🌙
colorFrom: red
colorTo: white
sdk: gradio
sdk_version: 3.16.1
app_file: app.py
pinned: false
license: cc-by-nc-3.0

Overview

This project aims to create an Automatic Speech Recognition (ASR) model dedicated to the Tunisian Arabic dialect. The goal is to improve speech recognition technology for underrepresented linguistic communities by transcribing Tunisian dialect speech into written text.

Dataset

The part of the audio and text data that we collected ourselves to train and test the model has been released to encourage and support research within the community. Please find the dataset here. This Zenodo record contains labeled and unlabeled Tunisian Arabic audio data, along with textual data for language modelling. The folder also contains a 4-gram language model trained with KenLM on the data released in the Zenodo record. The .arpa file is called "outdomain.arpa".
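As a quick sanity check after downloading the language model, the header of an ARPA file can be read to confirm the n-gram order. The sketch below is not part of the released code: the function name `arpa_ngram_counts` is hypothetical, and the code relies only on the standard ARPA layout (a `\data\` section listing `ngram N=count` lines). It assumes the file is plain UTF-8 text.

```python
def arpa_ngram_counts(arpa_path):
    """Return {order: count} read from the \\data\\ header of an ARPA file."""
    counts = {}
    in_header = False
    with open(arpa_path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line == "\\data\\":
                in_header = True
            elif in_header and line.startswith("ngram "):
                # Header lines look like "ngram 3=12345".
                order, count = line[len("ngram "):].split("=")
                counts[int(order)] = int(count)
            elif in_header and line.startswith("\\"):
                # First "\N-grams:" section reached: the header is over.
                break
    return counts


if __name__ == "__main__":
    import os

    # "outdomain.arpa" is the file name given above; the local path is an assumption.
    if os.path.exists("outdomain.arpa"):
        counts = arpa_ngram_counts("outdomain.arpa")
        print("highest n-gram order:", max(counts))
```

For a 4-gram model, the highest order reported should be 4. To score sentences rather than just inspect the file, the `kenlm` Python package can load the same .arpa file, provided it is installed.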

Performance

The following table summarizes the model's performance on the test sets considered:

| Dataset | CER | WER |
|---|---|---|
| TARIC | 6.22% | 10.55% |
| IWSLT | 21.18% | 39.53% |
| TunSwitch TO | 9.67% | 25.54% |

More details about the test sets and the conditions leading to these results are available in the paper.
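For readers unfamiliar with the two metrics, here is a minimal sketch of how character error rate (CER) and word error rate (WER) are commonly computed: the total edit distance between references and hypotheses, divided by the total number of reference units. This is an illustration only, not the evaluation code used for the numbers above. The function names are hypothetical, and counting spaces as characters in CER is an assumption that differs between toolkits.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two sequences."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            cost = 0 if ref_item == hyp_item else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def error_rate(references, hypotheses, unit="word"):
    """Corpus-level error rate in percent: total edits / total reference units."""
    total_edits = 0
    total_units = 0
    for ref, hyp in zip(references, hypotheses):
        # unit="word" gives WER; any other value splits into characters (CER).
        ref_units = ref.split() if unit == "word" else list(ref)
        hyp_units = hyp.split() if unit == "word" else list(hyp)
        total_edits += edit_distance(ref_units, hyp_units)
        total_units += len(ref_units)
    return 100.0 * total_edits / total_units


if __name__ == "__main__":
    # One substitution and one deletion over four reference words.
    print(error_rate(["a b c d"], ["a x c"], unit="word"))  # 50.0
```

Because the rate is computed over the whole corpus rather than averaged per utterance, long utterances weigh more than short ones.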

Team

Here are the team members who have contributed to this project

Paper

More in-depth details and insights are available in a released preprint. Please find the paper here. If you use or refer to this model, please cite:

@misc{abdallah2023leveraging,
      title={Leveraging Data Collection and Unsupervised Learning for Code-switched Tunisian Arabic Automatic Speech Recognition}, 
      author={Ahmed Amine Ben Abdallah and Ata Kabboudi and Amir Kanoun and Salah Zaiem},
      year={2023},
      eprint={2309.11327},
      archivePrefix={arXiv},
      primaryClass={eess.AS}
}

Datasets

This ASR model was trained on:

  • TARIC: the Tunisian Arabic Railway Interaction Corpus, a collection of audio recordings and transcriptions of dialogues in the Tunisian Railway Transport Network (TARIC Corpus).
  • IWSLT: a Tunisian conversational speech corpus (IWSLT Corpus).
  • TunSwitch: our crowd-collected dataset, described in the paper presented above.

Demo

A working live demo is available here: LINK

Inference

1. Create a CSV test file

First, create a CSV file that follows SpeechBrain's format, with 4 columns:

  • ID: a unique identifier for each audio sample in the dataset
  • wav: the path to the audio file
  • wrd: the text transcription of the audio file, if you have one and want to use your set for evaluation. Put any placeholder text if you don't have transcriptions. An example is provided in this folder, in the file taric_test.csv
  • duration: the duration of the audio in seconds
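The four columns above can be generated from a list of WAV files with the Python standard library alone. The following is a minimal sketch, not the authors' tooling: the function names, the ID scheme (file name without extension) and the "placeholder" transcription are all assumptions made for illustration. It also assumes uncompressed PCM WAV files, which is what the `wave` module reads.

```python
import csv
import wave
from pathlib import Path


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def write_test_csv(wav_paths, csv_path, transcripts=None):
    """Write a CSV with the ID, wav, wrd and duration columns.

    transcripts is an optional dict mapping a wav path (as str) to its text;
    a placeholder is written when no transcription is available.
    """
    transcripts = transcripts or {}
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["ID", "wav", "wrd", "duration"])
        for wav_path in wav_paths:
            wav_path = Path(wav_path)
            writer.writerow([
                wav_path.stem,  # hypothetical ID scheme: file name without extension
                str(wav_path),
                transcripts.get(str(wav_path), "placeholder"),
                f"{wav_duration_seconds(wav_path):.2f}",
            ])


if __name__ == "__main__":
    # "my_audio" and "my_test.csv" are placeholder names for this example.
    files = sorted(Path("my_audio").glob("*.wav"))
    if files:
        write_test_csv(files, "my_test.csv")
```

Since the ID is derived from the file name, two files with the same name in different folders would collide; in that case a different ID scheme is needed.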

2. Adjust the hyperparams.yaml file

Set the test_csv parameter to the path of your CSV file.

To run this recipe, do the following:

> python train_with_wavlm.py semi_wavlm_large_tunisian_ctc/1234/hyperparams.yaml --test_csv=path_to_csv
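When the evaluation is launched from another script, the same invocation can be assembled in Python. This is a sketch under assumptions: the helper names are hypothetical, the script is expected to sit in the current working directory, and the override is passed as a single `--test_csv=<path>` token, which is a form SpeechBrain's command-line parsing accepts for hyperparameter overrides. By default nothing is executed; the command is only returned.

```python
import subprocess
import sys


def build_test_command(hparams_path, test_csv, script="train_with_wavlm.py"):
    """Build the argv list for the evaluation run described above."""
    return [sys.executable, script, hparams_path, f"--test_csv={test_csv}"]


def run_test(hparams_path, test_csv, dry_run=True):
    """Return the command; execute it only when dry_run is False."""
    command = build_test_command(hparams_path, test_csv)
    if not dry_run:
        # Raises CalledProcessError if the script exits with a non-zero status.
        subprocess.run(command, check=True)
    return command


if __name__ == "__main__":
    # "path_to_csv" is the placeholder from the command above; replace it with your file.
    command = run_test(
        "semi_wavlm_large_tunisian_ctc/1234/hyperparams.yaml",
        "path_to_csv",
    )
    print(" ".join(command))
```

Passing the arguments as a list avoids shell quoting problems when the CSV path contains spaces.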

If you want to run inference on single files, the Space demo provides easy-to-use inference code.

Contact

If you have questions, you can send an email to: [email protected]