metadata

license: cc-by-nc-4.0
language: ddn
metrics:
  - wer
tags:
  - text-to-audio
  - automatic-speech-recognition
  - wav2vec2-fine-tuning
  - dendi-text-to-speech
model-index:
  - name: Dendi Numerals ASR
    results:
      - task:
          name: Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: dendi
          type: dendi_numbers_dataset
        metrics:
          - name: Test WER
            type: wer
            value: 18.18
pipeline_tag: automatic-speech-recognition

CreaTiv Team (CTT): Dendi Numerals Automatic Speech Recognition

This repository contains an Automatic Speech Recognition (ASR) model built specifically to recognize numerals in the Dendi (ddn) language. It targets numbers from 0 to 1,000,000,000 spoken in Dendi and reaches a Word Error Rate (WER) of 18.18% on the test set.

This model is part of CreaTiv Team's Noulinmon project, a mobile app that makes calculations accessible in six local languages of Benin, with voice reading and AI features. You can find more CTT-ASR models on the Hugging Face Hub: ssid32/ctt-asr.

CTT-ASR is available in the 🤗 Transformers library from version 4.4 onwards.

Model Details

The model is a fine-tuned version of facebook/wav2vec2-large-xlsr-53 on Dendi. When using this model, make sure that your speech input is sampled at 16 kHz.
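Because the 16 kHz requirement is easy to miss, it can help to inspect a recording before running inference. The following is a minimal sketch using only Python's standard `wave` module, so it assumes an uncompressed PCM WAV file; the helper names `describe_wav` and `needs_resampling` are illustrative and not part of this repository.

```python
import wave

EXPECTED_RATE = 16_000  # sampling rate this model card asks for


def describe_wav(path):
    """Return (sample_rate, n_channels) read from a PCM WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate(), wav_file.getnchannels()


def needs_resampling(path, expected_rate=EXPECTED_RATE):
    """Return True if the file's sampling rate differs from the expected rate."""
    sample_rate, _ = describe_wav(path)
    return sample_rate != expected_rate
```

If `needs_resampling("audio_test.wav")` returns True, resample the audio to 16 kHz before passing it to the processor.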

Usage

To use this model, first install the latest version of the 🤗 Transformers library, together with PyTorch and torchaudio, which the inference code imports:

pip install --upgrade transformers accelerate torch torchaudio

Then, run inference with the following code snippet:

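import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("ssid32/wav2vec2-xlsr-dendi-ddn-for-numerals")
model = Wav2Vec2ForCTC.from_pretrained("ssid32/wav2vec2-xlsr-dendi-ddn-for-numerals")

# Load the audio file; torchaudio returns a (channels, frames) tensor.
speech_array, sampling_rate = torchaudio.load("audio_test.wav")

# The model expects 16 kHz input, so resample if the file uses another rate.
if sampling_rate != 16_000:
    speech_array = torchaudio.functional.resample(speech_array, sampling_rate, 16_000)

# Average the channels to get a mono signal, then convert to a NumPy array.
speech_array = speech_array.mean(dim=0).numpy()
inputs = processor(speech_array, sampling_rate=16_000, return_tensors="pt", padding=True)

# Run the model without tracking gradients.
with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

# Greedy decoding: take the most likely token per frame, then decode to text.
output = processor.batch_decode(torch.argmax(logits, dim=-1))

print("Output:", output)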


Upon processing the sample audio, the model produces the following output:

Output: ['zangu ihaaku nda weiguu']

In this case, the output represents the numeral 850 in the Dendi language.
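The decoding step in the snippet above (an argmax over the logits followed by `batch_decode`) is greedy CTC decoding: repeated frame predictions are collapsed, blank tokens are dropped, and the word delimiter becomes a space. The sketch below illustrates that idea in plain Python. The toy vocabulary, the blank id of 0, and the function name `greedy_ctc_decode` are assumptions made for illustration; the model's real vocabulary and token ids will differ.

```python
def greedy_ctc_decode(frame_ids, id_to_token, blank_id=0, word_delimiter="|"):
    """Collapse repeated ids, drop blanks, and join the remaining tokens into text."""
    tokens = []
    previous = None
    for token_id in frame_ids:
        # Keep an id only when it starts a new run and is not the blank token.
        if token_id != previous and token_id != blank_id:
            tokens.append(id_to_token[token_id])
        previous = token_id
    return "".join(tokens).replace(word_delimiter, " ").strip()


# Hypothetical toy vocabulary, not the vocabulary of this model.
toy_vocab = {0: "<pad>", 1: "|", 2: "n", 3: "d", 4: "a"}

# Per-frame argmax ids, read as: n n <pad> d d a | a
frame_ids = [2, 2, 0, 3, 3, 4, 1, 4]
print(greedy_ctc_decode(frame_ids, toy_vocab))  # nda a
```

Note that a blank between two identical ids keeps both characters, which is how CTC can emit doubled letters such as the "aa" in "ihaaku".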

Evaluation result

The model achieves a Word Error Rate (WER) of 18.18% on the test set.
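For readers unfamiliar with the metric, WER is the word-level edit distance between the reference transcript and the model output (substitutions, deletions, and insertions), divided by the number of reference words. Here is a minimal sketch of that calculation; it is not the evaluation script used for this model, and the hypothesis string is made up for illustration.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous_row[j] holds the distance between the reference words
    # processed so far and the first j hypothesis words.
    previous_row = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current_row = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous_row[j - 1] + (ref_word != hyp_word)
            deletion = previous_row[j] + 1
            insertion = current_row[j - 1] + 1
            current_row.append(min(substitution, deletion, insertion))
        previous_row = current_row
    return previous_row[-1] / len(ref_words)


reference = "zangu ihaaku nda weiguu"
hypothesis = "zangu ihaaku weiguu"  # made-up transcript with one word dropped
print(f"WER: {word_error_rate(reference, hypothesis):.2%}")  # WER: 25.00%
```

One dropped word out of four reference words gives a WER of 25%.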

Authors

This model was developed by:

Salim KORA GUERA (Hugging Face username: ssid32) | ([email protected])
Etienne TOVIMAFA (Hugging Face username: MrBendji) | ([email protected])

Citation

@misc {kora_guera_tovimafa_2024,
    author       = { {Salim KORA GUERA and Etienne TOVIMAFA} },
    title        = { wav2vec2-xlsr-dendi-ddn-for-numerals },
    year         = 2024,
    url          = { https://huggingface.co/ssid32/wav2vec2-xlsr-dendi-ddn-for-numerals },
    doi          = { 10.57967/hf/2930 },
    publisher    = { Hugging Face }
}

License

The model is licensed under CC BY-NC 4.0.