Macedonian Wav2Vec2 Model
This repository contains a Wav2Vec2 model trained and fine-tuned on a custom dataset for the Macedonian language. The model is built on the HuggingFace Transformers library, a popular open-source library for natural language processing, and is hosted on the HuggingFace Hub.
Model Details
The Wav2Vec2 model is a state-of-the-art automatic speech recognition (ASR) model that converts spoken language into written text. This particular model has been trained and fine-tuned specifically for the Macedonian language; in the accompanying paper it reaches a WER of 0.21 and a CER of 0.09, making it suitable for various speech-to-text applications in Macedonian.
How to Use
Installation
You can install the required dependencies by using pip, the Python package installer:
pip install transformers
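The examples below also rely on torch, soundfile, and scipy, and the Colab recording helper further down uses ffmpeg-python. Assuming you want to run them as-is, these can be installed the same way (ffmpeg-python additionally requires the ffmpeg binary to be available on your system):
pip install torch soundfile scipy ffmpeg-python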
Usage Example
Here's a simple example demonstrating how to use the Macedonian Wav2Vec2 model for speech recognition:
import soundfile as sf
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Load the model and processor
model = Wav2Vec2ForCTC.from_pretrained("Konstantin-Bogdanoski/wav2vec2-macedonian-base")
processor = Wav2Vec2Processor.from_pretrained("Konstantin-Bogdanoski/wav2vec2-macedonian-base")

# Perform speech recognition on an audio file
# (the model expects audio sampled at 22050 Hz, see the "Audio Data Format" section below)
file_path = "path/to/audio.wav"
speech_array, sampling_rate = sf.read(file_path)
inputs = processor(speech_array, sampling_rate=sampling_rate, return_tensors="pt")

with torch.no_grad():
    logits = model(inputs.input_values).logits

# Convert predicted token IDs to text
predicted_ids = logits.argmax(dim=-1)
transcription_text = processor.batch_decode(predicted_ids)[0]
print("Transcription:", transcription_text)
Make sure to replace "path/to/audio.wav" with the actual path to your audio file. The transcription_text variable will contain the recognized text from the speech.
Model Fine-Tuning
If you're interested in fine-tuning the Wav2Vec2 model on your own custom dataset for the Macedonian language, you can refer to the HuggingFace documentation.
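As an illustrative starting point only (not the exact recipe used for this model), a minimal fine-tuning setup with the Transformers Trainer could look like the sketch below; train_dataset and data_collator are assumed to be provided by you (for example, a dataset of {"input_values", "labels"} examples and a CTC padding collator as in the HuggingFace ASR examples):

import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor, Trainer, TrainingArguments

# Reuse the processor of this checkpoint so the vocabulary matches
processor = Wav2Vec2Processor.from_pretrained("Konstantin-Bogdanoski/wav2vec2-macedonian-base")
model = Wav2Vec2ForCTC.from_pretrained(
    "Konstantin-Bogdanoski/wav2vec2-macedonian-base",
    ctc_loss_reduction="mean",
    pad_token_id=processor.tokenizer.pad_token_id,
)
model.freeze_feature_encoder()  # keep the convolutional feature extractor frozen

training_args = TrainingArguments(
    output_dir="wav2vec2-macedonian-finetuned",  # hypothetical output directory
    per_device_train_batch_size=8,
    learning_rate=1e-4,
    num_train_epochs=5,
    fp16=torch.cuda.is_available(),
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,  # assumption: your prepared training dataset
    data_collator=data_collator,  # assumption: your CTC padding collator
)
trainer.train()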
Audio Data Format
The audio used to train the model has a sample rate of 22050 Hz. The following Python snippet converts audio to this format (the snippet comes from a Jupyter notebook; cells are separated by a line of "=" characters):
## Read audio from *user*
audio, sr = get_audio() # This function needs to be implemented to read audio from the microphone
print(audio)
print(sr)
================================================================================================
### Changing the sampling rate to correspond to the processor's sample rate
import numpy as np
from scipy.io import wavfile
from scipy import interpolate
================================================================================================
NEW_SAMPLERATE = 22050
old_samplerate = sr
old_audio = audio
if sr != NEW_SAMPLERATE:
    duration = old_audio.shape[0] / old_samplerate
    time_old = np.linspace(0, duration, old_audio.shape[0])
    time_new = np.linspace(0, duration, int(old_audio.shape[0] * NEW_SAMPLERATE / old_samplerate))
    # Resample by linear interpolation over the time axis
    interpolator = interpolate.interp1d(time_old, old_audio.T)
    new_audio = interpolator(time_new).T
    # Write the resampled audio to disk in the original dtype
    wavfile.write("out.wav", NEW_SAMPLERATE, np.round(new_audio).astype(old_audio.dtype))
else:
    # Already at the target sample rate; write the audio unchanged so that "out.wav" exists
    wavfile.write("out.wav", NEW_SAMPLERATE, old_audio)
================================================================================================
import soundfile
import torch
batch = {}
speech_array, sampling_rate = soundfile.read("out.wav")
batch["speech"] = speech_array
batch["sampling_rate"] = sampling_rate
batch["target_text"] = ""
batch["input_values"] = torch.from_numpy(np.asarray(processor(speech_array, sampling_rate=22050).input_values)).to("cuda")
with processor.as_target_processor():
batch["labels"] = processor("").input_ids
================================================================================================
batch['input_values'][0]
================================================================================================
batch["sampling_rate"]
================================================================================================
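Once the batch has been prepared, the transcription can be obtained by running the model on batch["input_values"] and decoding the predicted token IDs. A minimal sketch, assuming model is the Wav2Vec2ForCTC checkpoint loaded earlier and moved to the same "cuda" device as the inputs:

with torch.no_grad():
    logits = model(batch["input_values"]).logits  # model is assumed to be on "cuda" as well
predicted_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(predicted_ids)[0])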
get_audio() is a function we used in Google Colab to read audio from the user's microphone. The following code implements it (note: this code only works in Google Colab; you will need to adapt it to run in a local environment):
"""
JS script and Python code that record audio from the user
"""
from IPython.display import HTML, Audio
from google.colab.output import eval_js
from base64 import b64decode
import numpy as np
from scipy.io.wavfile import read as wav_read
import io
import ffmpeg
AUDIO_HTML = """
<script>
var my_div = document.createElement("DIV");
var my_p = document.createElement("P");
var my_btn = document.createElement("BUTTON");
var t = document.createTextNode("Press to start recording");
my_btn.appendChild(t);
//my_p.appendChild(my_btn);
my_div.appendChild(my_btn);
document.body.appendChild(my_div);
var base64data = 0;
var reader;
var recorder, gumStream;
var recordButton = my_btn;
var handleSuccess = function(stream) {
gumStream = stream;
var options = {
//bitsPerSecond: 8000, //chrome seems to ignore, always 48k
mimeType : 'audio/webm;codecs=opus'
//mimeType : 'audio/webm;codecs=pcm'
};
//recorder = new MediaRecorder(stream, options);
recorder = new MediaRecorder(stream);
recorder.ondataavailable = function(e) {
var url = URL.createObjectURL(e.data);
var preview = document.createElement('audio');
preview.controls = true;
preview.src = url;
document.body.appendChild(preview);
reader = new FileReader();
reader.readAsDataURL(e.data);
reader.onloadend = function() {
base64data = reader.result;
//console.log("Inside FileReader:" + base64data);
}
};
recorder.start();
};
recordButton.innerText = "Recording... press to stop";
navigator.mediaDevices.getUserMedia({audio: true}).then(handleSuccess);
function toggleRecording() {
if (recorder && recorder.state == "recording") {
recorder.stop();
gumStream.getAudioTracks()[0].stop();
recordButton.innerText = "Saving the recording... pls wait!"
}
}
// https://stackoverflow.com/a/951057
function sleep(ms) {
return new Promise(resolve => setTimeout(resolve, ms));
}
var data = new Promise(resolve=>{
//recordButton.addEventListener("click", toggleRecording);
recordButton.onclick = ()=>{
toggleRecording()
sleep(2000).then(() => {
// wait 2000ms for the data to be available...
// ideally this should use something like await...
//console.log("Inside data:" + base64data)
resolve(base64data.toString())
});
}
});
</script>
"""
def get_audio():
display(HTML(AUDIO_HTML))
data = eval_js("data")
binary = b64decode(data.split(',')[1])
process = (ffmpeg
.input('pipe:0')
.output('pipe:1', format='wav')
.run_async(pipe_stdin=True, pipe_stdout=True, pipe_stderr=True, quiet=True, overwrite_output=True)
)
output, err = process.communicate(input=binary)
riff_chunk_size = len(output) - 8
# Break up the chunk size into four bytes, held in b.
q = riff_chunk_size
b = []
for i in range(4):
q, r = divmod(q, 256)
b.append(r)
# Replace bytes 4:8 in proc.stdout with the actual size of the RIFF chunk.
riff = output[:4] + bytes(b) + output[8:]
sr, audio = wav_read(io.BytesIO(riff))
return audio, sr
Acknowledgments
We would like to acknowledge the creators of the Wav2Vec2 model and the HuggingFace library for their valuable contributions to the field of automatic speech recognition.
If you have any questions or encounter any issues, please feel free to open an issue in this repository. We are here to help!
Citation
To cite this model and the accompanying research paper, use the following BibTeX entry:
@InProceedings{10.1007/978-3-031-39059-3_17,
author="Bogdanoski, Konstantin
and Mishev, Kostadin
and Simjanoska, Monika
and Trajanov, Dimitar",
editor="Conte, Donatello
and Fred, Ana
and Gusikhin, Oleg
and Sansone, Carlo",
title="Exploring ASR Models in Low-Resource Languages: Use-Case the Macedonian Language",
booktitle="Deep Learning Theory and Applications",
year="2023",
publisher="Springer Nature Switzerland",
address="Cham",
pages="254--268",
abstract="We explore the use of Wav2Vec 2.0, NeMo, and ESPNet models trained on a dataset in Macedonian language for the development of Automatic Speech Recognition (ASR) models for low-resource languages. The study aims to evaluate the performance of recent state-of-the-art models for speech recognition in low-resource languages, such as Macedonian, where there are limited resources available for training or fine-tuning. The paper presents a methodology used for data collection and preprocessing, as well as the details of the three architectures used in the study. The study evaluates the performance of each model using WER and CER metrics and provides a comparative analysis of the results. The findings of the research showed that Wav2Vec 2.0 outperformed the other models for the Macedonian language with a WER of 0.21, and CER of 0.09, however, NeMo and ESPNet models are still good candidates for creating ASR tools for low-resource languages such as Macedonian. The research presented provides insights into the effectiveness of different models for ASR in low-resource languages and highlights the potentials for using these models to develop ASR tools for other languages in the future. These findings have significant implications for the development of ASR tools for other low-resource languages in the future, and can potentially improve accessibility to speech recognition technology for individuals and communities who speak these languages.",
isbn="978-3-031-39059-3"
}