techiaith/wav2vec2-xlsr-53-ft-cy-en-withlm

@@ -14,48 +14,60 @@ language:
 pipeline_tag: automatic-speech-recognition
 ---
-<!-- This model card has been generated automatically according to the information the Trainer had access to. You
-should probably proofread and complete it, then remove this comment. -->
-# wav2vec2-xlsr-53-ft-ccv-en-cy
-A speech recognition acoustic model for Welsh and English, fine-tuned from [facebook/wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53) using English/Welsh balanced data derived from version 11 of their respective Common Voice datasets (https://commonvoice.mozilla.org/cy/datasets). Custom bilingual Common Voice train/dev and test splits were built using the scripts at https://github.com/techiaith/docker-commonvoice-custom-splits-builder#introduction
-Source code and scripts for training wav2vec2-xlsr-ft-en-cy can be found at [https://github.com/techiaith/docker-wav2vec2-cy](https://github.com/techiaith/docker-wav2vec2-cy/blob/main/train/fine-tune/python/run_en_cy.sh).
 ## Usage
-The wav2vec2-xlsr-53-ft-ccv-en-cy model can be used directly as follows:
 ```python
 import torch
 import torchaudio
 import librosa
-from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
-processor = Wav2Vec2Processor.from_pretrained("techiaith/wav2vec2-xlsr-53-ft-ccv-en-cy")
-model = Wav2Vec2ForCTC.from_pretrained("techiaith/wav2vec2-xlsr-53-ft-ccv-en-cy")
-audio, rate = librosa.load(audio_file, sr=16000)
 inputs = processor(audio, sampling_rate=16_000, return_tensors="pt", padding=True)
 with torch.no_grad():
   tlogits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits
-# greedy decoding
-predicted_ids = torch.argmax(logits, dim=-1)
-print("Prediction:", processor.batch_decode(predicted_ids))
 ```
 ## Evaluation
-According to a balanced English+Welsh test set derived from Common Voice version 16.1, the WER of techiaith/wav2vec2-xlsr-53-ft-ccv-en-cy is **23.79%**
 However, when evaluated with language specific test sets, the model exhibits a bias to perform better with Welsh.

 pipeline_tag: automatic-speech-recognition
 ---
+# wav2vec2-xlsr-53-ft-cy-en-withlm
+This model is a version of [facebook/wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53)
+that has been fine-tuned with a custom bilingual dataset derived from the Welsh
+and English data releases of the Mozilla Foundation's Common Voice project. See: [techiaith/commonvoice_16_1_en_cy](https://huggingface.co/datasets/techiaith/commonvoice_16_1_en_cy).
+In addition, this model includes a single KenLM n-gram model trained on balanced
+collections of Welsh and English texts from [OSCAR](https://huggingface.co/datasets/oscar).
+This avoids the need for language detection to decide whether to use a Welsh or an English n-gram model during CTC decoding.
 ## Usage
+The `wav2vec2-xlsr-53-ft-cy-en-withlm` model can be used directly as follows:
 ```python
 import torch
 import torchaudio
 import librosa
+from transformers import Wav2Vec2ForCTC, Wav2Vec2ProcessorWithLM
+processor = Wav2Vec2ProcessorWithLM.from_pretrained("techiaith/wav2vec2-xlsr-53-ft-cy-en-withlm")
+model = Wav2Vec2ForCTC.from_pretrained("techiaith/wav2vec2-xlsr-53-ft-cy-en-withlm")
+audio, rate = librosa.load("path/to/audio_file", sr=16000)  # replace with the path to your audio file
 inputs = processor(audio, sampling_rate=16_000, return_tensors="pt", padding=True)
 with torch.no_grad():
   tlogits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits
+print("Prediction: ", processor.batch_decode(tlogits.numpy(), beam_width=10).text[0].strip())
+```
+Usage with a pipeline is even simpler:
+```python
+from transformers import pipeline
+transcriber = pipeline("automatic-speech-recognition", model="techiaith/wav2vec2-xlsr-53-ft-cy-en-withlm")
+def transcribe(audio):
+    return transcriber(audio)["text"]
+transcribe("path/or/url/to/any/audiofile")  # replace with a real path or URL
 ```
 ## Evaluation
+On a balanced English+Welsh test set derived from Common Voice version 16.1, the WER of techiaith/wav2vec2-xlsr-53-ft-cy-en-withlm is **23.79%**.
 However, when evaluated with language specific test sets, the model exhibits a bias to perform better with Welsh.
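
For reference, a WER figure like the one in the Evaluation section can be broken down per language roughly as sketched below. This is a minimal, dependency-free illustration and not the evaluation script behind the reported number; the helper names `word_errors` and `wer_by_language` and the toy sentences are hypothetical, and text normalisation (casing, punctuation) is assumed to have been done beforehand.

```python
from collections import defaultdict


def word_errors(reference: str, hypothesis: str) -> tuple[int, int]:
    """Return (word-level edit distance, number of reference words)."""
    ref, hyp = reference.split(), hypothesis.split()
    # Row-by-row Levenshtein distance over words instead of characters.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1], len(ref)


def wer_by_language(samples):
    """Corpus-level WER per language and overall.

    samples: iterable of (language, reference, hypothesis) tuples.
    """
    errors = defaultdict(int)
    words = defaultdict(int)
    for lang, ref, hyp in samples:
        e, n = word_errors(ref, hyp)
        for key in (lang, "all"):
            errors[key] += e
            words[key] += n
    # WER = total word errors / total reference words.
    return {key: errors[key] / words[key] for key in errors}


# Hypothetical toy data, only to show the shape of the input.
samples = [
    ("cy", "mae hi'n braf heddiw", "mae hi'n braf heddiw"),
    ("en", "the weather is fine today", "the weather was fine"),
]
scores = wer_by_language(samples)
for lang, score in sorted(scores.items()):
    print(f"{lang}: {score:.2%}")
```

Summing errors and reference words before dividing gives a corpus-level WER, so long utterances weigh more than short ones; averaging per-utterance WER would give a different figure.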