---
license: apache-2.0
datasets:
- mozilla-foundation/common_voice_17_0
- mozilla-foundation/common_voice_13_0
language:
- hi
metrics:
- wer
base_model:
- theainerd/Wav2Vec2-large-xlsr-hindi
pipeline_tag: automatic-speech-recognition
library_name: transformers
---

# Model Improvement

This model card highlights the improvement over the base model: a reduction in Word Error Rate (WER) from 72% to 54% on Hindi speech data, reflecting the effectiveness of the fine-tuning process.

# Wav2Vec2-Large-XLSR-Hindi-Finetuned - Yash_Ratnaker

This model is a fine-tuned version of [theainerd/Wav2Vec2-large-xlsr-hindi](https://huggingface.co/theainerd/Wav2Vec2-large-xlsr-hindi) on the Hindi portions of the Common Voice 13 and 17 datasets. It is specifically optimized for Hindi speech recognition and shows a notable improvement in transcription accuracy, achieving a **Word Error Rate (WER) of 54%** compared to the base model's WER of 72% on the same data.

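For context on the metric: WER counts word-level substitutions, deletions, and insertions against the number of words in the reference transcript. The snippet below is a minimal, made-up illustration using the `jiwer` package (the library that also backs the Hugging Face WER metric); it is not part of this model's evaluation code.

```python
import jiwer

# WER = (substitutions + deletions + insertions) / number of words in the reference.
# Hypothetical example: the hypothesis drops one of five reference words -> WER = 1/5 = 0.20.
reference = "मौसम आज बहुत अच्छा है"
hypothesis = "मौसम आज अच्छा है"
print(jiwer.wer(reference, hypothesis))  # 0.2
```
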
## Model description

This Wav2Vec2 model, originally developed by Facebook AI, utilizes self-supervised learning on large unlabeled speech datasets and is then fine-tuned on labeled data. This approach enables the model to learn intricate linguistic features and transcribe speech in Hindi with high accuracy. Fine-tuning on Common Voice Hindi data allows the model to better capture the language's nuances, improving transcription quality.

## Intended uses & limitations

This model is ideal for automatic speech recognition (ASR) applications in Hindi, such as media transcription, accessibility services, and educational content transcription, where audio quality is controlled.

## Usage

The model can be used directly (without a language model) as follows:

```python
import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Load the Hindi Common Voice dataset
test_dataset = load_dataset("common_voice", "hi", split="test[:2%]")

# Load the processor and model
processor = Wav2Vec2Processor.from_pretrained("yash072/wav2vec2-large-xlsr-YashHindi-4")
model = Wav2Vec2ForCTC.from_pretrained("yash072/wav2vec2-large-xlsr-YashHindi-4")
resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Function to process the dataset
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)

# Perform inference
with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)
print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset["sentence"][:2])
```

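For quick experimentation, the same checkpoint can also be run through the `transformers` pipeline API. This is a minimal sketch: the audio path is a placeholder, and decoding local files this way may require `ffmpeg` to be installed.

```python
from transformers import pipeline

# Minimal alternative to the manual steps above; "sample_hindi.wav" is a placeholder path.
asr = pipeline("automatic-speech-recognition", model="yash072/wav2vec2-large-xlsr-YashHindi-4")
print(asr("sample_hindi.wav")["text"])
```
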
## Evaluation

The model can be evaluated as follows on the Hindi test data of Common Voice.

```python
import torch
import torchaudio
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import re

# Load the dataset and metric
test_dataset = load_dataset("common_voice", "hi", split="test")
wer = load_metric("wer")

# Initialize processor and model
processor = Wav2Vec2Processor.from_pretrained("yash072/wav2vec2-large-xlsr-YashHindi-4")
model = Wav2Vec2ForCTC.from_pretrained("yash072/wav2vec2-large-xlsr-YashHindi-4")
model.to("cuda")

resampler = torchaudio.transforms.Resample(48_000, 16_000)
chars_to_ignore_regex = r'[\,\?\.\!\-\;\:\"\“]'

# Function to preprocess the data
def speech_file_to_array_fn(batch):
    batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower()
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)

# Evaluation function
def evaluate(batch):
    inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(inputs.input_values.to("cuda"), attention_mask=inputs.attention_mask.to("cuda")).logits
    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_strings"] = processor.batch_decode(pred_ids)
    return batch

result = test_dataset.map(evaluate, batched=True, batch_size=8)
print("WER: {:.2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["sentence"])))
```

### Limitations:

- The model may face challenges with dialectal or regional variations within Hindi.
- Performance can degrade with noisy audio or overlapping speech.
- It is not intended for real-time transcription due to latency considerations.

## Training and evaluation data

The model was fine-tuned on the Hindi portions of the Common Voice 13 and 17 datasets, which contain speech samples from native Hindi speakers. This data captures a range of accents, pronunciations, and recording conditions, enhancing the model’s ability to generalize across different speech patterns. Evaluation was performed on a carefully curated subset, ensuring a reliable benchmark for ASR performance in Hindi.

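A minimal sketch of loading the Hindi splits with the `datasets` library is shown below. Which splits were combined for fine-tuning is an assumption here; the Common Voice datasets on the Hub are gated, so the terms must be accepted and a token used for authentication, and depending on the `datasets` version `trust_remote_code=True` may also be required.

```python
from datasets import load_dataset

# Hindi subsets of Common Voice 17.0 and 13.0 (gated: accept the terms on the Hub and log in first).
cv17 = load_dataset("mozilla-foundation/common_voice_17_0", "hi", split="train+validation")
cv13 = load_dataset("mozilla-foundation/common_voice_13_0", "hi", split="train+validation")
print(len(cv17), len(cv13))
```
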
## Training procedure

### Hyperparameters and setup:

The following hyperparameters were used during training (an illustrative `TrainingArguments` sketch follows the list):

- **Learning rate**: 1e-4
- **Batch size**: 16 (per device)
- **Gradient accumulation steps**: 2
- **Evaluation strategy**: steps
- **Max steps**: 2500
- **Mixed precision**: FP16
- **Save steps**: 500
- **Evaluation steps**: 500
- **Logging steps**: 500
- **Warmup steps**: 500
- **Save total limit**: 1

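For reference, here is a minimal sketch of how the values above map onto `transformers` `TrainingArguments`. The output directory is a placeholder, any argument not listed in this card is left at its default, and this is not the exact training script.

```python
from transformers import TrainingArguments

# Sketch only: mirrors the hyperparameters listed above; output_dir is a placeholder.
training_args = TrainingArguments(
    output_dir="./wav2vec2-large-xlsr-hindi-finetuned",
    learning_rate=1e-4,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=2,
    evaluation_strategy="steps",  # renamed to `eval_strategy` in newer transformers releases
    max_steps=2500,
    fp16=True,
    save_steps=500,
    eval_steps=500,
    logging_steps=500,
    warmup_steps=500,
    save_total_limit=1,
)
```
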
### Training output

- **Global step**: 2500
- **Training runtime**: Approximately 1 hour 21 minutes
- **Epochs**: 5-6

### Training results

| Step | Training Loss | Validation Loss | WER    |
|------|---------------|-----------------|--------|
| 500  | 5.603000      | 0.987691        | 0.7556 |
| 1000 | 0.720300      | 0.667561        | 0.6196 |
| 1500 | 0.507000      | 0.592814        | 0.5844 |
| 2000 | 0.431100      | 0.549786        | 0.5439 |
| 2500 | 0.395600      | 0.537703        | 0.5428 |

### Framework versions

- Transformers: 4.42.4
- PyTorch: 2.3.1+cu121
- Datasets: 2.20.0
- Tokenizers: 0.19.1