---
license: apache-2.0
datasets:
- mozilla-foundation/common_voice_17_0
- mozilla-foundation/common_voice_13_0
language:
- hi
metrics:
- wer
base_model:
- theainerd/Wav2Vec2-large-xlsr-hindi
pipeline_tag: automatic-speech-recognition
library_name: transformers
---
# Model Improvement
This model card highlights the improvement over the base model: a reduction in WER from 72% to 54%. This gain reflects the efficacy of fine-tuning on Hindi speech data.
# Wav2Vec2-Large-XLSR-Hindi-Finetuned - Yash_Ratnaker
This model is a fine-tuned version of [theainerd/Wav2Vec2-large-xlsr-hindi](https://huggingface.co/theainerd/Wav2Vec2-large-xlsr-hindi) on the Common Voice 13 and 17 datasets. It is specifically optimized for Hindi speech recognition, with a notable improvement in transcription accuracy, achieving a **Word Error Rate (WER) of 54%**, compared to the base model’s WER of 72% on the same dataset.
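For context, WER counts the minimum number of word-level substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the number of reference words. Below is a minimal illustration using the `jiwer` package (an assumption for this sketch; the Evaluation section of this card uses the `datasets` WER metric instead):

```python
import jiwer

# One substituted word out of four reference words -> WER = 0.25
reference  = "मुझे हिंदी पसंद है"
hypothesis = "मुझे हिन्दी पसंद है"
print(jiwer.wer(reference, hypothesis))
```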
## Model description
This Wav2Vec2 model, originally developed by Facebook AI, utilizes self-supervised learning on large unlabeled speech datasets and is then fine-tuned on labeled data. This approach enables the model to learn intricate linguistic features and transcribe speech in Hindi with high accuracy. Fine-tuning on Common Voice Hindi data allows the model to better capture the language's nuances, improving transcription quality.
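Concretely, the fine-tuned CTC head emits a distribution over a Hindi character vocabulary for every audio frame, and decoding collapses these frame-level predictions into text. The vocabulary can be inspected from the processor; a minimal sketch:

```python
from transformers import Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("yash072/wav2vec2-large-xlsr-YashHindi-4")

# The CTC head predicts over this character-level vocabulary at each audio frame
vocab = processor.tokenizer.get_vocab()
print(len(vocab), "tokens, e.g.:", sorted(vocab)[:10])
```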
## Intended uses & limitations
This model is ideal for automatic speech recognition (ASR) applications in Hindi, such as media transcription, accessibility services, and educational content transcription, where audio quality is controlled.
## Usage
The model can be used directly (without a language model) as follows:
```python
import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Load a small slice of the Hindi Common Voice test split
# (the legacy "common_voice" loading script; newer datasets releases may
# require the mozilla-foundation/common_voice_* datasets instead)
test_dataset = load_dataset("common_voice", "hi", split="test[:2%]")

# Load the processor and model
processor = Wav2Vec2Processor.from_pretrained("yash072/wav2vec2-large-xlsr-YashHindi-4")
model = Wav2Vec2ForCTC.from_pretrained("yash072/wav2vec2-large-xlsr-YashHindi-4")

# Common Voice audio is 48 kHz; the model expects 16 kHz input
resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Read each audio file and resample it to 16 kHz
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)

# Perform inference and greedily decode the CTC logits
with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)
print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset["sentence"][:2])
```
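For quick experiments, the checkpoint can also be run through the high-level `pipeline` API, which handles audio decoding, resampling, and CTC decoding internally. A minimal sketch; `sample.wav` is a placeholder filename, not a file shipped with this model:

```python
from transformers import pipeline

# Build an ASR pipeline around the fine-tuned checkpoint
asr = pipeline("automatic-speech-recognition", model="yash072/wav2vec2-large-xlsr-YashHindi-4")

# "sample.wav" is an illustrative path; any Hindi speech recording should work
print(asr("sample.wav")["text"])
```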
## Evaluation
The model can be evaluated as follows on the Hindi test data of Common Voice.
```python
import re

import torch
import torchaudio
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Load the full Hindi test split and the WER metric
test_dataset = load_dataset("common_voice", "hi", split="test")
wer = load_metric("wer")

# Initialize processor and model
processor = Wav2Vec2Processor.from_pretrained("yash072/wav2vec2-large-xlsr-YashHindi-4")
model = Wav2Vec2ForCTC.from_pretrained("yash072/wav2vec2-large-xlsr-YashHindi-4")
model.to("cuda")

# Common Voice audio is 48 kHz; the model expects 16 kHz input
resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Punctuation to strip from the reference sentences before scoring
chars_to_ignore_regex = r'[\,\?\.\!\-\;\:\"\“]'

# Normalize the reference text and resample the audio
def speech_file_to_array_fn(batch):
    batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower()
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)

# Batched inference: greedy (argmax) CTC decoding on the GPU
def evaluate(batch):
    inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(inputs.input_values.to("cuda"), attention_mask=inputs.attention_mask.to("cuda")).logits
    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_strings"] = processor.batch_decode(pred_ids)
    return batch

result = test_dataset.map(evaluate, batched=True, batch_size=8)
print("WER: {:.2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["sentence"])))
```
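Note that `load_metric` is deprecated in recent releases of the `datasets` library. If it is unavailable in your environment, the WER metric can be loaded from the standalone `evaluate` package instead; a minimal drop-in sketch:

```python
import evaluate

# Drop-in replacement for datasets.load_metric("wer")
wer = evaluate.load("wer")

# Same compute() interface as above; toy inputs shown here
score = wer.compute(predictions=["नमस्ते दुनिया"], references=["नमस्ते दुनिया"])
print("WER: {:.2f}".format(100 * score))
```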
### Limitations
- The model may face challenges with dialectal or regional variations within Hindi.
- Performance can degrade with noisy audio or overlapping speech.
- It is not intended for real-time transcription due to latency considerations.
## Training and evaluation data
The model was fine-tuned on the Hindi portions of the Common Voice 13 and 17 datasets, which contain speech samples from native Hindi speakers. This data captures a range of accents, pronunciations, and recording conditions, improving the model's ability to generalize across different speech patterns. Evaluation was performed on the Common Voice Hindi test split, as shown in the Evaluation section above.
## Training procedure
### Hyperparameters and setup
The following hyperparameters were used during training; a sketch of the corresponding `TrainingArguments` follows the list:
- **Learning rate**: 1e-4
- **Batch size**: 16 (per device)
- **Gradient accumulation steps**: 2
- **Evaluation strategy**: steps
- **Max steps**: 2500
- **Mixed precision**: FP16
- **Save steps**: 500
- **Evaluation steps**: 500
- **Logging steps**: 500
- **Warmup steps**: 500
- **Save total limit**: 1
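The original training script is not included in this card; the following is a sketch of how the hyperparameters above could map onto `transformers.TrainingArguments`. The `output_dir` value is a placeholder, and `eval_strategy` assumes Transformers ≥ 4.41 (older versions use `evaluation_strategy`):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./wav2vec2-large-xlsr-hindi",  # placeholder path, not from this card
    learning_rate=1e-4,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=2,
    eval_strategy="steps",
    max_steps=2500,
    fp16=True,
    save_steps=500,
    eval_steps=500,
    logging_steps=500,
    warmup_steps=500,
    save_total_limit=1,
)
```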
### Training output
- **Global step**: 2500
- **Training runtime**: Approximately 1 hour 21 minutes
- **Epochs**: 5-6
### Training results
| Step | Training Loss | Validation Loss | WER |
|------|---------------|-----------------|--------|
| 500 | 5.603000 | 0.987691 | 0.7556 |
| 1000 | 0.720300 | 0.667561 | 0.6196 |
| 1500 | 0.507000 | 0.592814 | 0.5844 |
| 2000 | 0.431100 | 0.549786 | 0.5439 |
| 2500 | 0.395600 | 0.537703 | 0.5428 |
### Framework versions
- **Transformers**: 4.42.4
- **PyTorch**: 2.3.1+cu121
- **Datasets**: 2.20.0
- **Tokenizers**: 0.19.1