---
license: apache-2.0
datasets:
- mozilla-foundation/common_voice_17_0
- mozilla-foundation/common_voice_13_0
language:
- hi
metrics:
- wer
base_model:
- theainerd/Wav2Vec2-large-xlsr-hindi
pipeline_tag: automatic-speech-recognition
library_name: transformers
---

# Model Improvement

This model card highlights the improvement over the base model: a reduction in WER from 72% to 54%, reflecting the efficacy of fine-tuning on Hindi speech data.

# Wav2Vec2-Large-XLSR-Hindi-Finetuned - Yash_Ratnaker

This model is a fine-tuned version of [theainerd/Wav2Vec2-large-xlsr-hindi](https://huggingface.co/theainerd/Wav2Vec2-large-xlsr-hindi) on the Common Voice 13 and 17 datasets. It is optimized for Hindi speech recognition and achieves a **Word Error Rate (WER) of 54%**, compared to the base model's 72% on the same data.

## Model description

This Wav2Vec2 model, originally developed by Facebook AI, is pretrained with self-supervised learning on large unlabeled speech corpora and then fine-tuned on labeled data. This approach enables the model to learn rich acoustic and linguistic features and to transcribe Hindi speech with high accuracy. Fine-tuning on Common Voice Hindi data helps the model capture the language's nuances, improving transcription quality.

## Intended uses & limitations

This model is well suited to automatic speech recognition (ASR) applications in Hindi, such as media transcription, accessibility services, and educational content transcription, where audio quality is controlled.

## Usage

The model can be used directly (without a language model) as follows:

```python
import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Load a small slice of the Hindi Common Voice test split
test_dataset = load_dataset("common_voice", "hi", split="test[:2%]")

# Load the processor and model
processor = Wav2Vec2Processor.from_pretrained("yash072/wav2vec2-large-xlsr-YashHindi-4")
model = Wav2Vec2ForCTC.from_pretrained("yash072/wav2vec2-large-xlsr-YashHindi-4")

# Common Voice audio is 48 kHz; the model expects 16 kHz
resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Read each audio file into an array and resample it
def speech_file_to_array_fn(batch):
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)

# Perform inference
with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)
print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset["sentence"][:2])
```
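To transcribe a recording of your own rather than a Common Voice clip, the same pipeline applies. Below is a minimal sketch, assuming a mono WAV file; the path `my_hindi_clip.wav` is a placeholder:

```python
import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("yash072/wav2vec2-large-xlsr-YashHindi-4")
model = Wav2Vec2ForCTC.from_pretrained("yash072/wav2vec2-large-xlsr-YashHindi-4")

# Load a local mono recording (placeholder path) and resample to the
# 16 kHz rate the model expects
speech, sampling_rate = torchaudio.load("my_hindi_clip.wav")
speech = torchaudio.functional.resample(speech, sampling_rate, 16_000).squeeze().numpy()

inputs = processor(speech, sampling_rate=16_000, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

transcription = processor.batch_decode(torch.argmax(logits, dim=-1))[0]
print(transcription)
```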
## Evaluation

The model can be evaluated as follows on the Hindi test data of Common Voice:

```python
import re

import evaluate
import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Load the dataset and the WER metric
test_dataset = load_dataset("common_voice", "hi", split="test")
wer = evaluate.load("wer")

# Initialize processor and model
processor = Wav2Vec2Processor.from_pretrained("yash072/wav2vec2-large-xlsr-YashHindi-4")
model = Wav2Vec2ForCTC.from_pretrained("yash072/wav2vec2-large-xlsr-YashHindi-4")
model.to("cuda")

resampler = torchaudio.transforms.Resample(48_000, 16_000)
chars_to_ignore_regex = r'[\,\?\.\!\-\;\:\"\“]'

# Normalize the reference text and resample the audio
def speech_file_to_array_fn(batch):
    batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower()
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)

# Batched inference; named to avoid shadowing the `evaluate` library
def evaluate_batch(batch):
    inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(inputs.input_values.to("cuda"), attention_mask=inputs.attention_mask.to("cuda")).logits
    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_strings"] = processor.batch_decode(pred_ids)
    return batch

result = test_dataset.map(evaluate_batch, batched=True, batch_size=8)
print("WER: {:.2f}".format(100 * wer.compute(predictions=result["pred_strings"], references=result["sentence"])))
```

### Limitations

- The model may struggle with dialectal or regional variation within Hindi.
- Performance can degrade on noisy audio or overlapping speech.
- It is not intended for real-time transcription due to latency considerations.

## Training and evaluation data

The model was fine-tuned on the Hindi portions of the Common Voice 13 and 17 datasets, which contain speech samples from native Hindi speakers. This data covers a range of accents, pronunciations, and recording conditions, enhancing the model's ability to generalize across different speech patterns. Evaluation was performed on a carefully curated subset, ensuring a reliable benchmark for ASR performance in Hindi.

## Training procedure

### Hyperparameters and setup

The following hyperparameters were used during training (a hedged `TrainingArguments` sketch is given at the end of this card):

- **Learning rate**: 1e-4
- **Batch size**: 16 (per device)
- **Gradient accumulation steps**: 2
- **Evaluation strategy**: steps
- **Max steps**: 2500
- **Mixed precision**: FP16
- **Save steps**: 500
- **Evaluation steps**: 500
- **Logging steps**: 500
- **Warmup steps**: 500
- **Save total limit**: 1

### Training output

- **Global step**: 2500
- **Training runtime**: approximately 1 hour 21 minutes
- **Epochs**: 5-6

### Training results

| Step | Training Loss | Validation Loss | WER    |
|------|---------------|-----------------|--------|
| 500  | 5.603000      | 0.987691        | 0.7556 |
| 1000 | 0.720300      | 0.667561        | 0.6196 |
| 1500 | 0.507000      | 0.592814        | 0.5844 |
| 2000 | 0.431100      | 0.549786        | 0.5439 |
| 2500 | 0.395600      | 0.537703        | 0.5428 |

### Framework versions

- Transformers: 4.42.4
- PyTorch: 2.3.1+cu121
- Datasets: 2.20.0
- Tokenizers: 0.19.1
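### Training configuration sketch

The hyperparameters listed above map onto `transformers`' `TrainingArguments` roughly as follows. This is a reconstruction for reference, not the exact training script; `output_dir` is a placeholder, and anything not listed above (optimizer, data collator, dataloader settings) is assumed to use Trainer defaults.

```python
from transformers import TrainingArguments

# Reconstructed from the hyperparameters listed in this card;
# output_dir is a placeholder and unlisted arguments are assumptions.
training_args = TrainingArguments(
    output_dir="wav2vec2-large-xlsr-hindi-finetuned",  # placeholder
    per_device_train_batch_size=16,
    gradient_accumulation_steps=2,
    learning_rate=1e-4,
    warmup_steps=500,
    max_steps=2500,
    fp16=True,
    eval_strategy="steps",  # `evaluation_strategy` on transformers < 4.41
    eval_steps=500,
    save_steps=500,
    logging_steps=500,
    save_total_limit=1,
)
```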