--- |
|
extra_gated_prompt: "This is a BETA model. To use this model, you must agree to the [licensing terms](license.md)."
|
language: |
|
- 'no' |
|
license: apache-2.0 |
|
tags: |
|
- audio |
|
- asr |
|
- automatic-speech-recognition |
|
- hf-asr-leaderboard |
|
model-index: |
|
- name: Small Scream - April Beta |
|
results: [] |
|
--- |
|
|
|
# Small Scream - April Beta |
|
This model is a fine-tuned version of [openai/whisper-small](https://huggingface.co/openai/whisper-small) on the NbAiLab/NCC_speech_all_v5 dataset. Evaluation was performed with a beam size of 5.
|
It achieves the following results on the evaluation set (WER and CER in %; a sketch of how these metrics can be computed follows the list):
|
- step: 49999 |
|
- eval_loss: 0.5299 |
|
- train_loss: 0.3369 |
|
- eval_wer: 11.9976 |
|
- eval_cer: 5.6236 |
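
As a minimal sketch of how such metrics can be computed (not the actual evaluation script used for this model), the `evaluate` library provides `wer` and `cer` metrics. The Norwegian sentence pairs below are made-up placeholders, not examples from the evaluation set:

```python
import evaluate

# Hypothetical reference/prediction pairs; the real evaluation used NbAiLab/NCC_speech_all_v5
references = ["det var en fin dag i dag", "hun gikk til butikken"]
predictions = ["det var en fin dag i dag", "hun gikk til butikk"]

wer = evaluate.load("wer")  # word error rate
cer = evaluate.load("cer")  # character error rate

# Multiply by 100 to match the percentage scale reported above
print("WER:", 100 * wer.compute(predictions=predictions, references=references))
print("CER:", 100 * cer.compute(predictions=predictions, references=references))
```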
|
|
|
## Model description |
|
|
|
This is a BETA version. You need to accept [the terms and conditions](license.md) to use it.
|
|
|
|
|
## Using the Model |
|
There are several ways to use this model, and we hope people will convert it into other formats as well. The code below lets you process long files with Transformers:
|
|
|
```python
import torch
import librosa
from transformers import pipeline

# Try "mps" for Metal (Mac), "cuda" if you have a GPU, and "cpu" if not
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

pipe = pipeline(
    "automatic-speech-recognition",
    model="NbAiLab/small_scream_april_beta",
    chunk_length_s=30,  # split long audio into 30-second chunks
    device=device,
    max_new_tokens=128,
    # Add "num_beams": 5 here to match the beam size used for evaluation
    generate_kwargs={"language": "no", "task": "transcribe"},
)

# Load the WAV file resampled to 16 kHz mono, as Whisper expects.
# Modify this to use mp3 instead.
audio_path = "myfile.wav"
samples, sample_rate = librosa.load(audio_path, sr=16000, mono=True)

# Run the pipeline and print the transcription
prediction = pipe(samples)["text"]
print(prediction)
```
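
The pipeline handles chunking of long files for you. If you prefer working with the model directly, a minimal sketch using the standard Transformers Whisper classes is shown below; note that a plain `generate` call only covers about 30 seconds of audio, and `num_beams=5` mirrors the beam size reported above:

```python
import librosa
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("NbAiLab/small_scream_april_beta")
model = WhisperForConditionalGeneration.from_pretrained("NbAiLab/small_scream_april_beta")

# Load the audio resampled to 16 kHz mono, as Whisper expects
samples, _ = librosa.load("myfile.wav", sr=16000, mono=True)

# Convert the waveform to log-mel input features
inputs = processor(samples, sampling_rate=16000, return_tensors="pt")

# Beam search with beam size 5, matching the evaluation setup
predicted_ids = model.generate(
    inputs.input_features,
    num_beams=5,
    language="no",
    task="transcribe",
)

print(processor.batch_decode(predicted_ids, skip_special_tokens=True)[0])
```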
|
|
|
### Training hyperparameters |
|
|
|
The following hyperparameters were used during training (a rough sketch of equivalent Transformers training arguments follows the list):
|
- learning_rate: 6e-06 |
|
- lr_scheduler_type: linear |
|
- per_device_train_batch_size: 16 |
|
- total_train_batch_size_per_node: 64 |
|
- total_train_batch_size: 64 |
|
- total_optimization_steps: 50000 |
|
- starting_optimization_step: None |
|
- finishing_optimization_step: 50000 |
|
- num_train_dataset_workers: 32 |
|
- total_num_training_examples: 3200000 |
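
The exact training script is not included in this card. As a rough, hedged sketch only, the core hyperparameters above map onto Transformers `Seq2SeqTrainingArguments` roughly like this (the output path is hypothetical, and several fields above have no one-to-one equivalent):

```python
from transformers import Seq2SeqTrainingArguments

# A sketch mirroring the listed hyperparameters; not the actual training configuration
training_args = Seq2SeqTrainingArguments(
    output_dir="./small_scream_april_beta",  # hypothetical output path
    learning_rate=6e-6,
    lr_scheduler_type="linear",
    per_device_train_batch_size=16,  # with 4 devices this gives the total batch size of 64
    max_steps=50_000,                # total_optimization_steps
    dataloader_num_workers=32,       # num_train_dataset_workers
)
```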
|
|
|
### Training results |
|
|
|
| step | eval_loss | train_loss | eval_wer | eval_cer | |
|
|:-----:|:---------:|:----------:|:--------:|:--------:| |
|
| 0 | 1.5034 | 1.6162 | 32.7040 | 11.9022 | |
|
| 0 | 1.5034 | 1.6129 | 32.7040 | 11.9022 | |
|
| 0 | 1.5034 | 1.5855 | 32.7040 | 11.9022 | |
|
| 2500 | 0.9684 | 0.6679 | 21.7113 | 8.0826 | |
|
| 5000 | 0.8986 | 0.6577 | 18.6358 | 7.3167 | |
|
| 7500 | 0.7365 | 0.4619 | 16.3825 | 6.8027 | |
|
| 10000 | 0.6429 | 0.4965 | 14.7990 | 6.2887 | |
|
| 12500 | 0.6688 | 0.4602 | 13.7942 | 6.0217 | |
|
| 15000 | 0.6509 | 0.4650 | 13.3069 | 5.9965 | |
|
| 17500 | 0.5692 | 0.3979 | 12.8502 | 5.6790 | |
|
| 20000 | 0.5530 | 0.3931 | 13.0938 | 5.8554 | |
|
| 22500 | 0.5320 | 0.4441 | 12.5457 | 5.7596 | |
|
| 25000 | 0.5109 | 0.4116 | 12.7893 | 5.8503 | |
|
| 27500 | 0.4855 | 0.3728 | 12.9111 | 5.8856 | |
|
| 30000 | 0.4720 | 0.3842 | 12.6066 | 5.8201 | |
|
| 32500 | 0.4889 | 0.3051 | 12.4239 | 5.7244 | |
|
| 35000 | 0.5312 | 0.3388 | 12.6066 | 5.9259 | |
|
| 37500 | 0.5138 | 0.3409 | 12.3934 | 5.7999 | |
|
| 40000 | 0.5214 | 0.2886 | 11.9367 | 5.5530 | |
|
| 42500 | 0.5420 | 0.3431 | 12.6675 | 5.9914 | |
|
| 45000 | 0.5263 | 0.4015 | 12.3934 | 5.9360 | |
|
| 47500 | 0.5378 | 0.3218 | 12.1194 | 5.6185 | |
|
| 49999 | 0.5299 | 0.3369 | 11.9976 | 5.6236 | |
|
|
|
|
|
### Framework versions |
|
|
|
- Transformers 4.28.0.dev0 |
|
- Datasets 2.11.0 |
|
- Tokenizers 0.13.3 |
|
|