metadata
language:
- en
library_name: transformers
pipeline_tag: automatic-speech-recognition
Model trained in int8 with LoRA
Usage:
prepare the pipeline, passing any custom generate_kwargs supported by GenerationConfig (https://huggingface.co/docs/transformers/v4.40.0/en/main_classes/text_generation#transformers.GenerationConfig):
asr_model = prepare_pipeline(
    model_dir='.',  # wherever you saved the model
    generate_kwargs={
        'max_new_tokens': 112,
        'num_beams': 1,
        'repetition_penalty': 1,
        'do_sample': False,
    },
)
run ASR:
asr_model(audio_path)
run ASR on every audio file in the directory audio_dir:
If generate_kwargs is not specified, decoding defaults to deterministic greedy search with up to 112 new tokens and no repetition penalty.
ASRdirWhisat(
    audio_dir,
    out_dir='../whisat_results/',
    model_dir='.',
)
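The default decoding settings described above can be written out as a plain dictionary. The sketch below is only an illustration: `DEFAULT_GENERATE_KWARGS` and `resolve_generate_kwargs` are hypothetical names that are not part of this repository, and merging user values over the defaults key by key is an assumption about how overrides might be handled, not a description of what `prepare_pipeline` or `ASRdirWhisat` actually do.

```python
# Defaults as stated in this README: greedy decoding, up to 112 new tokens,
# no repetition penalty. The names below are hypothetical, for illustration only.
DEFAULT_GENERATE_KWARGS = {
    "max_new_tokens": 112,
    "num_beams": 1,
    "repetition_penalty": 1,
    "do_sample": False,
}


def resolve_generate_kwargs(user_kwargs=None):
    """Return a copy of the defaults, overridden by any user-supplied values."""
    merged = dict(DEFAULT_GENERATE_KWARGS)  # copy so the defaults stay untouched
    merged.update(user_kwargs or {})        # None or {} leaves the defaults as-is
    return merged


# Example: switch to beam search while keeping the other defaults.
print(resolve_generate_kwargs({"num_beams": 4}))
```

The resulting dictionary could then be passed as generate_kwargs. If the repository's code instead replaces the whole dictionary when one is given, all four keys need to be supplied explicitly.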
Training information:
- Training script: tune_hf_whisper.py
- Training hyperparameters: hparams.yaml
- Training data manifest: PUBLIC_KIDS_TRAIN_v4_deduped.csv
Note: to recreate this training you will need to acquire the following public datasets:
- MyST (myst-v0.4.2)
- CuKids
- CSLU
and ensure they are stored at paths consistent with those in the data manifest above.
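One way to check that the datasets are stored where the manifest expects them is to scan the CSV and list every referenced file that is missing on disk. This is a minimal sketch using only the standard library. The helper name `find_missing_audio` is hypothetical, and the name of the manifest column that holds the audio paths is not documented here, so it is passed in as a parameter and should be read from the CSV header.

```python
import csv
from pathlib import Path


def find_missing_audio(manifest_csv, path_column):
    """Return the entries of `path_column` that do not point to an existing file.

    `path_column` is an assumption supplied by the caller: inspect the
    manifest header to find the column that actually stores audio paths.
    """
    missing = []
    with open(manifest_csv, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        if reader.fieldnames is None or path_column not in reader.fieldnames:
            raise KeyError(f"column {path_column!r} not found in {manifest_csv}")
        for row in reader:
            path = row[path_column]
            # Treat empty cells and non-existent files as missing.
            if not path or not Path(path).is_file():
                missing.append(path)
    return missing
```

Calling it with the manifest file and the appropriate column name returns a list; an empty list means every path in that column resolves to a file on the local machine.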
Reference:
@inproceedings{southwell2024,
title={Automatic speech recognition tuned for child speech in the classroom},
author={Southwell, Rosy and Ward, Wayne and Trinh, Viet Anh and Clevenger, Charis and Clevenger, Clay and Watts, Emily and Reitman, Jason and D’Mello, Sidney and Whitehill, Jacob},
booktitle={{IEEE} International Conference on Acoustics, Speech and Signal Processing, {ICASSP} 2024, Seoul, South Korea, April 14-19, 2024},
year={2024},
}