reazonspeech-nemo-v2

reazonspeech-nemo-v2 is an automatic speech recognition model trained on the ReazonSpeech v2.0 corpus.

The model supports inference on long-form Japanese audio clips up to several hours long.

Model Architecture

The model uses the improved Conformer architecture described in "Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition".

The model is a subword-based RNN-T with 619M parameters in total.
The encoder uses Longformer attention with a local context size of 256 and a single global token.
The decoder has a vocabulary of 3000 tokens, constructed with a SentencePiece unigram tokenizer.
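
To make the encoder's attention pattern concrete, the following is a minimal sketch of a Longformer-style mask with a local window and one global token. It is an illustration only, not the model's implementation: the function names are hypothetical, and it assumes that the context size of 256 is split evenly on both sides of each position and that the global token sits at position 0.

```python
def local_global_mask(seq_len, window=256, num_global=1):
    """Build a seq_len x seq_len boolean matrix.

    mask[i][j] is True when position i may attend to position j.
    Assumptions (not taken from the model card): the window is split
    evenly on both sides, and global tokens occupy the first positions.
    """
    half = window // 2
    # Local attention: each position sees neighbours within +/- half.
    mask = [[abs(i - j) <= half for j in range(seq_len)] for i in range(seq_len)]
    # Global attention: a global token sees every position and is seen by all.
    for g in range(min(num_global, seq_len)):
        for k in range(seq_len):
            mask[g][k] = True
            mask[k][g] = True
    return mask


def attended_pairs(mask):
    """Count the (query, key) pairs that the mask allows."""
    return sum(sum(row) for row in mask)


mask = local_global_mask(seq_len=1024)
print(attended_pairs(mask), "of", 1024 * 1024, "pairs attended")
```

Because each position attends to a fixed-size neighbourhood plus the global token, the number of attended pairs grows roughly linearly with the sequence length rather than quadratically, which is what makes inference on very long audio practical.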

We trained this model for 1 million steps using the AdamW optimizer with a Noam annealing schedule.
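
For readers unfamiliar with the Noam schedule, here is a minimal sketch of the standard formula: the learning rate rises linearly during warmup and then decays with the inverse square root of the step. The values of `d_model`, `warmup_steps`, and `scale` below are placeholders chosen for illustration; the actual training hyperparameters are not stated in this document.

```python
def noam_lr(step, d_model=1024, warmup_steps=10000, scale=1.0):
    """Standard Noam learning rate at a given training step.

    d_model, warmup_steps and scale are hypothetical values, not the
    settings used to train reazonspeech-nemo-v2.
    """
    step = max(step, 1)  # avoid division by zero at step 0
    # Linear warmup until warmup_steps, then inverse square-root decay.
    return scale * d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


for s in (1, 5000, 10000, 100000, 1000000):
    print(s, noam_lr(s))
```

The rate peaks exactly at `warmup_steps`, where the two branches of the `min` are equal.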

Usage

We recommend using this model through our reazonspeech library.

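from reazonspeech.nemo.asr import load_model, transcribe, audio_from_path

# Load the audio file to transcribe
audio = audio_from_path("speech.wav")
# Load the ASR model
model = load_model()
# Run recognition and print the transcribed text
ret = transcribe(model, audio)
print(ret.text)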

License

Apache License 2.0