Getting Started

Details on the model, its performance, and more are available on arXiv.

Clone the model

The Reverb ASR model v1 is stored in this model repository.

Install inference requirements

See our inference code at https://github.com/revdotcom/reverb/tree/main/asr

About

Rev’s Reverb ASR was trained on 200,000 hours of English speech, all expertly transcribed by humans, making it the largest corpus of human-transcribed audio ever used to train an open-source model. The quality of this data has produced the world’s most accurate English automatic speech recognition (ASR) system, using an efficient model architecture that can be run on either CPU or GPU. Additionally, Reverb ASR provides user control over the level of verbatimicity of the output transcript, making it ideal both for clean, readable transcription and for use cases like audio editing that require transcription of every spoken word, including hesitations and re-wordings. Users can specify fully verbatim, fully non-verbatim, or anywhere in between for their transcription output.

Code

The folder wenet is taken from a fork of the WeNet repository, with some modifications made for the Rev-specific architecture.

The folder wer_evaluation contains instructions and code for running different benchmark utilities. These scripts are not specific to the Reverb architecture.

Features

Transcription Style Options

Reverb ASR was trained to produce transcriptions in either a verbatim style, in which every word is transcribed as spoken, or a non-verbatim style, in which disfluencies may be removed from the transcript.

Users can specify Reverb ASR's output style with the verbatimicity parameter. 1 corresponds to a verbatim transcript and 0 corresponds to a non-verbatim transcript. Values between 0 and 1 are accepted and may correspond to a semi-non-verbatim style. See our demo here to test the verbatimicity parameter with your own audio.
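As a minimal sketch of the mapping described above, the helper below labels a verbatimicity value by style. The function name and the "intermediate" label are hypothetical, and rejecting values outside the 0 to 1 range is an assumption of this sketch, not documented behavior of Reverb ASR.

```python
def describe_verbatimicity(value: float) -> str:
    """Label a verbatimicity value with the transcription style it selects."""
    if not 0.0 <= value <= 1.0:
        # Assumption: this sketch treats values outside [0, 1] as invalid.
        raise ValueError(f"verbatimicity must be in [0, 1], got {value}")
    if value == 1.0:
        return "verbatim"
    if value == 0.0:
        return "non-verbatim"
    # Values strictly between 0 and 1 may give a style in between the two.
    return "intermediate"


for v in (0.0, 0.5, 1.0):
    print(v, describe_verbatimicity(v))
```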

Decoding Options

Reverb ASR uses the joint CTC/attention architecture described here and here, and supports multiple modes of decoding. Users can pass one or more decoding modes to recognize_wav.py; a separate output directory is created for each mode.

Decoding options are:

  • attention
  • ctc_greedy_search
  • ctc_prefix_beam_search
  • attention_rescoring
  • joint_decoding
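To give a feel for the simplest of these modes, here is a minimal, generic sketch of CTC greedy decoding: pick the best-scoring token in each frame, collapse consecutive repeats, then drop blanks. It is not the Reverb or WeNet implementation; the function name, the toy scores, and the choice of 0 as the blank token ID are assumptions made for illustration.

```python
from itertools import groupby
from typing import List, Sequence


def ctc_greedy_decode(frame_scores: Sequence[Sequence[float]], blank_id: int = 0) -> List[int]:
    """Greedy CTC decoding over per-frame token scores."""
    # 1. Take the highest-scoring token ID in every frame.
    best = [max(range(len(frame)), key=frame.__getitem__) for frame in frame_scores]
    # 2. Collapse runs of the same token into a single token.
    collapsed = [token for token, _ in groupby(best)]
    # 3. Remove the blank token (assumed to be ID 0 in this sketch).
    return [token for token in collapsed if token != blank_id]


# Toy example: 6 frames over a 3-token vocabulary (0 = blank).
frames = [
    [0.1, 0.8, 0.1],
    [0.1, 0.7, 0.2],
    [0.9, 0.05, 0.05],
    [0.2, 0.7, 0.1],
    [0.1, 0.2, 0.7],
    [0.1, 0.1, 0.8],
]
print(ctc_greedy_decode(frames))  # [1, 1, 2]
```

The blank frame between the two runs of token 1 is what lets the same token appear twice in the output. The beam-search and rescoring modes instead keep several candidate hypotheses rather than a single best path.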

Usage

python wenet/bin/recognize_wav.py --config model.yaml \
    --checkpoint model.pt \
    --audio hello_world.wav \
    --modes ctc_prefix_beam_search attention_rescoring \
    --gpu 0 \
    --verbatimicity 1.0
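If you prefer to drive the script from Python, for example to loop over several audio files, the same command can be assembled as an argument list. This is a sketch that uses only the flags shown above; the helper names are hypothetical, and it assumes you run it from the directory containing wenet/ with model.yaml and model.pt present.

```python
import subprocess
from typing import List, Sequence


def build_recognize_command(
    audio: str,
    modes: Sequence[str],
    verbatimicity: float = 1.0,
    config: str = "model.yaml",
    checkpoint: str = "model.pt",
    gpu: int = 0,
) -> List[str]:
    """Assemble the recognize_wav.py invocation as an argv list."""
    return [
        "python", "wenet/bin/recognize_wav.py",
        "--config", config,
        "--checkpoint", checkpoint,
        "--audio", audio,
        "--modes", *modes,  # one or more decoding modes
        "--gpu", str(gpu),
        "--verbatimicity", str(verbatimicity),
    ]


def run_recognition(cmd: List[str]) -> None:
    """Run the command; raises CalledProcessError if the script fails."""
    subprocess.run(cmd, check=True)


cmd = build_recognize_command(
    "hello_world.wav",
    ["ctc_prefix_beam_search", "attention_rescoring"],
)
print(" ".join(cmd))
```

Passing the arguments as a list avoids shell quoting issues with file names that contain spaces.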

Or check out our demo on HuggingFace.

Benchmarking

See the wer_evaluation folder of https://github.com/revdotcom/reverb/tree/main/asr for details and results.

Cite this Model

If you use this model, please cite it as follows:

@misc{bhandari2024reverbopensourceasrdiarization,
      title={Reverb: Open-Source ASR and Diarization from Rev}, 
      author={Nishchal Bhandari and Danny Chen and Miguel Ángel del Río Fernández and Natalie Delworth and Jennifer Drexler Fox and Migüel Jetté and Quinten McNamara and Corey Miller and Ondřej Novotný and Ján Profant and Nan Qin and Martin Ratajczak and Jean-Philippe Robichaud},
      year={2024},
      eprint={2410.03930},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2410.03930}, 
}

Acknowledgments

Special thanks to the WeNet team for their work and for making it available under an open-source license.

License

See LICENSE for details.
