	---
	language:
	- pt
	license: apache-2.0
	tags:
	- whisper-event
	- generated_from_trainer
	datasets:
	- mozilla-foundation/common_voice_11_0
	metrics:
	- wer
	model-index:
	- name: Whisper Medium Portuguese
	  results:
	  - task:
	      name: Automatic Speech Recognition
	      type: automatic-speech-recognition
	    dataset:
	      name: mozilla-foundation/common_voice_11_0 pt
	      type: mozilla-foundation/common_voice_11_0
	      config: pt
	      split: test
	      args: pt
	    metrics:
	    - name: Wer
	      type: wer
	      value: 6.5785713084850626
	---

	# Whisper Medium Portuguese 🇧🇷🇵🇹

	Welcome to Whisper Medium for Portuguese transcription 👋🏻

	If you want to transcribe Portuguese audio to text quickly and reliably, you are in the right place!

	With a state-of-the-art [Word Error Rate](https://huggingface.co/spaces/evaluate-metric/wer) (WER) of just 6.579% on the Common Voice 11 test set, this model makes roughly half as many word errors as prior state-of-the-art [wav2vec2](https://huggingface.co/Edresson/wav2vec2-large-xlsr-coraa-portuguese) models. Compared to the original [whisper-medium](https://huggingface.co/openai/whisper-medium) model (8.100% WER), it makes about 1.2x fewer errors 🚀.

	This model is a fine-tuned version of [openai/whisper-medium](https://huggingface.co/openai/whisper-medium), trained on the Portuguese (`pt`) configuration of the [mozilla-foundation/common_voice_11_0](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0) dataset.

	The following table compares our model with the most downloaded models on the Hub for [Portuguese Automatic Speech Recognition](https://huggingface.co/models?language=pt&pipeline_tag=automatic-speech-recognition&sort=downloads) 🗣. Lower WER is better:

	| Model | WER | Parameters |
	|--------------------------------------------------|:--------:|:------------:|
	| [openai/whisper-medium](https://huggingface.co/openai/whisper-medium) | 8.100 | 769M |
	| [jlondonobo/whisper-medium-pt](https://huggingface.co/jlondonobo/whisper-medium-pt) | 6.579 🤗 | 769M |
	| [jonatasgrosman/wav2vec2-large-xlsr-53-portuguese](https://huggingface.co/jonatasgrosman/wav2vec2-large-xlsr-53-portuguese) | 11.310 | 317M |
	| [Edresson/wav2vec2-large-xlsr-coraa-portuguese](https://huggingface.co/Edresson/wav2vec2-large-xlsr-coraa-portuguese) | 20.080 | 317M |
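
The improvement factors quoted above follow from the WER column. Here is a minimal sketch of that arithmetic, using only the values in the table. The dictionary keys are shorthand labels chosen for this example, not official model names.

```python
# WER values copied from the comparison table (lower is better).
wer = {
    "whisper-medium-pt": 6.579,
    "whisper-medium": 8.100,
    "wav2vec2-xlsr-53": 11.310,
    "wav2vec2-xlsr-coraa": 20.080,
}


def error_ratio(baseline: str, ours: str = "whisper-medium-pt") -> float:
    """Return how many times higher the baseline's WER is than ours."""
    return wer[baseline] / wer[ours]


for name in ("whisper-medium", "wav2vec2-xlsr-53", "wav2vec2-xlsr-coraa"):
    print(f"{name}: {error_ratio(name):.2f}x")
```

By this arithmetic, the gap to the two wav2vec2 baselines ranges from about 1.7x to about 3x, depending on which one is used as the reference.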


	### How to use
	You can use this model directly with a pipeline. This is especially useful for short audio. For long-form transcriptions please use the code in the [Long-form transcription](#long-form-transcription) section.

	```bash
	pip install git+https://github.com/huggingface/transformers --force-reinstall
	pip install torch
	```

	```python
	>>> from transformers import pipeline
	>>> import torch

	>>> device = 0 if torch.cuda.is_available() else "cpu"

	# Load the pipeline
	>>> transcribe = pipeline(
	... task="automatic-speech-recognition",
	... model="jlondonobo/whisper-medium-pt",
	... chunk_length_s=30,
	... device=device,
	... )

	# Force model to transcribe in Portuguese
	>>> transcribe.model.config.forced_decoder_ids = transcribe.tokenizer.get_decoder_prompt_ids(language="pt", task="transcribe")

	# Transcribe your audio file
	>>> transcribe("audio.m4a")["text"]
	'Eu falo português.'
	```
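
The `chunk_length_s=30` argument makes the pipeline split longer audio into overlapping 30-second windows. The sketch below is a simplified illustration of that windowing idea, not the library's actual code. It assumes an overlap of `chunk_length_s / 6` on each side of a window, which is the pipeline's documented default for `stride_length_s`. The function name `chunk_windows` is hypothetical.

```python
from typing import List, Optional, Tuple


def chunk_windows(
    audio_s: float, chunk_s: float = 30.0, stride_s: Optional[float] = None
) -> List[Tuple[float, float]]:
    """Return (start, end) times in seconds of overlapping windows over the audio."""
    if stride_s is None:
        stride_s = chunk_s / 6  # assumed default overlap on each side
    step = chunk_s - 2 * stride_s  # amount of new audio each window advances by
    if step <= 0:
        raise ValueError("stride_s must be smaller than half of chunk_s")

    windows = []
    start = 0.0
    while True:
        end = min(start + chunk_s, audio_s)
        windows.append((start, end))
        if end >= audio_s:  # the last window reaches the end of the audio
            break
        start += step
    return windows


print(chunk_windows(70.0))
```

For a 70-second file this yields three windows, starting at 0, 20, and 40 seconds. Each interior boundary is covered by two windows, so their transcriptions can be reconciled when they are merged.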

	#### Long-form transcription
	To improve long-form transcription, you can convert the Hugging Face model into a `whisper` checkpoint and use the matching algorithm from the original paper. To do this, you must install `whisper` and a set of tools developed by [@bayartsogt](https://huggingface.co/bayartsogt).
	```bash
	pip install git+https://github.com/openai/whisper.git
	pip install git+https://github.com/bayartsogt-ya/whisper-multiple-hf-datasets
	```

	Then convert the Hugging Face model and transcribe:
	```python
	>>> import torch
	>>> import whisper
	>>> from multiple_datasets.hub_default_utils import convert_hf_whisper

	>>> device = "cuda" if torch.cuda.is_available() else "cpu"

	# Write HF model to local whisper model
	>>> convert_hf_whisper("jlondonobo/whisper-medium-pt", "local_whisper_model.pt")

	# Load the whisper model
	>>> model = whisper.load_model("local_whisper_model.pt", device=device)

	# Transcribe arbitrarily long audio
	>>> model.transcribe("long_audio.m4a", language="pt")["text"]
	'Olá eu sou o José. Tenho 23 anos e trabalho...'
	```


	### Training hyperparameters
	We used the following hyperparameters for training:
	- `learning_rate`: 1e-05
	- `train_batch_size`: 32
	- `eval_batch_size`: 16
	- `seed`: 42
	- `optimizer`: Adam with betas=(0.9,0.999) and epsilon=1e-08
	- `lr_scheduler_type`: linear
	- `lr_scheduler_warmup_steps`: 500
	- `training_steps`: 5000
	- `mixed_precision_training`: Native AMP
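
To make the schedule concrete, the sketch below computes the learning rate at a given step from the values listed above. It assumes that `lr_scheduler_type: linear` means the usual linear warmup followed by linear decay to zero, as in `get_linear_schedule_with_warmup` from `transformers`. The function name `linear_lr` is hypothetical.

```python
def linear_lr(
    step: int,
    base_lr: float = 1e-5,
    warmup_steps: int = 500,
    total_steps: int = 5000,
) -> float:
    """Learning rate at `step` under linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        # Warmup: ramp linearly from 0 up to base_lr.
        return base_lr * step / warmup_steps
    # Decay: ramp linearly from base_lr down to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / (total_steps - warmup_steps)


for step in (0, 250, 500, 2750, 5000):
    print(step, linear_lr(step))
```

Under these assumptions the rate peaks at 1e-05 at step 500 and is back to half of that by step 2750.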

	### Training results

	| Training Loss | Epoch | Step | Validation Loss | Wer |
	|:-------------:|:-----:|:----:|:---------------:|:------:|
	| 0.0698 | 1.09 | 1000 | 0.1876 | 7.189 |
	| 0.0218 | 3.07 | 2000 | 0.2254 | 7.110 |
	| 0.0053 | 5.06 | 3000 | 0.2711 | 6.969 |
	| 0.0017 | 7.04 | 4000 | 0.3030 | 6.686 |
	| 0.0005 | 9.02 | 5000 | 0.3205 | 6.579 🤗 |
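
For readers unfamiliar with the `Wer` column, the sketch below shows how a word error rate can be computed: the word-level edit distance between a reference and a hypothesis, divided by the number of reference words and expressed as a percentage. This is an illustration only. It splits on whitespace and applies no text normalization, and the normalization used for the numbers in this card is not described here, so it is not guaranteed to reproduce them.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance over reference length, as a percentage."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,         # deletion
                    curr[j - 1] + 1,     # insertion
                    prev[j - 1] + cost,  # match or substitution
                )
            )
        prev = curr
    return 100.0 * prev[-1] / len(ref)


print(word_error_rate("eu falo português muito bem", "eu falo português bem"))
```

In this example one of the five reference words is missing from the hypothesis, so the error rate is 20%.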


	### Framework versions

	- Transformers 4.26.0.dev0
	- PyTorch 1.13.0+cu117
	- Datasets 2.7.1.dev0
	- Tokenizers 0.13.2