---
license: apache-2.0
tags:
- generated_from_trainer
metrics:
- wer
- cer
model-index:
- name: hubert-base-japanese-asr
  results:
  - task:
      name: Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: common_voice_11_0
      type: common_voice
      args: ja
    metrics:
    - name: Test WER
      type: wer
      value: 27.511982
    - name: Test CER
      type: cer
      value: 11.699897
datasets:
- mozilla-foundation/common_voice_11_0
language:
- ja
---

# hubert-base-japanese-asr

This model is a fine-tuned version of [rinna/japanese-hubert-base](https://huggingface.co/rinna/japanese-hubert-base), trained for ASR on the Japanese split of the [common_voice_11_0 dataset](https://huggingface.co/datasets/mozilla-foundation/common_voice_11_0/viewer/ja).

The model outputs hiragana only; reference transcripts are converted to hiragana accordingly during evaluation (see the script below).
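
For a quick smoke test, the minimal sketch below transcribes a single clip. The file name `sample.wav` and the use of `librosa` for loading are illustrative assumptions; any 16 kHz mono speech array works.

```python
import torch
import librosa
from transformers import HubertForCTC, Wav2Vec2Processor

# load the fine-tuned model and its processor
model = HubertForCTC.from_pretrained("TKU410410103/hubert-base-japanese-asr")
processor = Wav2Vec2Processor.from_pretrained("TKU410410103/hubert-base-japanese-asr")

# "sample.wav" is a placeholder; librosa resamples it to the model's 16 kHz
speech, _ = librosa.load("sample.wav", sr=16000)

inputs = processor(speech, sampling_rate=16000, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(inputs.input_values).logits

# greedy CTC decoding; the result is a hiragana string
pred_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(pred_ids)[0])
```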

## Acknowledgments

The fine-tuning approach for this model follows the training methodology used in [vumichien/wav2vec2-large-xlsr-japanese-hiragana](https://huggingface.co/vumichien/wav2vec2-large-xlsr-japanese-hiragana).

## Training Procedure

Fine-tuning on the common_voice_11_0 dataset led to the following results:

| Step  | Training Loss | Validation Loss | WER      |
|-------|---------------|-----------------|----------|
| 1000  | 2.505600      | 1.009531        | 0.614952 |
| 2000  | 1.186900      | 0.752440        | 0.422948 |
| 3000  | 0.947700      | 0.658266        | 0.358543 |
| 4000  | 0.817700      | 0.656034        | 0.356308 |
| 5000  | 0.741300      | 0.623420        | 0.314537 |
| 6000  | 0.694700      | 0.624534        | 0.294018 |
| 7000  | 0.653400      | 0.603341        | 0.286735 |
| 8000  | 0.616200      | 0.606606        | 0.285132 |
| 9000  | 0.594800      | 0.596215        | 0.277422 |
| 10000 | 0.590500      | 0.603380        | 0.274949 |

### Training hyperparameters

The following hyperparameters were used throughout fine-tuning; an illustrative `TrainingArguments` sketch follows the list:

- learning_rate: 1e-4
- train_batch_size: 16
- eval_batch_size: 16
- seed: 42
- gradient_accumulation_steps: 2
- num_train_epochs: 30
- lr_scheduler_type: linear
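
Only the values listed above come from this card; mapping train_batch_size to `per_device_train_batch_size`, the `output_dir` value, and all unlisted arguments are assumptions, since the card does not specify them.

```python
from transformers import TrainingArguments

# minimal sketch: hyperparameters taken from the list above;
# output_dir is a placeholder, all other arguments keep their defaults
training_args = TrainingArguments(
    output_dir="hubert-base-japanese-asr",  # assumption
    learning_rate=1e-4,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    seed=42,
    gradient_accumulation_steps=2,
    num_train_epochs=30,
    lr_scheduler_type="linear",
)
```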

### How to evaluate the model

```python
import re

import torch
import torchaudio
import librosa
import numpy as np
import MeCab
import pykakasi
from datasets import load_dataset
from evaluate import load
from transformers import HubertForCTC, Wav2Vec2Processor

device = "cuda" if torch.cuda.is_available() else "cpu"

model = HubertForCTC.from_pretrained('TKU410410103/hubert-base-japanese-asr').to(device)
processor = Wav2Vec2Processor.from_pretrained("TKU410410103/hubert-base-japanese-asr")

# load dataset
test_dataset = load_dataset('mozilla-foundation/common_voice_11_0', 'ja', split='test')
remove_columns = [col for col in test_dataset.column_names if col not in ['audio', 'sentence']]
test_dataset = test_dataset.remove_columns(remove_columns)

# resample from Common Voice's 48 kHz to the model's 16 kHz
def process_waveforms(batch):
    speech_arrays = []
    sampling_rates = []

    for audio_path in batch['audio']:
        speech_array, _ = torchaudio.load(audio_path['path'])
        speech_array_resampled = librosa.resample(np.asarray(speech_array[0].numpy()), orig_sr=48000, target_sr=16000)
        speech_arrays.append(speech_array_resampled)
        sampling_rates.append(16000)

    batch["array"] = speech_arrays
    batch["sampling_rate"] = sampling_rates
    return batch

# normalize references: strip punctuation, then convert everything to hiragana
CHARS_TO_IGNORE = [",", "?", "¿", ".", "!", "¡", ";", ";", ":", '""', "%", '"', "�", "ʿ", "·", "჻", "~", "՞",
                   "؟", "،", "।", "॥", "«", "»", "„", "“", "”", "「", "」", "‘", "’", "《", "》", "(", ")", "[", "]",
                   "{", "}", "=", "`", "_", "+", "<", ">", "…", "–", "°", "´", "ʾ", "‹", "›", "©", "®", "—", "→", "。",
                   "、", "﹂", "﹁", "‧", "~", "﹏", ",", "{", "}", "(", ")", "[", "]", "【", "】", "‥", "〽",
                   "『", "』", "〝", "〟", "⟨", "⟩", "〜", ":", "!", "?", "♪", "؛", "/", "\\", "º", "−", "^", "'", "ʻ", "ˆ"]
chars_to_ignore_regex = f"[{re.escape(''.join(CHARS_TO_IGNORE))}]"

wakati = MeCab.Tagger("-Owakati")
kakasi = pykakasi.kakasi()
kakasi.setMode("J", "H")  # kanji to hiragana
kakasi.setMode("K", "H")  # katakana to hiragana
kakasi.setMode("r", "Hepburn")
conv = kakasi.getConverter()

def prepare_char(batch):
    batch["sentence"] = conv.do(wakati.parse(batch["sentence"]).strip())
    batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).strip()
    return batch

resampled_eval_dataset = test_dataset.map(process_waveforms, batched=True, batch_size=50, num_proc=4)
eval_dataset = resampled_eval_dataset.map(prepare_char, num_proc=4)

# begin the evaluation process
wer = load("wer")
cer = load("cer")

def evaluate(batch):
    inputs = processor(batch["array"], sampling_rate=16_000, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(inputs.input_values.to(device), attention_mask=inputs.attention_mask.to(device)).logits
    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_strings"] = processor.batch_decode(pred_ids)
    return batch

columns_to_remove = [column for column in eval_dataset.column_names if column != "sentence"]
batch_size = 16
result = eval_dataset.map(evaluate, remove_columns=columns_to_remove, batched=True, batch_size=batch_size)

wer_result = wer.compute(predictions=result["pred_strings"], references=result["sentence"])
cer_result = cer.compute(predictions=result["pred_strings"], references=result["sentence"])

print("WER: {:.2f}%".format(100 * wer_result))
print("CER: {:.2f}%".format(100 * cer_result))
```
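
Note that `prepare_char` converts the reference transcripts to hiragana before scoring, so the metrics below reflect hiragana-level accuracy rather than accuracy against the original kanji-bearing text.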

### Test results

The final model was evaluated on the common_voice_11_0 test set using the script above:

- WER: 27.511982%
- CER: 11.699897%

### Framework versions

- Transformers 4.39.1
- Pytorch 2.2.1+cu118
- Datasets 2.17.1