
Hibiki ASR Phonemizer

This model is a phoneme-level speech recognition network, originally fine-tuned from openai/whisper-large-v3 on a mixture of different Japanese datasets.

In addition to ordinary speech, it can:

  • detect and transcribe non-speech sounds such as gasps, erotic moans, laughter, etc.
  • add punctuation more faithfully.

A grapheme decoder head (i.e. one that outputs normal Japanese text) will probably be trained as well, though going directly from audio to phonemes results in a more accurate representation for Japanese.

Inference and Post-processing (highly recommended to check the notebook below!)


# This function was borrowed and modified from Aaron Yinghao Li, the author of the StyleTTS paper.

from datasets import Dataset, Audio
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import re
import pykakasi

kana_mapper = dict([
    ("ゔぁ","ba"),
    # ...
    # etc. Take a look at the Notebook for the whole mapping.
    ("ぉ"," o"),
    ("ゎ"," ɯa"),

    ("を","o")
])


def post_fix(text):
    # Map kana sequences to their phonemic representation.
    # Insertion order matters: multi-character entries are applied
    # before the single characters they contain.
    for k, v in kana_mapper.items():
        text = text.replace(k, v)

    return text
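
A quick sanity check against the entries shown above:

print(post_fix("を"))    # -> "o"
print(post_fix("ゎ"))    # -> " ɯa" (note the leading space encoded in the mapping)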


processor = WhisperProcessor.from_pretrained("openai/whisper-large-v3")
model = WhisperForConditionalGeneration.from_pretrained("Respair/Hibiki_ASR_Phonemizer").to("cuda:0")

forced_decoder_ids = processor.get_decoder_prompt_ids(task="transcribe", language='japanese')




def convert_to_kana(text):
    kks = pykakasi.kakasi()

    def convert_word(word):
        # pykakasi returns a list of segments; join their hiragana readings
        result = kks.convert(word)
        return ''.join(item['hira'] for item in result)

    # Split on runs of characters outside the kana/kanji Unicode ranges,
    # keeping the delimiters so non-Japanese spans survive untouched
    parts = re.split(r'([^\u3000-\u30ff\u3400-\u4dbf\u4e00-\u9fff]+)', text)

    # Convert only the parts that actually contain kana/kanji
    converted_parts = [convert_word(part) if re.match(r'[\u3000-\u30ff\u3400-\u4dbf\u4e00-\u9fff]', part) else part for part in parts]

    return ''.join(converted_parts)
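
For example (the readings below assume pykakasi's default dictionary):

print(convert_to_kana("漢字 desu"))   # -> "かんじ desu" (kanji folded to hiragana, ASCII passes through)
print(convert_to_kana("カタカナ"))    # -> "かたかな" (katakana folded to hiragana)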


sample = Dataset.from_dict({"audio": ["/content/kl_chunk1987.wav"]}).cast_column("audio", Audio(16000))
sample = sample[0]['audio']

# Ensure the input features are on the same device as the model
input_features = processor(sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt").input_features.to("cuda:0")

# generate token ids
predicted_ids = model.generate(input_features, forced_decoder_ids=forced_decoder_ids, repetition_penalty=1.2)
# decode token ids to text
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)


# You can add your final adjustments here; a dict would be cleaner, but this is just a quick demonstration.

if ' neɽitai ' in transcription[0]:
    transcription[0] = transcription[0].replace(' neɽitai ', " naɽitai ")

if 'harɯdʑisama' in transcription[0]:
    transcription[0] = transcription[0].replace('harɯdʑisama', "arɯdʑisama")


if 'tɕabiʔto' in transcription[0]:
    transcription[0] = transcription[0].replace('tɕabiʔto', "tɕabiʔto")


if "ki ni ɕinai" in transcription[0]:
    transcription[0] = re.sub(r'(?<!\s)ki ni ɕinai', r' ki ni ɕinai', transcription[0])

if 'ʔt' in transcription[0]:
    transcription[0] = re.sub(r'(?<!\s)ʔt', r'ʔt', transcription[0])

if 'de aɽoɯ' in transcription[0]:
    transcription[0] = re.sub(r'(?<!\s)de aɽoɯ', r' de aɽoɯ', transcription[0])

if ".ʔ" in transcription[0]:
    transcription[0] = transcription[0].replace(".ʔ","..")

if "ʔ." in transcription[0]:
    transcription[0] = transcription[0].replace("ʔ.",".")

transcription[0] = convert_to_kana(transcription[0]) # Fold any hallucinated kana/kanji back to hiragana so post_fix can phonemize them.

post_fix(transcription[0].lstrip())
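
Putting the steps above together, a minimal end-to-end helper might look like this (a sketch reusing the objects defined above; transcribe_file is a hypothetical name, and the ad-hoc string fixes are omitted for brevity):

def transcribe_file(path):
    # Load the file and resample to 16 kHz, exactly as above
    audio = Dataset.from_dict({"audio": [path]}).cast_column("audio", Audio(16000))[0]["audio"]
    feats = processor(audio["array"], sampling_rate=audio["sampling_rate"], return_tensors="pt").input_features.to("cuda:0")
    ids = model.generate(feats, forced_decoder_ids=forced_decoder_ids, repetition_penalty=1.2)
    text = processor.batch_decode(ids, skip_special_tokens=True)[0]
    # Fold any hallucinated kana/kanji to hiragana, then map kana to phonemes
    return post_fix(convert_to_kana(text).lstrip())

print(transcribe_file("/content/kl_chunk1987.wav"))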

The full code -> Notebook

Intended uses & limitations

No restrictions are imposed by me, but proceed at your own risk; the user (you) is entirely responsible for their actions.

Training and evaluation data

  • Japanese Common Voice 17
  • ehehe Corpus
  • Custom Game and Anime dataset (around 8 hours)

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 1e-05
  • train_batch_size: 24
  • eval_batch_size: 8
  • seed: 42
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • lr_scheduler_warmup_steps: 500
  • training_steps: 5000
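
For reference, a hedged sketch of how these settings might map onto transformers' Seq2SeqTrainingArguments (the actual training script is not included in this card; output_dir is a hypothetical placeholder):

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./hibiki_asr_phonemizer",  # hypothetical placeholder
    learning_rate=1e-5,
    per_device_train_batch_size=24,
    per_device_eval_batch_size=8,
    seed=42,
    lr_scheduler_type="linear",
    warmup_steps=500,
    max_steps=5000,
    bf16=True,  # matches the BF16 compute noted below
)
# Adam with betas=(0.9, 0.999) and epsilon=1e-8 is the trainer's default optimizer.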

Compute and Duration

  • 1x A100 (40GB)
  • 64 GB RAM
  • BF16
  • 14 hours

Framework versions

  • Transformers 4.41.1
  • PyTorch 2.4.0+cu121
  • Datasets 2.19.1
  • Tokenizers 0.19.1
