Bagus/speecht5_finetuned_commonvoice_id

Bagus committed on Aug 29

Commit d699ebc • 1 Parent(s): 9b7bbc7

Update README.md

Files changed (1)

README.md +68 -7

@@ -21,19 +21,80 @@ This model is a fine-tuned version of [microsoft/speecht5_tts](https://huggingfa
 It achieves the following results on the evaluation set:
 - Loss: 0.4675
-## Model description
-More information needed
-## Intended uses & limitations
-More information needed
-## Training and evaluation data
-More information needed
-## Training procedure
 ### Training hyperparameters

 It achieves the following results on the evaluation set:
 - Loss: 0.4675
+## How to use/inference
+Follow the example below and adapt it to your needs.
+```
+# ft_t5_id_inference.py
+import sounddevice as sd
+import torch
+import torchaudio
+from datasets import Audio, load_dataset
+from transformers import (
+    SpeechT5ForTextToSpeech,
+    SpeechT5HifiGan,
+    SpeechT5Processor,
+)
+from utils import create_speaker_embedding  # local helper module (utils.py), not a published package
+# load dataset and pre-trained model
+dataset = load_dataset(
+    "mozilla-foundation/common_voice_16_1", "id", split="test")
+model = SpeechT5ForTextToSpeech.from_pretrained(
+    "Bagus/speecht5_finetuned_commonvoice_id")
+# load the processor (tokenizer and feature extractor) from the base checkpoint
+checkpoint = "microsoft/speecht5_tts"
+processor = SpeechT5Processor.from_pretrained(checkpoint)
+sampling_rate = processor.feature_extractor.sampling_rate
+dataset = dataset.cast_column("audio", Audio(sampling_rate=sampling_rate))
+def prepare_dataset(example):
+    audio = example["audio"]
+    example = processor(
+        text=example["sentence"],
+        audio_target=audio["array"],
+        sampling_rate=audio["sampling_rate"],
+        return_attention_mask=False,
+    )
+    # strip off the batch dimension
+    example["labels"] = example["labels"][0]
+    # use SpeechBrain to obtain x-vector
+    example["speaker_embeddings"] = create_speaker_embedding(audio["array"])
+    return example
+# compute the speaker embedding from one example of the test split
+example = prepare_dataset(dataset[30])
+speaker_embeddings = torch.tensor(example["speaker_embeddings"]).unsqueeze(0)
+# prepare text to be converted to speech
+text = "Saya suka baju yang berwarna merah tua."
+inputs = processor(text=text, return_tensors="pt")
+vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")
+speech = model.generate_speech(
+    inputs["input_ids"], speaker_embeddings, vocoder=vocoder)
+# play the result; sampling_rate (16 kHz) was already read from the processor above
+sd.play(speech.numpy(), samplerate=sampling_rate, blocking=True)
+# save the audio; torchaudio expects a 2-D (channels, samples) tensor
+torchaudio.save("output_t5_ft_cv16_id.wav", speech.unsqueeze(0), sampling_rate)
+```
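The script above imports `create_speaker_embedding` from a local `utils` module that this commit does not include. A minimal sketch of what such a helper might look like follows. It assumes the SpeechBrain x-vector checkpoint `speechbrain/spkrec-xvect-voxceleb` (the README only says "use SpeechBrain to obtain x-vector" and does not name the encoder), and the optional `encoder` argument and `load_speaker_model` function are hypothetical additions for illustration, not part of the author's code.

```python
# utils.py (hypothetical sketch, not the author's file)
import os

import torch

# Assumption: the README does not say which SpeechBrain encoder was used.
SPK_MODEL_NAME = "speechbrain/spkrec-xvect-voxceleb"

_speaker_model = None  # cached encoder, loaded on first use


def load_speaker_model():
    """Load the SpeechBrain x-vector encoder once and cache it."""
    global _speaker_model
    if _speaker_model is None:
        # Imported lazily so this module can be imported without SpeechBrain.
        try:
            from speechbrain.inference.speaker import EncoderClassifier
        except ImportError:  # older SpeechBrain releases
            from speechbrain.pretrained import EncoderClassifier
        device = "cuda" if torch.cuda.is_available() else "cpu"
        _speaker_model = EncoderClassifier.from_hparams(
            source=SPK_MODEL_NAME,
            run_opts={"device": device},
            savedir=os.path.join("/tmp", SPK_MODEL_NAME),
        )
    return _speaker_model


def create_speaker_embedding(waveform, encoder=None):
    """Return an L2-normalized speaker embedding as a 1-D NumPy array.

    `encoder` is a hypothetical hook: any object with an
    `encode_batch(tensor)` method; defaults to the cached SpeechBrain model.
    """
    if encoder is None:
        encoder = load_speaker_model()
    with torch.no_grad():
        wav = torch.as_tensor(waveform, dtype=torch.float32)
        embedding = encoder.encode_batch(wav)  # (batch, 1, emb_dim)
        embedding = torch.nn.functional.normalize(embedding, dim=2)
    return embedding.squeeze().cpu().numpy()
```

With a helper along these lines saved next to the inference script, `create_speaker_embedding(audio["array"])` returns a vector that `torch.tensor(...).unsqueeze(0)` turns into the `(1, emb_dim)` tensor passed to `generate_speech`.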
 ### Training hyperparameters