metadata
language:
- fr
tags:
- audio
- speech
- speaker-diarization
- medkit
- pyannote-audio
datasets:
- common_voice
- pxcorpus
- simsamu
metrics:
- der
Simsamu diarization pipeline
This repository contains a pretrained pyannote-audio diarization pipeline that was fine-tuned on the Simsamu dataset.
The pipeline uses a fine-tuned segmentation model based on https://huggingface.co/pyannote/segmentation-3.0 and pretrained embeddings from https://huggingface.co/pyannote/wespeaker-voxceleb-resnet34-LM. The pipeline hyperparameters were optimized.
The pipeline can be used in medkit as follows:
from medkit.core.audio import AudioDocument
from medkit.audio.segmentation.pa_speaker_detector import PASpeakerDetector
# init speaker detector operation
speaker_detector = PASpeakerDetector(
    model="medkit/simsamu-diarization",
    device=0,
    segmentation_batch_size=10,
    embedding_batch_size=10,
)
# create audio document
audio_doc = AudioDocument.from_file("path/to/audio.wav")
# apply operation on audio document
speech_segments = speaker_detector.run([audio_doc.raw_segment])
# display each speech turn and corresponding speaker
for speech_seg in speech_segments:
    speaker_attr = speech_seg.attrs.get(label="speaker")[0]
    print(speech_seg.span.start, speech_seg.span.end, speaker_attr.value)
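Once the speech turns are available, a common next step is to summarize how long each speaker talked. The following is a minimal sketch, not part of medkit: it assumes the turns have been collected as `(start, end, speaker)` tuples in seconds, in the same shape as the values printed by the loop above. The helper name `speaking_time_per_speaker` and the example turns are hypothetical.

```python
from collections import defaultdict


def speaking_time_per_speaker(turns):
    """Sum the duration of speech turns for each speaker label.

    `turns` is an iterable of (start, end, speaker) tuples, in seconds.
    This is a hypothetical helper for illustration, not a medkit function.
    """
    totals = defaultdict(float)
    for start, end, speaker in turns:
        totals[speaker] += end - start
    return dict(totals)


# Hypothetical turns, shaped like the values printed by the loop above
example_turns = [
    (0.0, 2.5, "SPEAKER_00"),
    (2.5, 4.0, "SPEAKER_01"),
    (4.0, 7.0, "SPEAKER_00"),
]
totals = speaking_time_per_speaker(example_turns)
```

With real output, the tuples could be built from `speech_seg.span.start`, `speech_seg.span.end` and `speaker_attr.value` instead of the hard-coded example.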
More info at https://medkit.readthedocs.io/
See also: Simsamu transcription model