D4ve-R committed
Commit: 08ef81e
Parent: a4c4fcf

Update README.md

Files changed (1)
  1. README.md +25 -23
README.md CHANGED
@@ -34,29 +34,31 @@ The original model can be found under https://github.com/microsoft/unilm/tree/ma
  The model is fine-tuned on the [VoxCeleb1 dataset](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox1.html) using an X-Vector head with an Additive Margin Softmax loss
  [X-Vectors: Robust DNN Embeddings for Speaker Recognition](https://www.danielpovey.com/files/2018_icassp_xvectors.pdf)
  # Usage
- ## Speaker Verification
- ```python
- from transformers import Wav2Vec2FeatureExtractor, WavLMForXVector
- from datasets import load_dataset
- import torch
-
- dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
-
- feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained('microsoft/wavlm-base-plus-sv')
- model = WavLMForXVector.from_pretrained('microsoft/wavlm-base-plus-sv')
-
- # audio files are decoded on the fly
- audio = [x["array"] for x in dataset[:2]["audio"]]
- inputs = feature_extractor(audio, padding=True, return_tensors="pt")
- embeddings = model(**inputs).embeddings
- embeddings = torch.nn.functional.normalize(embeddings, dim=-1).cpu()
-
- # the resulting embeddings can be used for cosine similarity-based retrieval
- cosine_sim = torch.nn.CosineSimilarity(dim=-1)
- similarity = cosine_sim(embeddings[0], embeddings[1])
- threshold = 0.86 # the optimal threshold is dataset-dependent
- if similarity < threshold:
-     print("Speakers are not the same!")
+ ## Speaker Embeddings
+ ```javascript
+ import { AutoProcessor, AutoModel, read_audio } from '@xenova/transformers';
+
+ const processor = await AutoProcessor.from_pretrained('D4ve-R/wavlm-base-plus-sv');
+ const url = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/jfk.wav';
+ const audio = await read_audio(url, 16000);
+ const inputs = await processor(audio);
+
+ const model = await AutoModel.from_pretrained('D4ve-R/wavlm-base-plus-sv', {quantized: false});
+ const embeddings = await model(inputs);
+ // {
+ //   embeddings: Tensor {
+ //     dims: [ 1, 512 ],
+ //     type: 'float32',
+ //     data: Float32Array(512) [ -0.349443256855011, -0.39341306686401367, 0.022836603224277496, ... ],
+ //     size: 512
+ //   },
+ //   logits: Tensor {
+ //     dims: [ 1, 512 ],
+ //     type: 'float32',
+ //     data: Float32Array(512) [ -0.349443256855011, -0.39341306686401367, 0.022836603224277496, ... ],
+ //     size: 512
+ //   }
+ // }
  ```
 
  # License
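
For completeness, here is a minimal sketch (not part of this commit) of how the embeddings from the new JavaScript example could feed the same cosine-similarity check the removed Python example performed. The `embed` and `cosineSimilarity` helpers and the two speaker `.wav` URLs are illustrative assumptions; the `0.86` threshold is carried over from the removed snippet and, as that snippet noted, the optimal value is dataset-dependent.

```javascript
import { AutoProcessor, AutoModel, read_audio } from '@xenova/transformers';

const processor = await AutoProcessor.from_pretrained('D4ve-R/wavlm-base-plus-sv');
const model = await AutoModel.from_pretrained('D4ve-R/wavlm-base-plus-sv', { quantized: false });

// Illustrative helper: embed one 16 kHz audio file and return the raw
// Float32Array (length 512, per the output dump in the README example).
async function embed(url) {
  const audio = await read_audio(url, 16000);
  const inputs = await processor(audio);
  const { embeddings } = await model(inputs);
  return embeddings.data;
}

// Plain cosine similarity between two equal-length vectors.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; ++i) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Placeholder URLs: substitute two recordings to compare.
const emb1 = await embed('https://example.com/speaker1.wav');
const emb2 = await embed('https://example.com/speaker2.wav');

const similarity = cosineSimilarity(emb1, emb2);
const threshold = 0.86; // the optimal threshold is dataset-dependent
if (similarity < threshold) {
  console.log('Speakers are not the same!');
}
```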