D4ve-R committed
Commit: 08ef81e
Parent: a4c4fcf

Update README.md

Files changed (1)
  1. README.md +25 -23
README.md CHANGED
@@ -34,29 +34,31 @@ The original model can be found under https://github.com/microsoft/unilm/tree/ma
  The model is fine-tuned on the [VoxCeleb1 dataset](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox1.html) using an X-Vector head with an Additive Margin Softmax loss
  [X-Vectors: Robust DNN Embeddings for Speaker Recognition](https://www.danielpovey.com/files/2018_icassp_xvectors.pdf)
  # Usage
- ## Speaker Verification
- ```python
- from transformers import Wav2Vec2FeatureExtractor, WavLMForXVector
- from datasets import load_dataset
- import torch
-
- dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
-
- feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained('microsoft/wavlm-base-plus-sv')
- model = WavLMForXVector.from_pretrained('microsoft/wavlm-base-plus-sv')
-
- # audio files are decoded on the fly
- audio = [x["array"] for x in dataset[:2]["audio"]]
- inputs = feature_extractor(audio, padding=True, return_tensors="pt")
- embeddings = model(**inputs).embeddings
- embeddings = torch.nn.functional.normalize(embeddings, dim=-1).cpu()
-
- # the resulting embeddings can be used for cosine similarity-based retrieval
- cosine_sim = torch.nn.CosineSimilarity(dim=-1)
- similarity = cosine_sim(embeddings[0], embeddings[1])
- threshold = 0.86 # the optimal threshold is dataset-dependent
- if similarity < threshold:
-     print("Speakers are not the same!")
+ ## Speaker Embeddings
+ ```javascript
+ import { AutoProcessor, AutoModel, read_audio } from '@xenova/transformers';
+
+ const processor = await AutoProcessor.from_pretrained('D4ve-R/wavlm-base-plus-sv');
+ const url = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/jfk.wav';
+ const audio = await read_audio(url, 16000);
+ const inputs = await processor(audio);
+
+ const model = await AutoModel.from_pretrained('D4ve-R/wavlm-base-plus-sv', {quantized: false});
+ const embeddings = await model(inputs);
+ // {
+ //   embeddings: Tensor {
+ //     dims: [ 1, 512 ],
+ //     type: 'float32',
+ //     data: Float32Array(512) [ -0.349443256855011, -0.39341306686401367, 0.022836603224277496, ... ],
+ //     size: 512
+ //   },
+ //   logits: Tensor {
+ //     dims: [ 1, 512 ],
+ //     type: 'float32',
+ //     data: Float32Array(512) [ -0.349443256855011, -0.39341306686401367, 0.022836603224277496, ... ],
+ //     size: 512
+ //   }
+ // }
  ```
 
  # License
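
For completeness, here is a minimal sketch (not part of this commit) of how the embeddings from the new JavaScript example could feed the same cosine-similarity check the removed Python example performed. The `embed` and `cosineSimilarity` helpers and the two speaker `.wav` URLs are illustrative assumptions; the `0.86` threshold is carried over from the removed snippet and, as that snippet noted, the optimal value is dataset-dependent.

```javascript
import { AutoProcessor, AutoModel, read_audio } from '@xenova/transformers';

const processor = await AutoProcessor.from_pretrained('D4ve-R/wavlm-base-plus-sv');
const model = await AutoModel.from_pretrained('D4ve-R/wavlm-base-plus-sv', { quantized: false });

// Illustrative helper: embed one 16 kHz audio file and return the raw
// Float32Array (length 512, per the output dump in the README example).
async function embed(url) {
  const audio = await read_audio(url, 16000);
  const inputs = await processor(audio);
  const { embeddings } = await model(inputs);
  return embeddings.data;
}

// Plain cosine similarity between two equal-length vectors.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; ++i) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Placeholder URLs: substitute two recordings to compare.
const emb1 = await embed('https://example.com/speaker1.wav');
const emb2 = await embed('https://example.com/speaker2.wav');

const similarity = cosineSimilarity(emb1, emb2);
const threshold = 0.86; // the optimal threshold is dataset-dependent
if (similarity < threshold) {
  console.log('Speakers are not the same!');
}
```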