File size: 4,091 Bytes
b6b2481
9d5f837
e75dcb6
9d5f837
5d791b2
 
 
 
 
9d5f837
 
 
 
 
 
 
5d791b2
 
 
9d5f837
5d791b2
9d5f837
5d791b2
 
 
 
 
9d5f837
 
 
5d791b2
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
d13b71e
 
 
5d791b2
 
 
61c2cc4
5d791b2
d13b71e
61c2cc4
5d791b2
 
 
 
 
 
 
 
 
9b7e9c3
5d791b2
 
 
00e8f88
61c2cc4
 
 
 
 
 
 
 
3009006
61c2cc4
d13b71e
61c2cc4
 
ae2e3a0
61c2cc4
 
e8409d9
61c2cc4
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
5d791b2
0fb42e9
5d791b2
 
 
 
 
 
452250c
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
---
language:
- yue
license: apache-2.0
tags:
- audio
- automatic-speech-recognition
- speech
- xlsr-fine-tuning-week
datasets:
- common_voice
metrics:
- cer
language_bcp47:
- zh-HK
base_model: facebook/wav2vec2-large-xlsr-53
model-index:
- name: wav2vec2-large-xlsr-cantonese
  results:
  - task:
      type: automatic-speech-recognition
      name: Speech Recognition
    dataset:
      name: Common Voice zh-HK
      type: common_voice
      args: zh-HK
    metrics:
    - type: cer
      value: 15.36
      name: Test CER
---

# Wav2Vec2-Large-XLSR-53-Cantonese

Fine-tuned [facebook/wav2vec2-large-xlsr-53](https://huggingface.co/facebook/wav2vec2-large-xlsr-53) on Cantonese using the [Common Voice](https://huggingface.co/datasets/common_voice).
When using this model, make sure that your speech input is sampled at 16kHz.

## Usage

The model can be used directly (without a language model) as follows:

```python
import torch
import torchaudio
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

test_dataset = load_dataset("common_voice", "zh-HK", split="test[:2%]")

processor = Wav2Vec2Processor.from_pretrained("ctl/wav2vec2-large-xlsr-cantonese") 
model = Wav2Vec2ForCTC.from_pretrained("ctl/wav2vec2-large-xlsr-cantonese")

resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocessing the datasets.
# We need to read the aduio files as arrays
def speech_file_to_array_fn(batch):
	speech_array, sampling_rate = torchaudio.load(batch["path"])
	batch["speech"] = resampler(speech_array).squeeze().numpy()
	return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)
inputs = processor(test_dataset["speech"][:2], sampling_rate=16_000, return_tensors="pt", padding=True)

with torch.no_grad():
	logits = model(inputs.input_values, attention_mask=inputs.attention_mask).logits

predicted_ids = torch.argmax(logits, dim=-1)

print("Prediction:", processor.batch_decode(predicted_ids))
print("Reference:", test_dataset["sentence"][:2])
```


## Evaluation

The model can be evaluated as follows on the Chinese (Hong Kong) test data of Common Voice. 


```python
!pip install jiwer
import torch
import torchaudio
from datasets import load_dataset, load_metric
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import re
import argparse

lang_id = "zh-HK" 
model_id = "ctl/wav2vec2-large-xlsr-cantonese"

chars_to_ignore_regex = '[\,\?\.\!\-\;\:"\“\%\‘\”\�\.\⋯\!\-\:\–\。\》\,\)\,\?\;\~\~\…\︰\,\(\」\‧\《\﹔\、\—\/\,\「\﹖\·\']'

test_dataset = load_dataset("common_voice", f"{lang_id}", split="test") 
cer = load_metric("cer")

processor = Wav2Vec2Processor.from_pretrained(f"{model_id}") 
model = Wav2Vec2ForCTC.from_pretrained(f"{model_id}") 
model.to("cuda")

resampler = torchaudio.transforms.Resample(48_000, 16_000)

# Preprocessing the datasets.
# We need to read the aduio files as arrays
def speech_file_to_array_fn(batch):
    batch["sentence"] = re.sub(chars_to_ignore_regex, '', batch["sentence"]).lower()
    speech_array, sampling_rate = torchaudio.load(batch["path"])
    batch["speech"] = resampler(speech_array).squeeze().numpy()
    return batch

test_dataset = test_dataset.map(speech_file_to_array_fn)

# Preprocessing the datasets.
# We need to read the aduio files as arrays
def evaluate(batch):
    inputs = processor(batch["speech"], sampling_rate=16_000, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(inputs.input_values.to("cuda"), attention_mask=inputs.attention_mask.to("cuda")).logits

    pred_ids = torch.argmax(logits, dim=-1)
    batch["pred_strings"] = processor.batch_decode(pred_ids)
    return batch

result = test_dataset.map(evaluate, batched=True, batch_size=16)

print("CER: {:2f}".format(100 * cer.compute(predictions=result["pred_strings"], references=result["sentence"])))
```


**Test Result**: 15.51 % 


## Training

The Common Voice `train`, `validation` were used for training.

The script used for training will be posted [here](https://github.com/chutaklee/CantoASR)