---
language: ko
datasets:
- aihub 자유대화 음성(노인남녀)
tags:
- automatic-speech-recognition
license: apache-2.0
---


# wav2vec2-xlsr-korean-senior

Further fine-tuned [fleek/wav2vec-large-xlsr-korean](https://huggingface.co/fleek/wav2vec-large-xlsr-korean) on the [AIhub 자유대화 음성(노인남녀)](https://aihub.or.kr/aidata/30704) dataset (free-conversation speech of elderly men and women).

- Total training data size: 808,642
- Total validation data size: 159,970

When using this model, make sure that your speech input is sampled at 16 kHz.
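
If your recordings use a different sampling rate, resample them to 16 kHz first. A minimal sketch with torchaudio (`recording.wav` is a placeholder path):

```py
import torchaudio
import torchaudio.functional as F

# Placeholder path; replace with your own recording.
speech_array, sampling_rate = torchaudio.load("recording.wav")
if sampling_rate != 16000:
    # Resample to the 16 kHz rate the model expects.
    speech_array = F.resample(speech_array, orig_freq=sampling_rate, new_freq=16000)
```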

The script used for training can be found here: https://github.com/hyyoka/wav2vec2-korean-senior


### Inference

```py
import re

import torch
import torchaudio
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

def clean_up(transcription):
    # Keep only Hangul characters and spaces.
    hangul = re.compile('[^ ㄱ-ㅣ가-힣]+')
    result = hangul.sub('', transcription)
    return result

model_name = "hyyoka/wav2vec2-xlsr-korean-senior"
processor = Wav2Vec2Processor.from_pretrained(model_name)
model = Wav2Vec2ForCTC.from_pretrained(model_name)

# wav_file: path to a 16 kHz recording (see the resampling snippet above).
speech_array, sampling_rate = torchaudio.load(wav_file)
feat = processor(speech_array[0],
                 sampling_rate=16000,
                 padding=True,
                 max_length=800000,
                 truncation=True,
                 return_attention_mask=True,
                 return_tensors="pt")
inputs = {'input_values': feat['input_values'], 'attention_mask': feat['attention_mask']}

with torch.no_grad():
    outputs = model(**inputs)
logits = outputs.logits
# Greedy CTC decoding: take the most likely token at each frame.
predicted_ids = logits.argmax(dim=-1)
transcription = processor.decode(predicted_ids[0])
stt_result = clean_up(transcription)
```
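
For quick experiments, the transformers `pipeline` API wraps the same steps (loading, feature extraction, and decoding) in one call. A short sketch, reusing `clean_up` from above; `recording.wav` is again a placeholder:

```py
from transformers import pipeline

# The ASR pipeline handles audio decoding, feature extraction, and greedy decoding internally.
asr = pipeline("automatic-speech-recognition", model="hyyoka/wav2vec2-xlsr-korean-senior")
print(clean_up(asr("recording.wav")["text"]))
```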