---
language:
- en
license: apache-2.0
library_name: transformers
tags:
- multi-modal
- speech-language
datasets:
- mozilla-foundation/common_voice_16_1
- openslr/librispeech_asr
- MLCommons/ml_spoken_words
- Ar4ikov/iemocap_audio_text_splitted
metrics:
- wer
- accuracy
model-index:
- name: SpeechLLM
  results:
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: LibriSpeech (clean)
      type: librispeech_asr
      config: clean
      split: test
      args:
        language: en
    metrics:
    - type: wer
      value: 11.51
      name: Test WER
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: LibriSpeech (other)
      type: librispeech_asr
      config: other
      split: test
      args:
        language: en
    metrics:
    - type: wer
      value: 16.68
      name: Test WER
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: Common Voice 16.1
      type: common_voice_16_1
      split: test
      args:
        language: en
    metrics:
    - type: wer
      value: 26.02
      name: Test WER
  - task:
      type: audio-classification
      name: Audio Classification
    dataset:
      name: Common Voice 16.1
      type: common_voice_16_1
      split: test
      args:
        language: en
    metrics:
    - type: accuracy
      value: 64.98
      name: Test Age Accuracy
    - type: accuracy
      value: 81.21
      name: Test Accent Accuracy
---

# SpeechLLM

[![github](https://img.shields.io/badge/-Github-black?logo=github)](https://github.com/skit-ai/SpeechLLM.git)
[![Code License](https://img.shields.io/badge/Code%20License-Apache_2.0-green.svg)](https://github.com/skit-ai/SpeechLLM/blob/main/LICENSE)
[![Open in Colab](https://img.shields.io/badge/Open%20in%20Colab-F9AB00?logo=googlecolab&color=blue)](https://colab.research.google.com/drive/1uqhRl36LJKA4IxnrhplLMv0wQ_f3OuBM?usp=sharing)


![SpeechLLM](./speechllm.png)

SpeechLLM is a multi-modal LLM trained to predict metadata about the speaker's turn in a conversation. The speechllm-1.5B model combines a WavLM audio encoder with the TinyLlama LLM and predicts the following:
1. **SpeechActivity**: whether the audio signal contains speech (True/False)
2. **Transcript**: ASR transcript of the audio
3. **Gender** of the speaker (Female/Male)
4. **Age** of the speaker (Young/Middle-Age/Senior)
5. **Accent** of the speaker (Africa/America/Celtic/Europe/Oceania/South-Asia/South-East-Asia)
6. **Emotion** of the speaker (Happy/Sad/Anger/Neutral/Frustrated)
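
These six fields come back together as a single JSON-like object (see the Usage example below). Purely as an illustration, the schema can be written as a `TypedDict` for downstream validation; the `SpeechMeta` name and the use of `TypedDict` are hypothetical and not part of the model's code:

```python
# Hypothetical schema for the metadata object shown under Usage below.
from typing import Literal, TypedDict

class SpeechMeta(TypedDict):
    SpeechActivity: Literal["True", "False"]
    Transcript: str
    Gender: Literal["Female", "Male"]
    Age: Literal["Young", "Middle-Age", "Senior"]
    Accent: Literal["Africa", "America", "Celtic", "Europe", "Oceania",
                    "South-Asia", "South-East-Asia"]
    Emotion: Literal["Happy", "Sad", "Anger", "Neutral", "Frustrated"]
```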

## Usage
```python
# Load the model directly from Hugging Face
import torchaudio
from transformers import AutoModel

model = AutoModel.from_pretrained("skit-ai/speechllm-1.5B", trust_remote_code=True)

model.generate_meta(
    audio_path="path-to-audio.wav",  # 16 kHz, mono
    audio_tensor=torchaudio.load("path-to-audio.wav")[0],  # [Optional] pass either audio_path or audio_tensor
    instruction="Give me the following information about the audio [SpeechActivity, Transcript, Gender, Emotion, Age, Accent]",
    max_new_tokens=500,
    return_special_tokens=False
)

# Example model generation
'''
{
  "SpeechActivity": "True",
  "Transcript": "Yes, I got it. I'll make the payment now.",
  "Gender": "Female",
  "Emotion": "Neutral",
  "Age": "Young",
  "Accent": "America"
}
'''
```
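
`generate_meta` expects 16 kHz mono audio. If your source file is stereo or recorded at a different sample rate, a small preprocessing sketch like the one below (an illustration using `torchaudio`, not part of the model's API) can prepare the tensor first:

```python
# Sketch: resample/downmix a local file to 16 kHz mono before calling generate_meta.
import torchaudio

waveform, sample_rate = torchaudio.load("path-to-audio.wav")   # shape: (channels, samples)
if waveform.shape[0] > 1:                                      # downmix to mono
    waveform = waveform.mean(dim=0, keepdim=True)
if sample_rate != 16_000:                                      # resample to 16 kHz
    waveform = torchaudio.transforms.Resample(sample_rate, 16_000)(waveform)

output = model.generate_meta(
    audio_tensor=waveform,
    instruction="Give me the following information about the audio [SpeechActivity, Transcript, Gender, Emotion, Age, Accent]",
    max_new_tokens=500,
)
```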

Try the model in the [Google Colab Notebook](https://colab.research.google.com/drive/1uqhRl36LJKA4IxnrhplLMv0wQ_f3OuBM?usp=sharing). Also, check out our [blog](https://tech.skit.ai/speech-conversational-llms/) on SpeechLLM for end-to-end conversational agents (User Speech -> Response).

## Model Details

- **Developed by:** Skit AI
- **Authors:** [Shangeth Rajaa](https://huggingface.co/shangeth), [Abhinav Tushar](https://huggingface.co/lepisma)
- **Language:** English
- **Finetuned from model:** [WavLM](https://huggingface.co/microsoft/wavlm-large), [TinyLlama](https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0)
- **Model Size:** 1.5B parameters
- **Checkpoint:** 1,200k steps (batch size = 1)
- **Adapters:** r=8, alpha=16
- **Learning rate:** 1e-4
- **Gradient accumulation steps:** 8
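
The adapter hyperparameters above (r=8, alpha=16) suggest a LoRA-style setup. As a hedged illustration only, such a configuration could be expressed with the `peft` library as follows; the target modules and dropout are assumptions, not the authors' published training configuration:

```python
# Illustrative LoRA configuration matching the reported r=8, alpha=16.
# target_modules and lora_dropout are assumptions, not the published setup.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base_llm = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")
lora_config = LoraConfig(
    r=8,                                  # adapter rank (from Model Details)
    lora_alpha=16,                        # scaling factor (from Model Details)
    target_modules=["q_proj", "v_proj"],  # assumption: attention projections
    lora_dropout=0.05,                    # assumption
    task_type="CAUSAL_LM",
)
peft_llm = get_peft_model(base_llm, lora_config)
peft_llm.print_trainable_parameters()
```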


## Checkpoint Result

|         **Dataset**        |       **Type**      | **Word Error Rate (%)** | **Gender Acc.** | **Age Acc.** | **Accent Acc.** |
|:--------------------------:|:-------------------:|:-----------------------:|:---------------:|:------------:|:---------------:|
| **librispeech-test-clean** | Read Speech         |          11.51          |      0.9594     |      –       |        –        |
| **librispeech-test-other** | Read Speech         |          16.68          |      0.9297     |      –       |        –        |
| **CommonVoice test**       | Diverse Accent, Age |          26.02          |      0.9476     |    0.6498    |      0.8121     |
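
The WER values in the table are percentages and the accuracy columns are fractions. Below is a small sketch of how these numbers could be recomputed from model transcripts and labels, using the Hugging Face `evaluate` library (a tooling assumption, not the authors' evaluation script):

```python
# Sketch: recomputing WER and accuracy from model outputs with `evaluate`.
import evaluate

wer_metric = evaluate.load("wer")
acc_metric = evaluate.load("accuracy")

references  = ["yes i got it i'll make the payment now"]     # ground-truth transcripts
predictions = ["yes i got it i will make the payment now"]   # model transcripts

wer = wer_metric.compute(references=references, predictions=predictions)
print(f"WER: {wer * 100:.2f}%")   # the table reports WER as a percentage

# Age/accent accuracy over integer label IDs (illustrative values):
acc = acc_metric.compute(references=[0, 1, 2], predictions=[0, 1, 1])
print(f"Accuracy: {acc['accuracy']:.4f}")   # the table reports accuracy as a fraction
```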


## Cite
```bibtex
@misc{Rajaa_SpeechLLM_Multi-Modal_LLM,
  author = {Rajaa, Shangeth and Tushar, Abhinav},
  title  = {{SpeechLLM: Multi-Modal LLM for Speech Understanding}},
  url    = {https://github.com/skit-ai/SpeechLLM}
}
```