File size: 1,580 Bytes
11f413e
0ca4d79
11f413e
 
 
 
8c0524d
 
 
 
 
 
 
 
 
 
 
4275b0f
 
 
300a419
4275b0f
f3692a8
300a419
4275b0f
f3692a8
300a419
3c051dd
4275b0f
 
 
 
300a419
4275b0f
8c0524d
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
---
license: llama3
language:
- en
- hi
---

`Shuka v1` is a language model which natively understands audio in Indic languages. It is an encoder-decoder model built by combining two models:
- Our state-of-the-art, in-house, audio encoder: Saaras v1
- Meta’s Llama3-8B-Instruct as the decoder

The encoder and decoder are connected by a small projector with ~60M parameters. During training, only the projector weights are finetuned while the rest of the network is frozen. Following our tradition of training models frugally, we train `Shuka v1` on less than 100 hours of audio.

Though we only finetune the projector on English and Hindi data, the multilingual nature of our encoder makes `Shuka v1` perform well on zero-shot QA in other Indic languages as well. We have tested on the model on Bengali, English, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, and Telugu.

You can get started by using huggingface pipeline, as follows:

```
import transformers
import librosa

# load the model pipeline on gpu:0
pipe = transformers.pipeline(model='sarvamai/shuka_v1', trust_remote_code=True, device=0)

# get a sample audio
# wget https://huggingface.co/sarvamai/shuka_v1/resolve/main/hi-question.webm

audio, sr = librosa.load("./hi-question.webm", sr=16000)
turns = [
          {'role': 'system', 'content': 'Respond naturally and informatively.'},
          {'role': 'user', 'content': '<|audio|>'}
        ]

pipe({'audio': audio, 'turns': turns, 'sampling_rate': sr}, max_new_tokens=512)
```

For more details, please see our blog (link coming soon).