---
license: mit
language:
- en
library_name: mlx
pipeline_tag: text-to-speech
tags:
- nlp
- tts
- bark
---

## Model Summary

Bark is a transformer-based text-to-audio model that can generate speech as well as other audio, such as background noise and music.

This is a port of Suno's Bark model to MLX, Apple's machine learning framework. The port is intended to explore whether fast on-device TTS inference is feasible.

This repository contains the Bark weights in `npz` format, suitable for use with MLX.
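
Because an `.npz` file is a zip archive with one `.npy` member per array, the downloaded weight files can be inspected without loading any tensor data. The following is a minimal sketch using only the Python standard library. The helper names `list_npz_arrays` and `summarize_weights` are hypothetical and are not part of the MLX Bark repository.

```python
import zipfile
from pathlib import Path


def list_npz_arrays(path):
    """Return the array names stored in an .npz archive without loading the data."""
    with zipfile.ZipFile(path) as archive:
        # Each array is stored as a "<name>.npy" member of the zip archive.
        return sorted(
            name[: -len(".npy")]
            for name in archive.namelist()
            if name.endswith(".npy")
        )


def summarize_weights(directory):
    """Map each .npz file name in `directory` to the array names it contains."""
    return {
        p.name: list_npz_arrays(p)
        for p in sorted(Path(directory).glob("*.npz"))
    }
```

For example, `summarize_weights("weights/")` lists the arrays in every `.npz` file found in the download directory used in the Usage section. To read the arrays themselves, a loader such as `numpy.load` or `mlx.core.load` would be needed.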

## Repo links

- [Original Repo](https://github.com/suno-ai/bark)
- [MLX Bark Repo](https://github.com/j-csc/mlx_bark)

## Usage

```bash
# Setup
pip install transformers huggingface_hub hf_transfer
git clone https://github.com/j-csc/mlx_bark
cd mlx_bark
pip install -r requirements.txt

# Download model
export HF_HUB_ENABLE_HF_TRANSFER=1
huggingface-cli download --local-dir-use-symlinks False --local-dir weights/ mlx-community/mlx_bark

# Run example (large model)
python model.py --text="Hello world!" --path weights/ --model large
```
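
The download step can also be done from Python. The sketch below assumes the `huggingface_hub` package installed in the setup step and uses its `snapshot_download` function. The wrapper name `download_bark_weights` is hypothetical, and the defaults mirror the CLI command above.

```python
def download_bark_weights(local_dir="weights/", repo_id="mlx-community/mlx_bark"):
    """Download the repository snapshot into `local_dir` and return the local path."""
    # Imported lazily so this module can be imported without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    # Fetches every file in the repository; requires network access.
    return snapshot_download(repo_id=repo_id, local_dir=local_dir)
```

Calling `download_bark_weights()` from inside the cloned `mlx_bark` directory should leave the weights where the `--path weights/` argument expects them.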

The rest of this model card is copied from [the original Bark repository](https://huggingface.co/suno/bark).

## Model Details

The following is additional information about the models released here.

Bark is a series of three transformer models that turn text into audio.

### Text to semantic tokens

- Input: text, tokenized with the [BERT tokenizer from Hugging Face](https://huggingface.co/docs/transformers/model_doc/bert#transformers.BertTokenizer)
- Output: semantic tokens that encode the audio to be generated

### Semantic to coarse tokens

- Input: semantic tokens
- Output: tokens from the first two codebooks of the [EnCodec codec](https://github.com/facebookresearch/encodec) from Facebook

### Coarse to fine tokens

- Input: the first two codebooks from EnCodec
- Output: 8 codebooks from EnCodec
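
The data flow through the three stages can be sketched as follows. This is a shape-only illustration, not the actual models: every function is a hypothetical placeholder that emits random tokens, and both the one-token-per-word and one-frame-per-token ratios are assumptions made purely to keep the example small. Decoding the final codes into a waveform is omitted.

```python
import random

SEMANTIC_VOCAB = 10_000  # output vocab of the text-to-semantic stage
CODEBOOK_SIZE = 1_024    # entries per EnCodec codebook
N_COARSE, N_FINE = 2, 8  # coarse stage covers 2 codebooks, fine stage yields 8


def text_to_semantic(text, rng):
    # Placeholder: one random semantic token per whitespace-separated word.
    return [rng.randrange(SEMANTIC_VOCAB) for _ in text.split()]


def semantic_to_coarse(semantic, rng):
    # Placeholder: one frame per semantic token, for each of the 2 coarse codebooks.
    return [[rng.randrange(CODEBOOK_SIZE) for _ in semantic] for _ in range(N_COARSE)]


def coarse_to_fine(coarse, rng):
    # Keep the coarse codebooks and fill in the remaining ones.
    n_frames = len(coarse[0])
    extra = [
        [rng.randrange(CODEBOOK_SIZE) for _ in range(n_frames)]
        for _ in range(N_FINE - len(coarse))
    ]
    return coarse + extra


def generate_codes(text, seed=0):
    rng = random.Random(seed)
    semantic = text_to_semantic(text, rng)
    coarse = semantic_to_coarse(semantic, rng)
    return coarse_to_fine(coarse, rng)


codes = generate_codes("Hello world!")
print(len(codes), len(codes[0]))  # number of codebooks, frames per codebook
```

The point of the sketch is the shape of each hand-off: a flat token sequence after the first stage, two codebook rows after the second, and eight after the third.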

### Architecture

| Model                     | Parameters | Attention  | Output vocab size |
|:-------------------------:|:----------:|------------|:-----------------:|
| Text to semantic tokens   | 80/300 M   | Causal     | 10,000            |
| Semantic to coarse tokens | 80/300 M   | Causal     | 2x 1,024          |
| Coarse to fine tokens     | 80/300 M   | Non-causal | 6x 1,024          |
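
As a quick consistency check on the output sizes in the table, the sketch below reads "2x 1,024" and "6x 1,024" as two and six codebooks of 1,024 entries each. That reading is an interpretation of the table, and treating the semantic vocabulary as a single "codebook" is only a convenience for the arithmetic.

```python
CODEBOOK_SIZE = 1_024

# Output layout per stage: (number of codebooks predicted, entries per codebook).
stages = {
    "text_to_semantic": (1, 10_000),
    "semantic_to_coarse": (2, CODEBOOK_SIZE),
    "coarse_to_fine": (6, CODEBOOK_SIZE),
}


def total_outputs(stage):
    """Total number of output classes across all codebooks of a stage."""
    n_codebooks, size = stages[stage]
    return n_codebooks * size


# The coarse and fine stages together should cover all 8 EnCodec codebooks.
encodec_codebooks = stages["semantic_to_coarse"][0] + stages["coarse_to_fine"][0]

for name in stages:
    print(name, total_outputs(name))
print("EnCodec codebooks covered:", encodec_codebooks)
```

Under this reading, the two coarse codebooks plus the six predicted by the fine stage account for the 8 EnCodec codebooks listed as the final output.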

### Release date

April 2023

## Broader Implications

We anticipate that this model's text-to-audio capabilities can be used to improve accessibility tools in a variety of languages.

While we hope that this release will enable users to express their creativity and build applications that are a force for good, we acknowledge that any text-to-audio model has the potential for dual use. While it is not straightforward to clone the voices of known people with Bark, it can still be used for nefarious purposes. To further reduce the chances of unintended use of Bark, we also release a simple classifier to detect Bark-generated audio with high accuracy (see the notebooks section of the main repository).