MusicGen-Style - 1.5B

AudioCraft provides the code and models for MusicGen-Style.

MusicGen-Style is a text-and-audio-to-music model that can be conditioned on text and on audio (through a style conditioner). The style conditioner takes a music excerpt of a few seconds (between 1.5 and 4.5) as input and extracts features that the model uses to generate music in the same style. This style conditioning can be combined with a textual description.

MusicGen-Style was published in Audio Conditioning for Music Generation via Discrete Bottleneck Features by Simon Rouard, Yossi Adi, Jade Copet, Axel Roebel, Alexandre Défossez.

Example

Try out MusicGen-Style yourself!

You can run MusicGen-Style locally:

First, install the audiocraft library:

pip install git+https://github.com/facebookresearch/audiocraft.git

Make sure to have ffmpeg installed:

apt-get install ffmpeg

Run the following Python code:

import torchaudio
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

# Load the pretrained checkpoint by its repository id.
model = MusicGen.get_pretrained('facebook/musicgen-style')
model.set_generation_params(duration=8)  # generate 8 seconds.

descriptions = ['happy rock', 'energetic EDM', 'sad jazz']

# Load the audio excerpt whose style should be imitated.
melody, sr = torchaudio.load('./assets/bach.mp3')
# Generate using the style of the given audio and the provided descriptions.
# The excerpt is repeated along the batch dimension, once per description.
wav = model.generate_with_chroma(descriptions, melody[None].expand(3, -1, -1), sr)

for idx, one_wav in enumerate(wav):
    # Will save under {idx}.wav, with loudness normalization at -14 dB LUFS.
    audio_write(f'{idx}', one_wav.cpu(), model.sample_rate, strategy="loudness")
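The style conditioner works on excerpts of 1.5 to 4.5 seconds, so it can help to pick the excerpt explicitly before conditioning. The sketch below is a minimal illustration and not part of audiocraft: the helper name excerpt_bounds and its parameters are hypothetical, and it only computes sample indices, leaving the actual slicing to the caller.

```python
def excerpt_bounds(num_samples, sample_rate, excerpt_seconds=3.0, start_seconds=0.0):
    """Return (start, end) sample indices for a style excerpt.

    Hypothetical helper: enforces the 1.5 to 4.5 second range stated
    for the style conditioner and checks that the excerpt fits.
    """
    if not 1.5 <= excerpt_seconds <= 4.5:
        raise ValueError("excerpt length must be between 1.5 and 4.5 seconds")
    start = int(round(start_seconds * sample_rate))
    end = start + int(round(excerpt_seconds * sample_rate))
    if start < 0 or end > num_samples:
        raise ValueError("excerpt falls outside the audio")
    return start, end


# Example: a 10 second clip sampled at 32 kHz, excerpt of 3 seconds from the start.
start, end = excerpt_bounds(num_samples=320_000, sample_rate=32_000)
```

With a waveform tensor such as melody from the example above, the excerpt would then be melody[..., start:end].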

Model details

Organization developing the model: The FAIR team of Meta AI.

Model date: MusicGen-Style was trained between September 2023 and December 2023.

Model version: This is version 1 of the model.

Model type: MusicGen-Style consists of an EnCodec model for audio tokenization and an auto-regressive language model based on the transformer architecture for music modeling. The model size is 1.5B. It takes short audio excerpts for style extraction, as well as text, as inputs to generate music.

Paper or resources for more information: More information can be found in the paper Audio Conditioning for Music Generation via Discrete Bottleneck Features.

Citation details:

@misc{rouard2024audioconditioningmusicgeneration,
      title={Audio Conditioning for Music Generation via Discrete Bottleneck Features}, 
      author={Simon Rouard and Yossi Adi and Jade Copet and Axel Roebel and Alexandre Défossez},
      year={2024},
      eprint={2407.12563},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2407.12563}, 
}

License: Code is released under MIT, model weights are released under CC-BY-NC 4.0.

Where to send questions or comments about the model: Questions and comments about MusicGen-Style can be sent via the GitHub repository of the project, or by opening an issue.

Intended use

Primary intended use: The primary use of MusicGen-Style is research on AI-based music generation, including:

Research efforts, such as probing and better understanding the limitations of generative models to further improve the state of science
Generation of music guided by text or melody to understand current abilities of generative AI models by machine learning amateurs

Primary intended users: The primary intended users of the model are researchers in audio, machine learning and artificial intelligence, as well as amateurs seeking to better understand those models.

Out-of-scope use cases: The model should not be used on downstream applications without further risk evaluation and mitigation. The model should not be used to intentionally create or disseminate music pieces that create hostile or alienating environments for people. This includes generating music that people would foreseeably find disturbing, distressing, or offensive; or content that propagates historical or current stereotypes.

Limitations and biases

Data: The data sources used to train the model are created by music professionals and covered by legal agreements with the right holders. The model is trained on 20K hours of data. We believe that scaling the model on larger datasets can further improve its performance.

Mitigations: Vocals have been removed from the data source, first using the corresponding tags and then using a state-of-the-art music source separation method, the open-source Hybrid Transformer for Music Source Separation (HT-Demucs).

Limitations:

The model is not able to generate realistic vocals.
The model has been trained with English descriptions and will not perform as well in other languages.
The model does not perform equally well for all music styles and cultures.
The model sometimes generates the end of a song, collapsing to silence.
It is sometimes difficult to assess what types of text descriptions provide the best generations. Prompt engineering may be required to obtain satisfying results.
It can be hard to balance the text description and the style condition well. We recommend starting with cfg_coef=3 and cfg_coef_2=5. The first coefficient pushes the style conditioning and the second one pushes the text conditioning.
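To make the role of the two coefficients more concrete, here is a minimal sketch of how two guidance coefficients can be combined. It assumes one common formulation of double classifier-free guidance, which this card does not spell out, so treat the formula as an assumption. The function name double_cfg is hypothetical, and plain lists of floats stand in for the model's logits.

```python
def double_cfg(uncond, style, text_style, cfg_coef=3.0, cfg_coef_2=5.0):
    """Combine three sets of logits with two guidance coefficients.

    Assumed formulation (not confirmed by this card):
        out = uncond + cfg_coef * (style + cfg_coef_2 * (text_style - style) - uncond)

    uncond:      logits with no conditioning
    style:       logits conditioned on the style excerpt only
    text_style:  logits conditioned on both text and style
    """
    return [
        u + cfg_coef * (s + cfg_coef_2 * (ts - s) - u)
        for u, s, ts in zip(uncond, style, text_style)
    ]


# Toy values for a single logit.
guided = double_cfg([0.0], [1.0], [2.0])
```

Under this assumed formula, setting cfg_coef_2=1 reduces to ordinary classifier-free guidance on the jointly conditioned logits, while larger values move the result further along the direction contributed by the text.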

Biases: The source of data is potentially lacking diversity, and not all music cultures are equally represented in the dataset. The model may not perform equally well on the wide variety of music genres that exist. The generated samples from the model will reflect the biases from the training data. Further work on this model should include methods for balanced and just representations of cultures, for example, by scaling the training data to be both diverse and inclusive.

Risks and harms: Biases and limitations of the model may lead to the generation of samples that may be considered biased, inappropriate or offensive. We believe that providing the code to reproduce the research and train new models will make it possible to broaden the application to new and more representative data.

Use cases: Users must be aware of the biases, limitations and risks of the model. MusicGen is a model developed for artificial intelligence research on controllable music generation. As such, it should not be used for downstream applications without further investigation and mitigation of risks.
