---
metrics:
- accuracy
pipeline_tag: text-generation
---

# 🎼 ChatMusician: Understanding and Generating Music Intrinsically with LLM

[**🌐 DemoPage**](https://ezmonyi.github.io/ChatMusician/) | [**🤗 Dataset**](https://huggingface.co/datasets/m-a-p/MusicPile) | [**🤗 Benchmark**](https://huggingface.co/datasets/m-a-p/MusicTheoryBench) | [**📖 arXiv**](http://arxiv.org/abs/2402.16153) | [💻 **Code**](https://github.com/hf-lin/ChatMusician) | [**🤖 Chat Model**](https://huggingface.co/m-a-p/ChatMusician)

## 🔔News

- **🔥[2024-2-28]: The release of ChatMusician's demo, code, model, data, and benchmark. 🌟**
- [2024-2-28]: ChatMusician uses [`symusic`](https://github.com/Yikai-Liao/symusic), a fast symbolic music processing and rendering library developed by Yikai-Liao, lzqlzzq, and Natooz.
- [2023-11-30]: Check out another awesome project, [MMMU](https://huggingface.co/datasets/MMMU/MMMU/), which includes multimodal music reasoning.

## Introduction

While Large Language Models (LLMs) demonstrate impressive capabilities in text generation, we find that their ability has not yet been generalized to music, humanity's creative language. We introduce **ChatMusician**, an open-source LLM that integrates intrinsic musical abilities. It is based on continual pre-training and finetuning LLaMA2 on a text-compatible music representation, ABC notation, and the music is treated as a second language. ChatMusician can understand and generate music with a pure text tokenizer, without any external multi-modal neural structures or tokenizers. Interestingly, endowing musical abilities does not harm language abilities, even achieving a slightly higher MMLU score. Our model is capable of composing well-structured, full-length music, conditioned on texts, chords, melodies, motifs, musical forms, etc., surpassing the GPT-4 baseline. On our meticulously curated college-level music understanding benchmark, MusicTheoryBench, ChatMusician surpasses LLaMA2 and GPT-3.5 in the zero-shot setting by a noticeable margin. Our work reveals that LLMs can be an excellent compressor for music, but there remains significant territory to be conquered. Code, data, model, and benchmark are open-sourced.

<!-- <audio controls src="https://cdn-uploads.huggingface.co/production/uploads/5fd6f670053c8345eddc1b68/8NSONUjIF7KGUCfwzPCd9.mpga"></audio> -->

[![Demo Video](chatmusician_demo.png)](https://youtu.be/zt3l49K55Io)

<!-- [![ChatMusician Introduction](http://img.youtube.com/vi/zt3l49K55Io/0.jpg)](http://www.youtube.com/watch?v=zt3l49K55Io "ChatMusician Introduction") -->
<!-- <iframe width="787" height="528" src="https://www.youtube.com/embed/zt3l49K55Io" title="ChatMusician: Fostering Intrinsic Musical Abilities Into LLM" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe> -->

## Usage

You can use the models through Huggingface's Transformers library. Check our GitHub repo for more advanced usage: [https://github.com/hf-lin/ChatMusician](https://github.com/hf-lin/ChatMusician)

## CLI demo

```python
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig
import torch
import re
from string import Template

prompt_template = Template("Human: ${inst} </s> Assistant: ")

tokenizer = AutoTokenizer.from_pretrained("m-a-p/ChatMusician", trust_remote_code=True)
# You may replace "m-a-p/ChatMusician-Base" with "m-a-p/ChatMusician",
# since the base model may not follow instructions.
model = AutoModelForCausalLM.from_pretrained(
    "m-a-p/ChatMusician-Base", torch_dtype=torch.float16, device_map="cuda", resume_download=True
).eval()

generation_config = GenerationConfig(
    temperature=0.2,
    top_k=40,
    top_p=0.9,
    do_sample=True,
    num_beams=1,
    repetition_penalty=1.1,
    min_new_tokens=10,
    max_new_tokens=1536,
)

instruction = """Develop a musical piece using the given chord progression.
'Dm', 'C', 'Dm', 'Dm', 'C', 'Dm', 'C', 'Dm'
"""

prompt = prompt_template.safe_substitute({"inst": instruction})
inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False)
response = model.generate(
    input_ids=inputs["input_ids"].to(model.device),
    attention_mask=inputs["attention_mask"].to(model.device),
    eos_token_id=tokenizer.eos_token_id,
    generation_config=generation_config,
)
# Decode only the newly generated tokens, skipping the prompt.
response = tokenizer.decode(response[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)

# To render the generated ABC notation as audio, install symusic: pip install symusic
from symusic import Score, Synthesizer, BuiltInSF3, dump_wav

# Extract the first ABC tune (a block starting with an "X:<number>" header line).
abc_pattern = r'(X:\d+\n(?:[^\n]*\n)+)'
abc_notation = re.findall(abc_pattern, response + '\n')[0]
s = Score.from_abc(abc_notation)
audio = Synthesizer().render(s, stereo=True)
dump_wav('cm_music_piece.wav', audio, sample_rate=44100, use_int16=True)
```

## Chat demo

ChatMusician supports a Gradio web demo with multi-turn dialogue; please visit our [github](https://github.com/hf-lin/ChatMusician) for more details. The web demo also supports rendering ABC scores into images.
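
For reference, here is a minimal sketch of a multi-turn chat wrapper built with `gradio`. It is not the official demo (that lives in the GitHub repo and also renders ABC scores to images), and the multi-turn prompt layout below is an assumption extrapolated from the single-turn template `Human: ${inst} </s> Assistant: `.

```python
# Hypothetical minimal chat UI -- see the GitHub repo for the official demo.
import torch
import gradio as gr
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("m-a-p/ChatMusician", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "m-a-p/ChatMusician", torch_dtype=torch.float16, device_map="cuda"
).eval()

def chat(message, history):
    # Re-serialize the dialogue history into the Human/Assistant template
    # (assumed multi-turn layout; the official demo may format turns differently).
    prompt = ""
    for user_turn, assistant_turn in history:
        prompt += f"Human: {user_turn} </s> Assistant: {assistant_turn} </s> "
    prompt += f"Human: {message} </s> Assistant: "
    inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)
    output = model.generate(
        **inputs, max_new_tokens=1536, do_sample=True, temperature=0.2, top_p=0.9,
        eos_token_id=tokenizer.eos_token_id,
    )
    return tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

gr.ChatInterface(chat).launch()
```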

## Limitations

- The model currently only supports strict-format, closed-ended instructions for the music tasks. With more funding, we plan to create more diverse multi-turn music-instruction chat data for better generalization.
- The model suffers from hallucinations and shouldn't be used for music education. This could be improved by feeding it more music textbooks, blogs, etc., and RLHF may help, too.
- A large portion of the training data is in the style of Irish music. If possible, the community should develop a converter between performance MIDI and ABC scores, so that more established MIDI datasets can be included.
- The MusicTheoryBench results reported in the paper are obtained in perplexity mode. Direct generation may perform worse.
- We observe that, with the current version of the training data, ChatMusician shows weak in-context-learning and chain-of-thought abilities. The community should work on improving the music data quality.

## Example Stable Prompts

We provide some prompts that have been tested to be stable. For more prompts, please check 🤗 [MusicPile](https://huggingface.co/datasets/m-a-p/MusicPile).

### Function: Chord Conditioned Music Generation

```
Develop a musical piece using the given chord progression.
'Dm', 'C', 'Dm', 'Dm', 'C', 'Dm', 'C', 'Dm'
```

### Function: Text2music

```
Develop a tune influenced by Bach's compositions.
```

```
Using ABC notation, recreate the given text as a musical score.
Meter C
Notes The parts are commonly interchanged.
Transcription 1997 by John Chambers
Key D
Note Length 1/8
Rhythm reel
```

### Function: Melody Harmonization

```
Construct smooth-flowing chord progressions for the supplied music.

|: BA | G2 g2"^(C)" edeg | B2 BA"^(D7)" BcBA | G2 g2 edeg | dBAG A2 BA |
G2 g2"^(C)" edeg | B2 BA B2 d2 | e2 ef e2 (3def | gedB A2 :: BA | G2 BG dGBe |
dBBA"^(D7)" B3 A | G2 BG dGBe | dBAG A4 | G2 BG dGBe | dBBA B3 d |
e2 ef e2 (3def | gedB A2 :|
```

```
Develop a series of chord pairings that amplify the harmonious elements in the given music piece.

E |: EAA ABc | Bee e2 d | cBA ABc | BEE E2 D | EAA ABc | Bee e2 d |
cBA ^GAB |1 A2 A A2 E :|2 A2 A GAB || c3 cdc | Bgg g2 ^g | aed cBA |
^GAB E^F^G | A^GA BAB | cde fed | cBA ^GAB |1 A2 A GAB :|2 \n A3 A2 ||
```

### Function: Musical Form Conditioned Music Generation

```
Develop a composition by incorporating elements from the given melodic structure.

Ternary, Sectional: Verse/Chorus/Bridge
```

### Function: Motif and Form Conditioned Music Generation

```
Create music by incorporating the assigned motif into the predetermined musical arrangement.

Musical Form Input: Only One Section

ABC Notation Music Input:
X:1
L:1/8
M:9/8
K:Emin
vB2 E E2 F G2 A
```

### Function: Music Understanding

```
Investigate the aspects of this musical work and convey its structural organization using suitable musical words.

X:1
L:1/8
M:2/2
K:G
G2 dG BGdG | G2 dc BAGB | A2 eA cAeA | A2 ed cAFA |
G2 dG BGdG | G2 dc BAGB | ABcd efge |1 aged cAFA :|2
aged ^cdef |: g3 f g2 ef | gedc BA G2 | eaag agea |
aged ^cdef | g3 f g2 ef |gedc BAGB | ABcd efge |1
aged ^cdef :|2 aged cAFA |:"^variations:" G2 BG dGBA |
G2 dG BAGB | A2 cA eAcA | A2 ed cAFA | G2 BG dGBA |
G2 dc BAGB | ABcd efge |1 aged cAFA :|2 aged ^cdef |:
g2 af g2 ef | gedc BAGB | Aaag ageg | aged ^cdef |
gbaf g2 ef | gedc BAGB | ABcd efge |1
aged ^cdef :|2 aged cAFA ||
```

```
Analyze the musical work and pinpoint the consistent melodic element in every section.

X:1
L:1/8
M:4/4
K:G
ge | d2 G2 cBAG | d2 G2 cBAG | e2 A2 ABcd | edcB A2 Bc |
d2 cB g2 fe | edcB cBAG | BAGE DEGA | B2 G2 G2 :: ga |
b2 gb a2 fa | g2 eg edcB | e2 A2 ABcd | edcB A2 ga |
b2 gb a2 fa | g2 eg edcB | cBAG DEGA | B2 G2 G2 :|
```

## Training Data

ChatMusician is pretrained on 🤗 [MusicPile](https://huggingface.co/datasets/m-a-p/MusicPile), the first pretraining corpus for **developing musical abilities** in large language models, and supervised-finetuned on 1.1M samples from MusicPile (a 2:1 ratio between music scores and music knowledge & music summary data). Check out the dataset card and our [paper](http://arxiv.org/abs/2402.16153) for more details.
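
If you just want to peek at the data, the `datasets` library is a quick way in. This is a minimal sketch; the exact splits and field names are whatever the dataset card specifies:

```python
# Stream a few MusicPile records without downloading the whole corpus.
from datasets import load_dataset

ds = load_dataset("m-a-p/MusicPile", split="train", streaming=True)
for _, example in zip(range(3), ds):
    print(example)
```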

## Training Procedure

We initialized an fp16-precision ChatMusician-Base from the LLaMA2-7B-Base weights and applied a continual pre-training plus fine-tuning pipeline. LoRA adapters were integrated into the attention and MLP layers, with additional training on embeddings and all linear layers. The maximum sequence length was 2048. We utilized 16 80GB-A800 GPUs for one epoch of pre-training. DeepSpeed was employed for memory efficiency, and the AdamW optimizer was used with a 1e-4 learning rate and a 5% warmup cosine scheduler. Gradient clipping was set at 1.0. The LoRA dimension, alpha, and dropout were set to 64, 16, and 0.1, with a batch size of 8.
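
For concreteness, here is how those hyper-parameters might be expressed with `peft` and `transformers`. This is a sketch, not the authors' training script: the target-module names and trainer wiring are assumptions, and the actual run used DeepSpeed across 16 A800 GPUs, which is not reproduced here.

```python
# Sketch of the stated LoRA/optimizer settings (assumed module names for LLaMA2).
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, TrainingArguments

lora_config = LoraConfig(
    r=64,                 # LoRA dimension
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=[      # attention + MLP projections in LLaMA2
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    modules_to_save=["embed_tokens", "lm_head"],  # embeddings trained as well
    task_type="CAUSAL_LM",
)

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
model = get_peft_model(model, lora_config)

training_args = TrainingArguments(
    output_dir="chatmusician-base",
    num_train_epochs=1,
    per_device_train_batch_size=8,
    learning_rate=1e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,    # 5% warmup
    max_grad_norm=1.0,    # gradient clipping
    fp16=True,
)
```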
## Evaluation

1. Music understanding abilities are evaluated on the [MusicTheoryBench](https://huggingface.co/datasets/m-a-p/MusicTheoryBench). The following figure shows zero-shot accuracy on MusicTheoryBench for GPT-3.5, GPT-4, LLaMA2-7B-Base, ChatMusician-Base, and ChatMusician. The blue bars represent performance on the music knowledge metric and the red bars on the music reasoning metric; the dashed line corresponds to a random baseline with a score of 25%. (A sketch of the perplexity-based answer scoring follows this list.) ![MusicTheoryBench_result](./MusicTheoryBench_result_plt.png)
2. General language abilities of ChatMusician are evaluated on the [Massive Multitask Language Understanding (MMLU) dataset](https://huggingface.co/datasets/lukaemon/mmlu).
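
The "perplexity mode" mentioned in the Limitations can be sketched as follows: score each candidate answer by the language-model loss of the prompt plus that answer, and pick the lowest. This is an illustrative reconstruction, not the paper's evaluation harness, and the prompt format is assumed from the template in the CLI demo above.

```python
# Hypothetical multiple-choice scoring by LM loss (lower loss = more likely answer).
import torch

@torch.no_grad()
def pick_answer(model, tokenizer, question, choices):
    losses = []
    for choice in choices:
        text = f"Human: {question} </s> Assistant: {choice}"
        ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
        # Note: for simplicity the question tokens are scored too; a stricter
        # version would mask them out of the loss.
        losses.append(model(ids, labels=ids).loss.item())
    return min(range(len(choices)), key=losses.__getitem__)
```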

## Citation

If you find our work helpful, feel free to give us a cite.