---
license: apache-2.0
language:
- en
pipeline_tag: text-generation
tags:
- music
- art
---

<div align="center">
    <img src="Yi_logo.svg" width="150px" style="display: inline-block;">
    <img src="m-a-p.png" width="150px" style="display: inline-block;">
</div>

## SMuPT: Symbolic Music Generative Pre-trained Transformer

SMuPT is a series of pre-trained models for symbolic music generation. The models were trained on a large-scale dataset of symbolic music, including millions of monophonic and polyphonic pieces from different genres and styles. They use the Llama 2 architecture and can be applied to downstream music generation tasks such as melody generation, accompaniment generation, and multi-track music generation.

- 09/01/2024: a series of pre-trained SMuPT models is released, with parameter counts ranging from 110M to 1.3B.

## Model architecture

The architecture details of the SMuPT-v0 models are listed below:

| Name | Parameters | Training Data (Music Pieces) | Seq Length | Hidden Size | Layers | Heads |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: |
| SMuPT-v0-8192-110M | 110M | 7M x 5.8 epochs | 8192 | 768 | 12 | 12 |
| SMuPT-v0-8192-345M | 345M | 7M x 4 epochs | 8192 | 1024 | 24 | 16 |
| SMuPT-v0-8192-770M | 770M | 7M x 3 epochs | 8192 | 1280 | 36 | 20 |
| SMuPT-v0-8192-1.3B | 1.3B | 7M x 2.2 epochs | 8192 | 1536 | 48 | 24 |
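
As a quick sanity check on the table, the per-head dimension can be derived by dividing the hidden size by the number of heads. The sketch below does this with plain shell arithmetic. `head_dim` is a hypothetical helper name used only for illustration, and the numbers are copied from the table. Each configuration works out to 64.

```shell
#!/bin/bash
# head_dim HIDDEN HEADS: print the per-head dimension (integer division).
# Hypothetical helper, not part of Megatron-LM or the SMuPT release.
head_dim() {
    echo $(( $1 / $2 ))
}

# Name, hidden size, and head count copied from the table above.
for cfg in "110M 768 12" "345M 1024 16" "770M 1280 20" "1.3B 1536 24"; do
    read -r name hidden heads <<< "${cfg}"
    echo "SMuPT-v0-8192-${name}: head_dim=$(head_dim "${hidden}" "${heads}")"
done
```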

## Model Usage

There are several ways to use our pre-trained SMuPT models. Here we describe usage based on [Megatron-LM](https://github.com/NVIDIA/Megatron-LM/tree/main). Hugging Face format will be supported soon.

Before starting, make sure you have set up the relevant environment and codebase.
 
```shell
# pull Megatron-LM codebase
mkdir -p /path/to/workspace && cd /path/to/workspace
git clone https://github.com/NVIDIA/Megatron-LM.git

# download the pre-trained SMuPT model checkpoint and vocab files from the Hugging Face page
# (URLs are quoted so the shell does not treat "?" as a glob character)
mkdir -p /models/SMuPT_v0_8192_1.3B && cd /models/SMuPT_v0_8192_1.3B
wget -O model_optim_rng.pt "https://huggingface.co/m-a-p/SMuPT_v0_8192_1.3B/resolve/main/model_optim_rng.pt?download=true"
wget -O newline.vocab "https://huggingface.co/m-a-p/SMuPT_v0_8192_1.3B/resolve/main/newline.vocab?download=true"
wget -O newline.txt "https://huggingface.co/m-a-p/SMuPT_v0_8192_1.3B/resolve/main/newline.txt?download=true"
```
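
Interrupted downloads can leave missing or empty files behind, so it may be worth checking the download directory before starting the server. The following is a minimal sketch that assumes the three file names used in the `wget` commands above. `check_model_files` is a hypothetical helper, not part of Megatron-LM.

```shell
#!/bin/bash
# check_model_files DIR: return 0 only if every expected file exists in DIR
# and is non-empty. The file names follow the wget commands above.
check_model_files() {
    local dir="$1" f missing=0
    for f in model_optim_rng.pt newline.vocab newline.txt; do
        if [ ! -s "${dir}/${f}" ]; then
            echo "missing or empty: ${dir}/${f}" >&2
            missing=1
        fi
    done
    return "${missing}"
}
```

For example, `check_model_files /models/SMuPT_v0_8192_1.3B && echo "all files present"` prints the message only when all three files are in place. The check does not verify file integrity, only presence and non-zero size.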

We recommend using the latest version of [NGC's PyTorch container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch) for SMuPT inference. See the [Megatron-LM](https://github.com/NVIDIA/Megatron-LM/tree/main) repository for more details.

```shell
# pull the NGC PyTorch container (23.11 in this example), mount the current directory as /workspace and enter the container
docker run --gpus all -it --name megatron --shm-size=16g -v "$PWD":/workspace -p 5000:5000 nvcr.io/nvidia/pytorch:23.11-py3 /bin/bash
```

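Once inside the container, you can start a REST server for inference. The example script calls `tools/run_text_generation_server.py` by relative path, so run it from the root of the Megatron-LM repository.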

<details>
    <summary>Click to expand the example script</summary>

    #!/bin/bash
    # This example will start serving the 1.3B model.
    export CUDA_DEVICE_MAX_CONNECTIONS=1

    DISTRIBUTED_ARGS="--nproc_per_node 1 \
                    --nnodes 1 \
                    --node_rank 0 \
                    --master_addr localhost \
                    --master_port 6000"

    CHECKPOINT=/path/to/model/checkpoint/folder
    VOCAB_FILE=/path/to/vocab/file
    MERGE_FILE=/path/to/merge/file

    MODEL_SIZE="1.3B"
    if   [[ ${MODEL_SIZE} == "110M" ]];   then HIDDEN_SIZE=768;  NUM_HEAD=12; NUM_QUERY_GROUP=12; NUM_LAYERS=12; FFN_HIDDEN_SIZE=3072; NORM_EPS=1e-5;
    elif [[ ${MODEL_SIZE} == "345M" ]];   then HIDDEN_SIZE=1024;  NUM_HEAD=16; NUM_QUERY_GROUP=16; NUM_LAYERS=24; FFN_HIDDEN_SIZE=4096; NORM_EPS=1e-5;
    elif [[ ${MODEL_SIZE} == "770M" ]];   then HIDDEN_SIZE=1280;  NUM_HEAD=20; NUM_QUERY_GROUP=20; NUM_LAYERS=36; FFN_HIDDEN_SIZE=5120; NORM_EPS=1e-5;
    elif [[ ${MODEL_SIZE} == "1.3B" ]];   then HIDDEN_SIZE=1536;  NUM_HEAD=24; NUM_QUERY_GROUP=24; NUM_LAYERS=48; FFN_HIDDEN_SIZE=6144; NORM_EPS=1e-5;
    else echo "invalid MODEL_SIZE: ${MODEL_SIZE}"; exit 1
    fi
    MAX_SEQ_LEN=8192
    MAX_POSITION_EMBEDDINGS=8192

    pip install flask-restful

    torchrun $DISTRIBUTED_ARGS tools/run_text_generation_server.py   \
        --tensor-model-parallel-size 1  \
        --pipeline-model-parallel-size 1  \
        --num-layers ${NUM_LAYERS}  \
        --hidden-size ${HIDDEN_SIZE}  \
        --ffn-hidden-size ${FFN_HIDDEN_SIZE} \
        --load ${CHECKPOINT}  \
        --group-query-attention \
        --num-query-groups ${NUM_QUERY_GROUP} \
        --position-embedding-type rope \
        --num-attention-heads ${NUM_HEAD}  \
        --max-position-embeddings ${MAX_POSITION_EMBEDDINGS}  \
        --tokenizer-type GPT2BPETokenizer  \
        --normalization RMSNorm \
        --norm-epsilon ${NORM_EPS} \
        --make-vocab-size-divisible-by 1 \
        --swiglu \
        --use-flash-attn \
        --bf16  \
        --micro-batch-size 1  \
        --disable-bias-linear \
        --no-bias-gelu-fusion \
        --untie-embeddings-and-output-weights \
        --seq-length ${MAX_SEQ_LEN}  \
        --vocab-file $VOCAB_FILE  \
        --merge-file $MERGE_FILE  \
        --attention-dropout 0.0 \
        --hidden-dropout 0.0 \
        --weight-decay 1e-1 \
        --clip-grad 1.0 \
        --adam-beta1 0.9 \
        --adam-beta2 0.95 \
        --adam-eps 1e-8 \
        --seed 42

</details>


Use curl to query the server directly. Note that the newline character `\n` is represented by `<n>` in the vocabulary, so newlines must be replaced with `<n>` in the prompt, and `<n>` must be converted back to newlines in the generated text.

```shell
curl 'http://localhost:5000/api' -X 'PUT' -H 'Content-Type: application/json; charset=UTF-8'  -d '{"prompts":["X:1<n>T:Music21 Fragment<n>T:Music21 Fragment<n>T:Music21<n>C:Music21<n>%%score 1 2 3 4<n>L:1/4<n>M:4/4<n>K:C<n>V:1 treble nm=\"Piano\" snm=\"Pno\"<n>%%MIDI program 0<n>%%MIDI control 7 100<n>%%MIDI control 10 64<n>V:2 treble nm=\"Piano\" snm=\"Pno\"<n>%%MIDI channel 3<n>%%MIDI program 0<n>%%MIDI control 7 100<n>%%MIDI control 10 64<n>V:3 bass nm=\"Piano\" snm=\"Pno\"<n>%%MIDI channel 4<n>%%MIDI program 0<n>%%MIDI control 7 100<n>%%MIDI control 10 64<n>V:4 bass nm=\"Piano\" snm=\"Pno\"<n>%%MIDI channel 5<n>%%MIDI program 0<n>%%MIDI control 7 100<n>%%MIDI control 10 64<n>V:1<n> z3 c | B A G F/E/ | !fermata!E ^F ^G A | B e d c | !fermata!B2 G A | B c d e | %6<n> !fermata!c2"], "tokens_to_generate":4096}'
```
Output:
```text
X:1
T:Music21 Fragment
T:Music21 Fragment
T:Music21
C:Music21
%%score 1 2 3 4
L:1/4
M:4/4
K:C
V:1 treble nm="Piano" snm="Pno"
%%MIDI program 0
%%MIDI control 7 100
%%MIDI control 10 64
V:2 treble nm="Piano" snm="Pno"
%%MIDI channel 3
%%MIDI program 0
%%MIDI control 7 100
%%MIDI control 10 64
V:3 bass nm="Piano" snm="Pno"
%%MIDI channel 4
%%MIDI program 0
%%MIDI control 7 100
%%MIDI control 10 64
V:4 bass nm="Piano" snm="Pno"
%%MIDI channel 5
%%MIDI program 0
%%MIDI control 7 100
%%MIDI control 10 64
V:1
 z3 c | B A G F/E/ | !fermata!E ^F ^G A | B e d c | !fermata!B2 G A | B c d e | %6
 !fermata!c2 e | a g/f/ e d | !fermata!e2- e e | a e a g/f/ | e2 !fermata!d g | d d g e |1
 e2 !fermata!e G :|3 e2 !fermata!e2- || e z z2 | z4 |]
V:2
 z3 G | G E C B, | !fermata!C D E E | G G ^F A | !fermata!^G2 E E | G G G G | sG !fermata!B2 G |
 c B B A | !fermata!^G2- G G | A B c d | c2 !fermata!B _B | A A _A A |1 G2 A !fermata!G :|3
 G2 !fermata!A2- || A z z2 | z4 |]
V:3
 z3 C,/D,/ | E, E, E, ^G,, | !fermata!A,, B,, B,, C,/D,/ | E, C, D, ^F, | !fermata!E,2 C, C,/D,/ |
 E, C, B,, C, | !fermata!G, !fermata!G,,2 C,/D,/ | E, E, F, D, | !fermata!E,2- E, D, |
 C, B,, A,, B,, | C,3/2 D,/ !fermata!G,, E, | F, F, _E, B,, |1 _B,, G,, ^F,, !fermata!G,, :|3
 _B,, G,, !fermata!^F,,2- || F,, z z2 | z4 |]
V:4
 z3 C,/B,,/ | A,, E,, A,, E,, | !fermata!A,, G,, E,, A,, | G,, C, B,, D,, |
 !fermata!E,,2 E,,/F,,/ A,, | G,, C, G,, C, | !fermata!B,, !fermata!G,,2 C,/B,,/ | A,, E,, F,, D,, |
 !fermata!E,,2- E,,/^F,,/ ^G,, | A,, ^G,, A,, B,, | C,3/2 D,/ !fermata!G,, E,, |
 F,,/G,,/ A,, B,, E,, |1 G,, C,, ^F,, !fermata!G,, :|3 G,, C,, !fermata!^F,,2- || F,, z z2 | z4 |]
```
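
The newline substitution described above can be scripted instead of done by hand. The sketch below defines two hypothetical helpers: `abc_to_prompt` escapes backslashes and double quotes for use inside a JSON string and joins lines with `<n>`, and `prompt_to_abc` turns `<n>` back into real newlines. It assumes the generated text has already been extracted from the server response, and it only handles the JSON escaping needed for plain ABC text.

```shell
#!/bin/bash
# abc_to_prompt: read ABC text on stdin and print a single-line prompt.
# Step 1 (sed): escape backslashes, then double quotes, for a JSON string.
# Step 2 (awk): join the lines with the <n> token, with no trailing newline.
abc_to_prompt() {
    sed -e 's/\\/\\\\/g' -e 's/"/\\"/g' \
        | awk 'NR > 1 { printf "<n>" } { printf "%s", $0 }'
}

# prompt_to_abc: read generated text on stdin and restore real newlines.
prompt_to_abc() {
    awk '{ gsub(/<n>/, "\n"); print }'
}
```

For example, `abc_to_prompt < tune.abc` prints a string that can be placed inside the `"prompts"` array of the request body, where `tune.abc` is a placeholder file name.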

Once you convert the generated ABC notation into audio, you will hear the following music.

<audio controls src="https://cdn-uploads.huggingface.co/production/uploads/640701cb4dc5f2846c91d4eb/cDaJ19RPkVZ_mSdzxAI-D.mpga"></audio>