---
tags:
- music-generation
- transformer
- pytorch
- audio
- music
- piano
license: mit
---
# Compose & Embellish: Piano Performance Generation Pipeline
Trained model weights and training datasets for the paper:
  * Shih-Lun Wu and Yi-Hsuan Yang  
    "[Compose & Embellish: Well-Structured Piano Performance Generation via A Two-Stage Approach](https://arxiv.org/abs/2209.08212)."  
    _Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP)_, 2023

**Note:** Materials here should be used in conjunction with our [model implementation GitHub repo](https://github.com/slSeanWU/Compose_and_Embellish).
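
For example, the checkpoints can be fetched programmatically with `huggingface_hub`; the `repo_id` below is a placeholder, so substitute this repository's actual path on the Hub:

```python
# Hypothetical download sketch -- repo_id is a placeholder, not the real repo path.
from huggingface_hub import hf_hub_download

ckpt_path = hf_hub_download(
    repo_id="<user>/<this-repo>",                          # placeholder repo id
    filename="embellish_model_gpt2_pop1k7_loss0.398.bin",  # one of the files listed below
)
print(ckpt_path)  # local cache path; pass it to the generation scripts in the GitHub repo
```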

## Model characteristics
### Stage 1: "Compose" model
Generates **melody and chord progression** from scratch.

  - Model backbone: 12-layer Transformer w/ relative positional encoding
  - Num trainable params: 41.3M
  - Token vocabulary: [Revamped MIDI-derived events](https://arxiv.org/abs/2002.00212) (**REMI**) w/ slight modifications
  - Pretraining dataset: subset of [Lakh MIDI full](https://colinraffel.com/projects/lmd/) (**LMD-full**), 14934 songs
    - melody extraction (and data filtering) done by **matching lyrics to tracks**: https://github.com/gulnazaki/lyrics-melody/blob/main/pre-processing/create_dataset.py
    - structural segmentation done with **A\* search**: https://github.com/Dsqvival/hierarchical-structure-analysis
  - Finetuning dataset: subset of [AILabs.tw Pop1K7](https://github.com/YatingMusic/compound-word-transformer) (**Pop1K7**), 1591 songs
    - melody extraction done with **skyline algorithm** (see the sketch after this list): https://github.com/wazenmai/MIDI-BERT/blob/CP/melody_extraction/skyline/analyzer.py
    - structural segmentation done in the same way as pretraining dataset
  - Training sequence length: 2400
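
For reference, here is a minimal Python sketch of the core idea behind the skyline heuristic used for Pop1K7 melody extraction (keep only the highest-pitched note at each onset). The linked `analyzer.py` applies additional rules, and the note representation below is purely illustrative.

```python
# Minimal sketch of the skyline heuristic (illustrative only; not the exact analyzer.py logic).
from typing import List, Tuple

# A note is (onset_tick, pitch, duration_tick) -- a hypothetical representation.
Note = Tuple[int, int, int]

def skyline_melody(notes: List[Note]) -> List[Note]:
    """Keep, for each onset time, only the highest-pitched note."""
    by_onset = {}
    for onset, pitch, dur in notes:
        best = by_onset.get(onset)
        if best is None or pitch > best[1]:
            by_onset[onset] = (onset, pitch, dur)
    return [by_onset[t] for t in sorted(by_onset)]

if __name__ == "__main__":
    chordal = [(0, 60, 480), (0, 64, 480), (0, 67, 480), (480, 72, 480)]
    print(skyline_melody(chordal))  # -> [(0, 67, 480), (480, 72, 480)]
```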
### Stage 2: "Embellish" model
Generates **accompaniment, timing and dynamics** conditioned on Stage 1 outputs.
  - `embellish_model_gpt2_pop1k7_loss0.398.bin`
    - Model backbone: 12-layer **GPT-2 Transformer** ([implementation](https://huggingface.co/docs/transformers/en/model_doc/gpt2))
    - Num trainable params: 38.2M
  - `embellish_model_pop1k7_loss0.399.bin` (requires `fast-transformers` package, which is outdated as of Jul. 2024)
    - Model backbone: 12-layer **Performer** ([paper](https://arxiv.org/abs/2009.14794), [implementation](https://github.com/idiap/fast-transformers))
    - Num trainable params: 38.2M
  - Token vocabulary: [Revamped MIDI-derived events](https://arxiv.org/abs/2002.00212) (**REMI**) w/ slight modifications
  - Training dataset: [AILabs.tw Pop1K7](https://github.com/YatingMusic/compound-word-transformer) (**Pop1K7**), 1747 songs
  - Training sequence length: 3072
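
Both Stage 2 checkpoints are released as PyTorch `.bin` weight files. The snippet below is a minimal sketch for inspecting one of them, assuming it is a plain `state_dict` saved with `torch.save`; building and running the full model requires the scripts in the GitHub repo linked above.

```python
# Sketch for inspecting a Stage 2 checkpoint (assumes a plain PyTorch state dict).
import torch

state_dict = torch.load("embellish_model_gpt2_pop1k7_loss0.398.bin", map_location="cpu")
print(f"{len(state_dict)} tensors")
total_params = sum(t.numel() for t in state_dict.values())
print(f"~{total_params / 1e6:.1f}M parameters")  # should be close to the 38.2M reported above
```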

## BibTeX
If you find the materials useful, please consider citing our work:
```
@inproceedings{wu2023compembellish,
  title={{Compose \& Embellish}: Well-Structured Piano Performance Generation via A Two-Stage Approach},
  author={Wu, Shih-Lun and Yang, Yi-Hsuan},
  booktitle={Proc. Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2023},
  url={https://arxiv.org/pdf/2209.08212.pdf}
}
```