p-alonso committed on
Commit
3385fe0
1 Parent(s): e788782

Update README.md

  1. README.md +173 -0
README.md CHANGED
@@ -1,3 +1,176 @@
1
  ---
2
  license: cc-by-nc-sa-4.0
3
+ metrics:
4
+ - roc_auc
5
+ library_name: transformers
6
+ pipeline_tag: audio-classification
7
  ---
8
+ # Model Card for discogs-maest-30s-pw-73e-ts
9
+
10
+ ## Model Details
11
+
12
+ MAEST is a family of Transformer models based on [PaSST](https://github.com/kkoutini/PaSST) and
13
+ focused on music analysis applications.
14
+ The MAEST models are also available for inference only as part of the
15
+ [Essentia](https://essentia.upf.edu/models.html#MAEST) library, and in the [official repository](https://github.com/palonso/MAEST).
16
+
17
+
18
+ ### Model Description
19
+
20
+ <!-- Provide a longer summary of what this model is. -->
21
+
22
+ - **Developed by:** Pablo Alonso
23
+ - **Shared by:** Pablo Alonso
24
+ - **Model type:** Transformer
25
+ - **License:** cc-by-nc-sa-4.0
26
+ - **Finetuned from model:** [PaSST](https://github.com/kkoutini/PaSST)
27
+
28
+ ### Model Sources
29
+
30
+ <!-- Provide the basic links for the model. -->
31
+
32
+ - **Repository:** [MAEST](https://github.com/palonso/MAEST)
33
+ - **Paper:** [Efficient Supervised Training of Audio Transformers for Music Representation Learning](http://hdl.handle.net/10230/58023)
34
+
35
+ ## Uses
36
+
37
+ <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
38
+
39
+ MAEST is a music audio representation model pre-trained on the task of music style classification.
40
+ According to the evaluation in the original paper, it achieves good performance in several downstream music analysis tasks.
41
+
42
+ ### Direct Use
43
+
44
+ <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
45
+
46
+ The MAEST models can make predictions for a taxonomy of 400 music styles derived from the public metadata of [Discogs](https://www.discogs.com/).
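
The labels returned by the pipeline (for example `Electronic---Noise`) appear to follow a `Genre---Style` pattern. As a minimal sketch under that assumption, style-level predictions can be grouped by their parent genre. The helper name `top_score_per_genre` is hypothetical, and the sample scores are copied from the example output in this card.

```python
from collections import defaultdict

# Sample predictions copied from the example output in this card.
predictions = [
    {"score": 0.6158794164657593, "label": "Electronic---Noise"},
    {"score": 0.08825448155403137, "label": "Electronic---Experimental"},
    {"score": 0.08772594481706619, "label": "Electronic---Abstract"},
    {"score": 0.03644488751888275, "label": "Rock---Noise"},
    {"score": 0.03272806480526924, "label": "Electronic---Musique Concrète"},
]


def top_score_per_genre(preds, sep="---"):
    """Return the highest style score seen for each parent genre.

    Assumes every label has the form "<genre><sep><style>".
    """
    best = defaultdict(float)
    for pred in preds:
        # Everything before the first separator is treated as the genre.
        genre, _, _style = pred["label"].partition(sep)
        best[genre] = max(best[genre], pred["score"])
    return dict(best)


genre_scores = top_score_per_genre(predictions)
```

Taking the maximum avoids assuming that the style scores can be summed; whether another aggregation is more appropriate depends on how the scores are produced.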
47
+
48
+ ### Downstream Use
49
+
50
+ <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
51
+
52
+ The MAEST models have shown good performance in downstream applications related to music genre recognition, music emotion recognition, and instrument detection.
53
+ Specifically, the original paper reports that the best performance is obtained from representations extracted from intermediate layers of the model.
54
+
55
+ ### Out-of-Scope Use
56
+
57
+ <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
58
+
59
+ The model has not been evaluated outside the context of music understanding applications, so we are unaware of its performance outside its intended domain.
60
+ Since the model is intended to be used within the `audio-classification` pipeline, it is important to mention that MAEST is **NOT** a general-purpose audio classification model (such as [AST](https://huggingface.co/docs/transformers/model_doc/audio-spectrogram-transformer)), so it should not be expected to perform well on general audio benchmarks such as [AudioSet](https://research.google.com/audioset/dataset/index.html).
61
+
62
+ ## Bias, Risks, and Limitations
63
+
64
+ <!-- This section is meant to convey both technical and sociotechnical limitations. -->
65
+
66
+ The MAEST models were trained using Discogs20, an in-house [MTG](https://www.upf.edu/web/mtg) dataset derived from the public Discogs metadata. While we tried to maximize the diversity with respect to the 400 music styles covered in the dataset, we noted an overrepresentation of Western (particularly electronic) music.
67
+
68
+ ## How to Get Started with the Model
69
+
70
+ The MAEST models can be used with the `audio-classification` pipeline of the `transformers` library. For example:
71
+
72
+ ```python
73
+ import numpy as np
74
+ from transformers import pipeline
75
+
76
+ # 30 seconds of random noise standing in for audio sampled at 16 kHz
77
+ audio = np.random.randn(30 * 16000)
78
+
79
+ pipe = pipeline("audio-classification", model="mtg-upf/discogs-maest-30s-pw-73e-ts")
80
+ pipe(audio)
81
+ ```
82
+
83
+ ```
84
+ [{'score': 0.6158794164657593, 'label': 'Electronic---Noise'},
85
+ {'score': 0.08825448155403137, 'label': 'Electronic---Experimental'},
86
+ {'score': 0.08772594481706619, 'label': 'Electronic---Abstract'},
87
+ {'score': 0.03644488751888275, 'label': 'Rock---Noise'},
88
+ {'score': 0.03272806480526924, 'label': 'Electronic---Musique Concrète'}]
89
+ ```
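
The example above feeds exactly 30 seconds of 16 kHz audio. For longer recordings, one option is to cut the signal into consecutive 30-second windows and classify each one. The sketch below only computes the window boundaries; the 30-second length is assumed from the "30s" in the model name, and `segment_bounds` is a hypothetical helper, not part of `transformers`.

```python
SAMPLE_RATE = 16000    # Hz, as in the example above
SEGMENT_SECONDS = 30   # assumption: taken from the "30s" in the model name


def segment_bounds(num_samples, hop_seconds=SEGMENT_SECONDS):
    """Return (start, end) sample indices of every full-length segment.

    A trailing remainder shorter than one segment is dropped.
    """
    seg = SEGMENT_SECONDS * SAMPLE_RATE
    hop = int(hop_seconds * SAMPLE_RATE)
    return [(start, start + seg) for start in range(0, num_samples - seg + 1, hop)]


# A 95-second recording yields three non-overlapping 30-second windows.
bounds = segment_bounds(95 * SAMPLE_RATE)
```

Each `(start, end)` pair can then be used to slice the waveform, e.g. `audio[start:end]`, before passing it to the pipeline. How to combine the per-window predictions (averaging is one possibility) is left open here.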
90
+
91
+ ## Training Details
92
+
93
+ ### Training Data
94
+
95
+ <!-- This should link to a Data Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
96
+
97
+ Our models were trained using Discogs20, an in-house [MTG](https://www.upf.edu/web/mtg) dataset featuring 3.3M music tracks matched to [Discogs](https://www.discogs.com/) metadata.
98
+
99
+ ### Training Procedure
100
+
101
+ <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
102
+
103
+ Most training details are described in the [paper](http://hdl.handle.net/10230/58023) and the [official implementation](https://github.com/palonso/MAEST/) of the model.
104
+
105
+ #### Preprocessing
106
+
107
+ MAEST models rely on mel-spectrograms originally extracted with the Essentia library, and used in several previous publications.
108
+ In Transformers, this mel-spectrogram signature is only approximately replicated using `audio_utils`, which has a small but non-negligible impact on the predictions.
109
+
110
+ ## Evaluation, Metrics, and Results
111
+
112
+ The MAEST models were pre-trained on the task of music style classification, and their internal representations were evaluated via downstream MLP probes on several benchmark music understanding tasks.
113
+ Check the original [paper](http://hdl.handle.net/10230/58023) for details.
114
+
115
+
116
+ ## Environmental Impact
117
+
118
+ <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
119
+
120
+ - **Hardware Type:** 4 x Nvidia RTX 2080 Ti
121
+ - **Hours used:** approx. 32
122
+ - **Carbon Emitted:** approx. 3.46 kg CO2 eq.
123
+
124
+ *Carbon emissions estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).*
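
The card does not state the power draw or the grid carbon intensity behind this estimate. As a rough sketch of the kind of calculation involved (energy consumed times carbon intensity), the snippet below uses 250 W and 0.432 kg CO2 eq. per kWh. Both values are assumptions chosen for illustration and do not come from this model card.

```python
def estimate_emissions_kg(power_watts, hours, kg_co2_per_kwh):
    """Estimate emissions as energy (kWh) times carbon intensity (kg CO2 eq./kWh)."""
    energy_kwh = power_watts / 1000 * hours
    return energy_kwh * kg_co2_per_kwh


# Assumed values (not from the model card): 250 W draw, 0.432 kg CO2 eq./kWh.
estimate = estimate_emissions_kg(power_watts=250, hours=32, kg_co2_per_kwh=0.432)
print(f"{estimate:.3f} kg CO2 eq.")  # 3.456 kg CO2 eq.
```

The result scales linearly with the power figure, so it depends on whether the 32 hours are counted per GPU or in total, which the card does not specify.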
125
+
126
+ ## Technical Specifications
127
+
128
+ ### Model Architecture and Objective
129
+
130
+ [Audio Spectrogram Transformer (AST)](https://huggingface.co/docs/transformers/model_doc/audio-spectrogram-transformer)
131
+
132
+ ### Compute Infrastructure
133
+
134
+ Local infrastructure
135
+
136
+ #### Hardware
137
+
138
+ 4 x Nvidia RTX 2080 Ti
139
+
140
+ #### Software
141
+
142
+ PyTorch
143
+
144
+ ## Citation
145
+
146
+ <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
147
+
148
+ **BibTeX:**
149
+
150
+ ```
151
+ @inproceedings{alonso2023music,
152
+ title={Efficient supervised training of audio transformers for music representation learning},
153
+ author={Alonso-Jim{\'e}nez, Pablo and Serra, Xavier and Bogdanov, Dmitry},
154
+ booktitle={Proceedings of the 24th International Society for Music Information Retrieval Conference (ISMIR 2023)},
155
+ year={2023},
156
+ organization={International Society for Music Information Retrieval (ISMIR)}
157
+ }
158
+ ```
159
+
160
+ **APA:**
161
+
162
+ ```
163
+ Alonso-Jiménez, P., Serra, X., & Bogdanov, D. (2023). Efficient Supervised Training of Audio Transformers for Music Representation Learning. In Proceedings of the 24th International Society for Music Information Retrieval Conference (ISMIR 2023).
164
+ ```
165
+
166
+ ## Model Card Authors
167
+
168
+ Pablo Alonso
169
+
170
+ ## Model Card Contact
171
+
172
+ * Twitter: [@pablo__alonso](https://twitter.com/pablo__alonso)
173
+
174
+ * GitHub: [@palonso](https://github.com/palonso/)
175
+
176
+ * Email: pablo `dot` alonso `at` upf `dot` edu