+---
+license: mit
+inference: false
+tags:
+- music
+---
+# Introduction to our model series
+The development log of our Music Audio Pre-training (m-a-p) model family:
+- 17/03/2023: we released two advanced music understanding models, [MERT-v1-95M](https://huggingface.co/m-a-p/MERT-v1-95M) and [MERT-v1-330M](https://huggingface.co/m-a-p/MERT-v1-330M), trained with a new paradigm and dataset. They outperform the previous models and generalize better to more tasks.
+- 14/03/2023: we retrained the MERT-v0 model on open-source-only music data and released it as [MERT-v0-public](https://huggingface.co/m-a-p/MERT-v0-public).
+- 29/12/2022: we released [MERT-v0](https://huggingface.co/m-a-p/MERT-v0), a music understanding model trained with the **MLM** paradigm, which performs better on downstream tasks.
+- 29/10/2022: we released [music2vec](https://huggingface.co/m-a-p/music2vec-v1), a pre-trained MIR model trained with the **BYOL** paradigm.
+The table below gives a quick overview for choosing a model:
+| Name                                                         | Pre-train Paradigm | Training Data (hours) | Pre-train Context (seconds) | Model Size | Transformer Layer-Dimension | Feature Rate | Sample Rate | Release Date |
+| ------------------------------------------------------------ | ------------------ | -------------------- | ---------------------------- | ---------- | --------------------------- | ------------ | ----------- | ------------ |
+| [MERT-v1-330M](https://huggingface.co/m-a-p/MERT-v1-330M)    | MLM                | 160K                 | 5                            | 330M       | 24-1024                     | 75 Hz        | 24K Hz      | 17/03/2023   |
+| [MERT-v1-95M](https://huggingface.co/m-a-p/MERT-v1-95M)      | MLM                | 20K                  | 5                            | 95M        | 12-768                      | 75 Hz        | 24K Hz      | 17/03/2023   |
+| [MERT-v0-public](https://huggingface.co/m-a-p/MERT-v0-public) | MLM                | 900                  | 5                            | 95M        | 12-768                      | 50 Hz        | 16K Hz      | 14/03/2023   |
+| [MERT-v0](https://huggingface.co/m-a-p/MERT-v0)              | MLM                | 1000                 | 5                            | 95M        | 12-768                      | 50 Hz        | 16K Hz      | 29/12/2022   |
+| [music2vec-v1](https://huggingface.co/m-a-p/music2vec-v1)    | BYOL               | 1000                 | 30                           | 95M        | 12-768                      | 50 Hz        | 16K Hz      | 30/10/2022   |
+## Explanation
+The m-a-p models share a similar architecture; the most notable difference between them is the pre-training paradigm. Beyond that, there are a few technical details to be aware of before using them:
+- **Model Size**: the number of parameters loaded into memory. Select a size that fits your hardware.
+- **Transformer Layer-Dimension**: the number of transformer layers and the dimension of the features each layer outputs. This is listed because features extracted from **different layers can perform differently depending on the task**.
+- **Feature Rate**: the number of feature frames the model outputs for 1 second of audio input.
+- **Sample Rate**: the sampling rate of the audio the model was trained on.
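The relationship between Sample Rate and Feature Rate can be sketched with a little arithmetic. The helper names below are hypothetical and not part of any m-a-p code, and the frame count is only an approximation: the exact number of frames the model returns depends on how its front end handles clip boundaries.

```python
# Rough arithmetic relating sample rate, feature rate, and clip length.
# The function names are illustrative only, not part of any m-a-p API.

def samples_per_feature(sample_rate_hz: int, feature_rate_hz: int) -> float:
    """Number of raw audio samples summarized by one output feature frame."""
    return sample_rate_hz / feature_rate_hz


def approx_num_features(duration_s: float, feature_rate_hz: int) -> int:
    """Approximate number of feature frames for a clip of the given length.

    The real model may return slightly fewer frames because of how its
    front end treats the start and end of the clip.
    """
    return int(duration_s * feature_rate_hz)


# Values taken from the table above.
v1_hop = samples_per_feature(24_000, 75)   # MERT-v1 models
v0_hop = samples_per_feature(16_000, 50)   # MERT-v0 models and music2vec
v1_frames_5s = approx_num_features(5, 75)  # a 5-second clip at 75 Hz
print(v1_hop, v0_hop, v1_frames_5s)
```

By this arithmetic, every model in the table covers the same number of input samples per feature frame; the v1 models produce more frames per second because they consume audio at a higher sample rate.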
+# Introduction to MERT-v1
+Compared to MERT-v0, MERT-v1 pre-training introduces several changes:
+- The pseudo labels are changed to 8 codebooks from [encodec](https://github.com/facebookresearch/encodec), which are potentially of higher quality and enable our model to support music generation.
+- MLM prediction with in-batch noise mixture.
+- Training on audio with a higher sampling rate (24K Hz).
+- Training on more audio data (up to 160 thousand hours).
+- More available model sizes: 95M and 330M.
+More details will be provided in our upcoming paper.
+# Model Usage
+```python
+from transformers import Wav2Vec2Processor
+from transformers import AutoModel
+import torch
+from torch import nn
+from datasets import load_dataset
+# load demo audio and set processor
+dataset = load_dataset("hf-internal-testing/librispeech_asr_demo", "clean", split="validation")
+dataset = dataset.sort("id")
+sampling_rate = dataset.features["audio"].sampling_rate
+processor = Wav2Vec2Processor.from_pretrained("facebook/hubert-large-ls960-ft")
+# load our model weights
+commit_hash = '7bab7bb5d8b52448eff4873a980dc17f0015a09c'  # pinning a revision is recommended for security reasons; the hash may be updated
+model = AutoModel.from_pretrained("m-a-p/MERT-v1-330M", trust_remote_code=True, revision=commit_hash)
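+# audio file is decoded on the fly
+# note: this demo audio and processor operate at 16 kHz, while MERT-v1 was trained on 24 kHz audio (see the table above)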
+inputs = processor(dataset[0]["audio"]["array"], sampling_rate=sampling_rate, return_tensors="pt")
+with torch.no_grad():
+    outputs = model(**inputs, output_hidden_states=True)
+# inspect the output shape: there are 25 layers of representations
+# each layer performs differently on different downstream tasks, so choose one empirically
+all_layer_hidden_states = torch.stack(outputs.hidden_states).squeeze()
+print(all_layer_hidden_states.shape) # [25 layer, Time steps, 1024 feature_dim]
+# for utterance-level classification tasks, you can simply average the representation over time
+time_reduced_hidden_states = all_layer_hidden_states.mean(-2)
+print(time_reduced_hidden_states.shape) # [25, 1024]
+# you can even use a learnable weighted average over the layers
+aggregator = nn.Conv1d(in_channels=25, out_channels=1, kernel_size=1)  # in_channels must match the 25 stacked layer outputs
+weighted_avg_hidden_states = aggregator(time_reduced_hidden_states.unsqueeze(0)).squeeze()
+print(weighted_avg_hidden_states.shape) # [1024]
+```
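As an alternative to the `nn.Conv1d` aggregator above, one common way to learn a weighted average across layers is to keep a single softmax-normalized scalar per layer. The following is a minimal sketch under stated assumptions: the `LayerWeightedAverage` class is hypothetical (it is not part of the MERT code), and a random tensor stands in for `time_reduced_hidden_states` so the snippet runs without downloading the model.

```python
import torch
from torch import nn


class LayerWeightedAverage(nn.Module):
    """Hypothetical helper: learn one scalar weight per layer."""

    def __init__(self, num_layers: int):
        super().__init__()
        # Zero-initialized logits give a uniform average before training.
        self.logits = nn.Parameter(torch.zeros(num_layers))

    def forward(self, layer_features: torch.Tensor) -> torch.Tensor:
        # layer_features: [num_layers, feature_dim]
        weights = torch.softmax(self.logits, dim=0)  # non-negative, sums to 1
        return (weights.unsqueeze(-1) * layer_features).sum(dim=0)


# Random stand-in for time_reduced_hidden_states from MERT-v1-330M: [25, 1024]
dummy_features = torch.randn(25, 1024)
pooler = LayerWeightedAverage(num_layers=25)
pooled = pooler(dummy_features)
print(pooled.shape)  # torch.Size([1024])
```

Because the weights are constrained to sum to 1, they can be read after training as a rough indication of which layers the downstream task relies on, which the unconstrained convolution does not offer directly.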
+# Citation
+```bibtex
+@article{li2022large,
+  title={Large-Scale Pretrained Model for Self-Supervised Music Audio Representation Learning},
+  author={Li, Yizhi and Yuan, Ruibin and Zhang, Ge and Ma, Yinghao and Lin, Chenghua and Chen, Xingran and Ragni, Anton and Yin, Hanzhi and Hu, Zhijie and He, Haoyu and others},
+  year={2022}
+}
+```