Commit bac9f3a (parent: 59f3178) committed by antoniorv6: Update README.md

README.md CHANGED
````diff
@@ -3,5 +3,36 @@ license: mit
 pipeline_tag: image-to-text
 datasets:
 - camera_grandstaff
-
-
+tags: [omr, camera_grandstaff]
+arxiv: 2402.07596
+---
+
+# Sheet Music Transformer (base model, fine-tuned on the Camera GrandStaff dataset)
+
+The SMT model fine-tuned on the _Camera_ GrandStaff dataset for pianoform transcription.
+The code of the model is hosted in [this repository](https://github.com/antoniorv6/SMT).
+
+## Model description
+
+The SMT model consists of a vision encoder (ConvNeXt) and a text decoder (a classic Transformer decoder).
+Given an image of a music system, the encoder first encodes the image into a tensor of embeddings (of shape `(batch_size, seq_len, hidden_size)`), after which the decoder autoregressively generates text, conditioned on the encoder's output.
+
+<img src="https://github.com/antoniorv6/SMT/raw/master/graphics/SMT.jpg" alt="drawing" width="720"/>
+
+## Intended uses & limitations
+
+This model is fine-tuned on the _Camera_ GrandStaff dataset, so its use is limited to transcribing pianoform images only.
+
+### BibTeX entry and citation info
+
+```bibtex
+@misc{RiosVila2024,
+  title={Sheet Music Transformer: End-To-End Optical Music Recognition Beyond Monophonic Transcription},
+  author={Antonio Ríos-Vila and Jorge Calvo-Zaragoza and Thierry Paquet},
+  year={2024},
+  eprint={2402.07596},
+  archivePrefix={arXiv},
+  primaryClass={cs.CV},
+  url={https://arxiv.org/abs/2402.07596},
+}
+```
````
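The encode-then-autoregressively-decode flow that the model description adds can be sketched as follows. This is a minimal, illustrative sketch only: `encode_image`, `decode_step`, and `transcribe` are hypothetical stand-ins using random numbers, not the actual SMT API or weights, and the toy shapes merely mirror the `(batch_size, seq_len, hidden_size)` convention stated in the card.

```python
# Illustrative sketch of the described pipeline (NOT the SMT API):
# a vision encoder maps the image to (batch_size, seq_len, hidden_size)
# embeddings, then a decoder generates tokens one at a time, conditioned
# on that encoder output.
import numpy as np

def encode_image(image: np.ndarray, hidden_size: int = 8) -> np.ndarray:
    """Stand-in for the ConvNeXt encoder: maps a 2-D image to a
    (1, seq_len, hidden_size) tensor of embeddings."""
    h, w = image.shape
    seq_len = max(1, (h // 4) * (w // 4))      # toy downsampling factor
    rng = np.random.default_rng(int(image.sum()) % 2**32)
    return rng.standard_normal((1, seq_len, hidden_size))

def decode_step(memory: np.ndarray, tokens: list, vocab_size: int) -> int:
    """Stand-in for one Transformer decoder step: scores the next token
    given the encoder memory and the tokens generated so far."""
    scores = memory.mean(axis=(0, 1))[:vocab_size] + 0.01 * len(tokens)
    return int(np.argmax(scores))

def transcribe(image: np.ndarray, bos: int = 0, eos: int = 1,
               vocab_size: int = 8, max_len: int = 16) -> list:
    """Greedy autoregressive decoding conditioned on the encoded image."""
    memory = encode_image(image)               # (1, seq_len, hidden_size)
    tokens = [bos]
    for _ in range(max_len):                   # autoregressive loop
        nxt = decode_step(memory, tokens, vocab_size)
        tokens.append(nxt)
        if nxt == eos:                         # stop at end-of-sequence
            break
    return tokens
```

In the real model the two stand-ins are learned networks, but the control flow (encode once, then loop the decoder over its own previous outputs) is the one the model description states.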