---
license: apache-2.0
datasets:
- Helsinki-NLP/opus_paracrawl
- turuta/Multi30k-uk
language:
- uk
- en
metrics:
- bleu
library_name: peft
pipeline_tag: text-generation
base_model: mistralai/Mistral-7B-v0.1
tags:
- translation
model-index:
- name: Dragoman
  results:
  - task:
      type: translation
      name: English-Ukrainian Translation
    dataset:
      type: facebook/flores
      name: FLORES-101
      config: eng_Latn-ukr_Cyrl
      split: devtest
    metrics:
      - type: bleu
        value: 32.34
        name: Test BLEU
widget:
- text: "[INST] who holds this neighborhood? [/INST]"
---

# Dragoman: English-Ukrainian Machine Translation Model

## Model Description

Dragoman is a sentence-level, state-of-the-art English-Ukrainian translation model. It is trained with a two-phase pipeline: pretraining on the cleaned [Paracrawl](https://huggingface.co/datasets/Helsinki-NLP/opus_paracrawl) dataset, followed by an unsupervised data-selection phase on [turuta/Multi30k-uk](https://huggingface.co/datasets/turuta/Multi30k-uk).

Using this two-phase data cleaning and data selection approach, we achieved state-of-the-art performance on the FLORES-101 English-Ukrainian devtest subset with a **BLEU** score of `32.34`.


## Model Details

- **Developed by:** Yurii Paniv, Dmytro Chaplynskyi, Nikita Trynus, Volodymyr Kyrylov 
- **Model type:** Translation model
- **Language(s):**  
  - Source Language: English
  - Target Language: Ukrainian
- **License:** Apache 2.0
  
## Model Use Cases

We designed this model for sentence-level English → Ukrainian translation.
Please be aware that performance on multi-sentence texts is not guaranteed; translate longer passages one sentence at a time.


### Running the model


```python
# pip install bitsandbytes transformers peft torch
import torch
from peft import PeftConfig, PeftModel
from transformers import AutoTokenizer, BitsAndBytesConfig, MistralForCausalLM

config = PeftConfig.from_pretrained("lang-uk/dragoman")

# Load the Mistral-7B base model with 4-bit NF4 quantization.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=False,
)

model = MistralForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1", quantization_config=quant_config
)
# Attach the Dragoman LoRA adapter on top of the quantized base model.
model = PeftModel.from_pretrained(model, "lang-uk/dragoman").to("cuda")
tokenizer = AutoTokenizer.from_pretrained(
    "mistralai/Mistral-7B-v0.1", use_fast=False, add_bos_token=False
)

# Model input must follow the [INST] ... [/INST] instruction format.
input_text = "[INST] who holds this neighborhood? [/INST]"
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

outputs = model.generate(**input_ids)
print(tokenizer.decode(outputs[0]))
```
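
Because Dragoman is a sentence-level model, longer passages are best translated one sentence at a time. The following is a minimal sketch, not part of the official API: it reuses the `model` and `tokenizer` loaded above, introduces a hypothetical `translate_sentence` helper, and uses a naive regex sentence splitter for illustration.

```python
import re

def translate_sentence(sentence: str) -> str:
    # Wrap the source sentence in the instruction format expected by Dragoman.
    prompt = f"[INST] {sentence} [/INST]"
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    outputs = model.generate(**inputs, max_new_tokens=256)
    decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)
    # The generated sequence echoes the prompt; keep only the part after [/INST].
    return decoded.split("[/INST]")[-1].strip()

text = "The weather is lovely today. Let's take a walk by the river."
# Naive splitter for illustration; use a proper sentence segmenter in practice.
sentences = re.split(r"(?<=[.!?])\s+", text)
print(" ".join(translate_sentence(s) for s in sentences))
```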

### Running the model with mlx-lm on an Apple computer


We merged the Dragoman PT adapter into the base model and uploaded a quantized version to [lang-uk/dragoman-4bit](https://huggingface.co/lang-uk/dragoman-4bit).

You can run the model using [mlx-lm](https://pypi.org/project/mlx-lm/).


```
python -m mlx_lm.generate --model lang-uk/dragoman-4bit --prompt '[INST] who holds this neighborhood? [/INST]' --temp 0 --max-tokens 100
```

MLX is the recommended way to run the model on Apple computers with an M1 chip or newer.
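
mlx-lm also exposes a Python API. The following is a minimal sketch; the exact `generate` keyword arguments may differ between mlx-lm versions.

```python
from mlx_lm import load, generate

# Load the merged, 4-bit quantized Dragoman model.
model, tokenizer = load("lang-uk/dragoman-4bit")

prompt = "[INST] who holds this neighborhood? [/INST]"
print(generate(model, tokenizer, prompt=prompt, max_tokens=100))
```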


### Running the model with llama.cpp

We converted the Dragoman PT adapter into the [GGLA format](https://huggingface.co/lang-uk/dragoman/blob/main/ggml-adapter-model.bin).

You can download the [Mistral-7B-v0.1 base model in the GGUF format](https://huggingface.co/TheBloke/Mistral-7B-v0.1-GGUF) (e.g., `mistral-7b-v0.1.Q4_K_M.gguf`)
and use `ggml-adapter-model.bin` from this repository like this:

```
./main -ngl 32 -m mistral-7b-v0.1.Q4_K_M.gguf --color -c 4096 --temp 0 --repeat_penalty 1.1 -n -1 -p "[INST] who holds this neighborhood? [/INST]" --lora ./ggml-adapter-model.bin
```
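
If you prefer to stay in Python, the same GGUF base model and adapter can be loaded through [llama-cpp-python](https://pypi.org/project/llama-cpp-python/). This is a hedged sketch rather than a tested recipe: whether the GGLA adapter loads depends on your llama.cpp / llama-cpp-python version, since newer releases changed LoRA adapter handling.

```python
from llama_cpp import Llama

llm = Llama(
    model_path="mistral-7b-v0.1.Q4_K_M.gguf",  # base model in GGUF format
    lora_path="ggml-adapter-model.bin",        # Dragoman adapter from this repository
    n_ctx=4096,
    n_gpu_layers=32,
)

output = llm(
    "[INST] who holds this neighborhood? [/INST]",
    max_tokens=100,
    temperature=0.0,
)
print(output["choices"][0]["text"])
```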

### Training Dataset and Resources

Training code: [lang-uk/dragoman](https://github.com/lang-uk/dragoman)  
Cleaned Paracrawl: [lang-uk/paracrawl_3m](https://huggingface.co/datasets/lang-uk/paracrawl_3m)  
Cleaned Multi30K: [lang-uk/multi30k-extended-17k](https://huggingface.co/datasets/lang-uk/multi30k-extended-17k)



### Benchmark Results against other models on FLORES-101 devtest


| **Model**                  | **BLEU** ↑ | **spBLEU** | **chrF** | **chrF++** |
|----------------------------|------------|------------|----------|------------|
| **Finetuned**              |            |            |          |            |
| Dragoman P, 10 beams       | 30.38      | 37.93      | 59.49    | 56.41      |
| Dragoman PT, 10 beams      | **32.34**  | **39.93**  | **60.72**| **57.82**  |
| **Zero shot and few shot** |            |            |          |            |
| LLaMa-2-7B 2-shot          | 20.1       | 26.78      | 49.22    | 46.29      |
| RWKV-5-World-7B 0-shot     | 21.06      | 26.20      | 49.46    | 46.46      |
| gpt-4 10-shot              | 29.48      | 37.94      | 58.37    | 55.38      |
| gpt-4-turbo-preview 0-shot | 30.36      | 36.75      | 59.18    | 56.19      |
| Google Translate 0-shot    | 25.85      | 32.49      | 55.88    | 52.48      |
| **Pretrained**             |            |            |          |            |
| NLLB 3B, 10 beams          | 30.46      | 37.22      | 58.11    | 55.32      |
| OPUS-MT, 10 beams          | 32.2       | 39.76      | 60.23    | 57.38      |
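
To sanity-check these numbers on your own setup, BLEU on the FLORES devtest split can be recomputed roughly as follows. This is a sketch, not the authors' evaluation script; it assumes a `translate_sentence` helper like the one sketched earlier and uses `sacrebleu` with the `facebook/flores` dataset referenced in the metadata above.

```python
# pip install sacrebleu datasets
import sacrebleu
from datasets import load_dataset

# English-Ukrainian pair of the FLORES devtest split; depending on your
# `datasets` version this may require trust_remote_code=True.
flores = load_dataset("facebook/flores", "eng_Latn-ukr_Cyrl", split="devtest")

sources = flores["sentence_eng_Latn"]
references = flores["sentence_ukr_Cyrl"]

# translate_sentence is the hypothetical helper from the sketch above.
hypotheses = [translate_sentence(src) for src in sources]

bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU: {bleu.score:.2f}")
```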


## Citation

```
@inproceedings{paniv-etal-2024-dragoman,
    title = "Setting up the Data Printer with Improved {E}nglish to {U}krainian Machine Translation",
    author = "Paniv, Yurii  and
      Chaplynskyi, Dmytro  and
      Trynus, Nikita  and
      Kyrylov, Volodymyr",
    editor = "Romanyshyn, Mariana  and
      Romanyshyn, Nataliia  and
      Hlybovets, Andrii  and
      Ignatenko, Oleksii",
    booktitle = "Proceedings of the Third Ukrainian Natural Language Processing Workshop (UNLP) @ LREC-COLING 2024",
    month = may,
    year = "2024",
    address = "Torino, Italia",
    publisher = "ELRA and ICCL",
    url = "https://aclanthology.org/2024.unlp-1.6",
    pages = "41--50",
    abstract = "To build large language models for Ukrainian we need to expand our corpora with large amounts of new algorithmic tasks expressed in natural language. Examples of task performance expressed in English are abundant, so with a high-quality translation system our community will be enabled to curate datasets faster. To aid this goal, we introduce a recipe to build a translation system using supervised finetuning of a large pretrained language model with a noisy parallel dataset of 3M pairs of Ukrainian and English sentences followed by a second phase of training using 17K examples selected by k-fold perplexity filtering on another dataset of higher quality. Our decoder-only model named Dragoman beats performance of previous state of the art encoder-decoder models on the FLORES devtest set.",
}
```