---
license: apache-2.0
datasets:
- Helsinki-NLP/opus_paracrawl
- turuta/Multi30k-uk
language:
- uk
- en
metrics:
- bleu
library_name: peft
pipeline_tag: text-generation
base_model: mistralai/Mistral-7B-v0.1
tags:
- translation
model-index:
- name: Dragoman
  results:
  - task:
      type: translation
      name: English-Ukrainian Translation
    dataset:
      type: facebook/flores
      name: FLORES-101
      config: eng_Latn-ukr_Cyrl
      split: devtest
    metrics:
    - type: bleu
      value: 32.34
      name: Test BLEU
widget:
- text: '[INST] who holds this neighborhood? [/INST]'
---
# Dragoman: English-Ukrainian Machine Translation Model

## Model Description
Dragoman is a sentence-level, state-of-the-art English-Ukrainian translation model. It was trained with a two-phase pipeline: pretraining on a cleaned version of the Paracrawl dataset, followed by an unsupervised data selection phase on turuta/Multi30k-uk.

With this two-phase data cleaning and data selection approach, the model achieves state-of-the-art performance on the FLORES-101 English-Ukrainian devtest subset with a BLEU score of 32.34.
## Model Details

- Developed by: Yurii Paniv, Dmytro Chaplynskyi, Nikita Trynus, Volodymyr Kyrylov
- Model type: Translation model
- Language(s):
  - Source Language: English
  - Target Language: Ukrainian
- License: Apache 2.0
## Model Use Cases

This model is designed for sentence-level English → Ukrainian translation. Performance on multi-sentence texts is not guaranteed, so please keep this in mind when translating longer inputs.
## Running the model

```python
# pip install bitsandbytes transformers peft torch
import torch
from peft import PeftConfig, PeftModel
from transformers import AutoTokenizer, BitsAndBytesConfig, MistralForCausalLM

config = PeftConfig.from_pretrained("lang-uk/dragoman")

# load the base model with 4-bit NF4 quantization
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=False,
)

model = MistralForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1", quantization_config=quant_config
)

# attach the Dragoman LoRA adapter to the quantized base model
model = PeftModel.from_pretrained(model, "lang-uk/dragoman").to("cuda")

tokenizer = AutoTokenizer.from_pretrained(
    "mistralai/Mistral-7B-v0.1", use_fast=False, add_bos_token=False
)

input_text = "[INST] who holds this neighborhood? [/INST]"  # model input should adhere to this format
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

outputs = model.generate(**input_ids)
print(tokenizer.decode(outputs[0]))
```
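The Dragoman numbers in the benchmark table below were obtained with beam search (10 beams). The snippet below is a minimal decoding sketch under that assumption, reusing the `model`, `tokenizer`, and `input_ids` objects from above; the exact generation settings used for evaluation are not specified here, and `max_new_tokens` is only an illustrative cap.

```python
# Sketch: beam-search decoding with 10 beams, matching the "10 beams" setting
# reported in the benchmark table below (assumed, not the exact evaluation script).
outputs = model.generate(
    **input_ids,
    num_beams=10,
    max_new_tokens=256,   # illustrative limit on translation length
    early_stopping=True,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```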
## Training Dataset and Resources

- Training code: lang-uk/dragoman
- Cleaned Paracrawl: lang-uk/paracrawl_3m
- Cleaned Multi30K: lang-uk/multi30k-extended-17k
## Benchmark Results against other models on the FLORES-101 devtest subset
| Model | BLEU $\uparrow$ | spBLEU | chrF | chrF++ |
|---|---|---|---|---|
| **Finetuned** | | | | |
| Dragoman P, 10 beams | 30.38 | 37.93 | 59.49 | 56.41 |
| Dragoman PT, 10 beams | 32.34 | 39.93 | 60.72 | 57.82 |
| **Zero shot and few shot** | | | | |
| LLaMa-2-7B 2-shot | 20.1 | 26.78 | 49.22 | 46.29 |
| RWKV-5-World-7B 0-shot | 21.06 | 26.20 | 49.46 | 46.46 |
| gpt-4 10-shot | 29.48 | 37.94 | 58.37 | 55.38 |
| gpt-4-turbo-preview 0-shot | 30.36 | 36.75 | 59.18 | 56.19 |
| Google Translate 0-shot | 25.85 | 32.49 | 55.88 | 52.48 |
| **Pretrained** | | | | |
| NLLB 3B, 10 beams | 30.46 | 37.22 | 58.11 | 55.32 |
| OPUS-MT, 10 beams | 32.2 | 39.76 | 60.23 | 57.38 |
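Scores of this kind can be computed with the `sacrebleu` package. The snippet below is a minimal sketch assuming you already have lists of system outputs and reference translations; it is not necessarily the exact evaluation script used for the table above, and the example sentences are purely illustrative.

```python
# Sketch: scoring translations with sacrebleu (assumed evaluation setup,
# not necessarily the authors' exact script; sentences are illustrative).
import sacrebleu

hypotheses = ["Хто контролює цей район?"]   # system outputs
references = [["Хто тримає цей район?"]]    # one reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf = sacrebleu.corpus_chrf(hypotheses, references)                  # chrF
chrfpp = sacrebleu.corpus_chrf(hypotheses, references, word_order=2)  # chrF++
print(bleu.score, chrf.score, chrfpp.score)
```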
## Citation

TBD