---
license: apache-2.0
datasets:
  - Helsinki-NLP/opus_paracrawl
  - turuta/Multi30k-uk
language:
  - uk
  - en
metrics:
  - bleu
library_name: peft
pipeline_tag: text-generation
base_model: mistralai/Mistral-7B-v0.1
tags:
  - translation
model-index:
  - name: Dragoman
    results:
      - task:
          type: translation
          name: English-Ukrainian Translation
        dataset:
          type: facebook/flores
          name: FLORES-101
          config: eng_Latn-ukr_Cyrl
          split: devtest
        metrics:
          - type: bleu
            value: 32.34
            name: Test BLEU
widget:
  - text: '[INST] who holds this neighborhood? [/INST]'
---

# Dragoman: English-Ukrainian Machine Translation Model

## Model Description

Dragoman is a state-of-the-art (SOTA) sentence-level English-Ukrainian translation model. It was trained with a two-phase pipeline: pretraining on a cleaned ParaCrawl dataset, followed by an unsupervised data-selection phase using turuta/Multi30k-uk.

With this two-phase data cleaning and data selection approach, we achieved SOTA performance on the FLORES-101 English-Ukrainian devtest subset, reaching a BLEU score of 32.34.
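This card does not spell out the criterion used in the unsupervised data-selection phase. As a purely illustrative instance of the general technique, one common approach is to keep only the training pairs that a base language model already scores as low-perplexity; the scoring and threshold below are assumptions, not the released pipeline:

```python
# Illustrative perplexity-based data selection; NOT the released pipeline.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
model.eval()

def perplexity(text: str) -> float:
    # Lower perplexity under the base model ~ cleaner, more natural text.
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()

pairs = ["[INST] who holds this neighborhood? [/INST] Хто контролює цей район?"]
selected = [p for p in pairs if perplexity(p) < 20.0]  # threshold is illustrative
```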

## Model Details

- Developed by: Yurii Paniv, Dmytro Chaplynskyi, Nikita Trynus, Volodymyr Kyrylov
- Model type: Translation model
- Language(s):
  - Source Language: English
  - Target Language: Ukrainian
- License: Apache 2.0

## Model Use Cases

We designed this model for sentence-level English → Ukrainian translation. Please be aware that performance on multi-sentence texts is not guaranteed; a simple workaround is sketched below.
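If you need to translate longer passages anyway, one option is to segment the text and translate it sentence by sentence. Here is a minimal sketch; `translate_long_text` and its `translate_sentence` callback are hypothetical helpers, not part of the released code:

```python
import re

def translate_long_text(text: str, translate_sentence) -> str:
    """Translate multi-sentence text one sentence at a time."""
    # Naive split on sentence-final punctuation; use nltk or spacy
    # for more robust segmentation in practice.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return " ".join(translate_sentence(s) for s in sentences if s)
```

Here `translate_sentence` would wrap the `[INST] ... [/INST]` prompting and generation call shown in the next section.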

## Running the model

```python
# pip install bitsandbytes transformers peft torch
import torch
from peft import PeftConfig, PeftModel
from transformers import AutoTokenizer, BitsAndBytesConfig, MistralForCausalLM

# (Optional) inspect the adapter configuration.
config = PeftConfig.from_pretrained("lang-uk/dragoman")

# Load the base model with 4-bit NF4 quantization.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=False,
)

model = MistralForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    quantization_config=quant_config,
    device_map="auto",  # quantized weights are placed on the GPU directly
)
# Apply the Dragoman LoRA adapter on top of the quantized base model.
model = PeftModel.from_pretrained(model, "lang-uk/dragoman")
tokenizer = AutoTokenizer.from_pretrained(
    "mistralai/Mistral-7B-v0.1", use_fast=False, add_bos_token=False
)

# Model input should adhere to this instruction format.
input_text = "[INST] who holds this neighborhood? [/INST]"
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

outputs = model.generate(**input_ids)
print(tokenizer.decode(outputs[0]))
```
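The benchmark scores reported below for Dragoman were obtained with beam search. A sketch of decoding with the same beam count; the token budget here is an assumption, as the card does not specify it:

```python
# Beam-search decoding; the benchmark table reports Dragoman with 10 beams.
outputs = model.generate(
    **input_ids,
    num_beams=10,        # matches the "10 beams" benchmark rows
    max_new_tokens=256,  # assumed cap; not specified in the card
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```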

## Training Dataset and Resources

- Training code: lang-uk/dragoman
- Cleaned ParaCrawl: lang-uk/paracrawl_3m
- Cleaned Multi30K: lang-uk/multi30k-extended-17k
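Both cleaned datasets are hosted on the Hugging Face Hub, so they can be pulled with the `datasets` library. A minimal sketch; the split and column names depend on each repository's layout:

```python
from datasets import load_dataset

# Cleaned ParaCrawl used in the pretraining phase.
paracrawl = load_dataset("lang-uk/paracrawl_3m")
# Extended Multi30K used in the data-selection phase.
multi30k = load_dataset("lang-uk/multi30k-extended-17k")
print(paracrawl)
```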

## Benchmark Results against other models on FLORES-101 devtest

| Model | BLEU $\uparrow$ | spBLEU | chrF | chrF++ |
|---|---|---|---|---|
| **Finetuned** | | | | |
| Dragoman P, 10 beams | 30.38 | 37.93 | 59.49 | 56.41 |
| Dragoman PT, 10 beams | 32.34 | 39.93 | 60.72 | 57.82 |
| **Zero shot and few shot** | | | | |
| LLaMa-2-7B 2-shot | 20.1 | 26.78 | 49.22 | 46.29 |
| RWKV-5-World-7B 0-shot | 21.06 | 26.20 | 49.46 | 46.46 |
| gpt-4 10-shot | 29.48 | 37.94 | 58.37 | 55.38 |
| gpt-4-turbo-preview 0-shot | 30.36 | 36.75 | 59.18 | 56.19 |
| Google Translate 0-shot | 25.85 | 32.49 | 55.88 | 52.48 |
| **Pretrained** | | | | |
| NLLB 3B, 10 beams | 30.46 | 37.22 | 58.11 | 55.32 |
| OPUS-MT, 10 beams | 32.2 | 39.76 | 60.23 | 57.38 |
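For reference, all four metrics in the table can be computed with `sacrebleu`. A minimal sketch, assuming hypotheses and FLORES-101 devtest references as parallel lists of strings (the sentences below are illustrative placeholders); `tokenize="flores101"` yields spBLEU, and `word_order=2` yields chrF++:

```python
# pip install sacrebleu sentencepiece
import sacrebleu

hyps = ["Хто контролює цей район?"]    # model outputs (illustrative)
refs = [["Хто контролює цей район?"]]  # one reference stream, parallel to hyps

bleu   = sacrebleu.corpus_bleu(hyps, refs)                        # BLEU
spbleu = sacrebleu.corpus_bleu(hyps, refs, tokenize="flores101")  # spBLEU
chrf   = sacrebleu.corpus_chrf(hyps, refs)                        # chrF
chrfpp = sacrebleu.corpus_chrf(hyps, refs, word_order=2)          # chrF++
print(bleu.score, spbleu.score, chrf.score, chrfpp.score)
```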

## Citation

TBD