This is, to the author's knowledge, the first coherent Aromanian translator. It is an NLLB-200-600M model fine-tuned on this dataset to translate between Aromanian (rup) and Romanian (ron).

Read more about AroTranslate at this GitHub repository.

We present the following results:

Metric     ron -> rup   rup -> ron
BLEU       35.31        54.69
ChrF2++    61.27        68.87

Note:

  • Aromanian does not have a standard writing system, so normalize Aromanian input with the clean_text function in the code below before translating.
  • For Romanian text, use diacritics (ă, â, î, ș, ț) to get the best translation results.
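The second note can be checked mechanically before sending text to the model. The following is a minimal sketch, not part of AroTranslate: the helper name `lacks_romanian_diacritics` and the `min_letters` threshold are assumptions made for illustration, and the check is only a heuristic (a long Romanian sentence with no diacritics at all is unusual, but not impossible).

```python
# Characters that mark Romanian diacritics, including the legacy
# cedilla variants that clean_text later converts to comma-below forms.
ROMANIAN_DIACRITICS = set('ăâîșțĂÂÎȘȚşţŞŢ')


def lacks_romanian_diacritics(text, min_letters=20):
    """Heuristic: return True if `text` is long enough that diacritics
    would be expected, yet contains none.

    `min_letters` is an assumed threshold, not a value from the model card.
    """
    letters = [ch for ch in text if ch.isalpha()]
    if len(letters) < min_letters:
        # Too short to judge: many short phrases have no diacritics.
        return False
    return not any(ch in ROMANIAN_DIACRITICS for ch in letters)
```

A caller could use this to warn the user, or to skip inputs that were typed without diacritics, before calling the translation function.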

How to use the model:

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import re

# load model and tokenizer:
model = AutoModelForSeq2SeqLM.from_pretrained('alexjerpelea/NLLB-aromanian-romanian-v1')
tokenizer = AutoTokenizer.from_pretrained('alexjerpelea/NLLB-aromanian-romanian-v1')

# translate function:
def translate(
    text, src_lang='ron_Latn', tgt_lang='rup_Latn',
    a=32, b=3, max_input_length=1024, num_beams=4, **kwargs
):
    tokenizer.src_lang = src_lang
    tokenizer.tgt_lang = tgt_lang
    inputs = tokenizer(
        text, return_tensors='pt', padding=True, truncation=True,
        max_length=max_input_length
    )
    model.eval()
    result = model.generate(
        **inputs.to(model.device),
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang),
        max_new_tokens=int(a + b * inputs.input_ids.shape[1]),
        num_beams=num_beams, **kwargs
    )
    return tokenizer.batch_decode(result, skip_special_tokens=True)


def clean_text(text, lang):
    if isinstance(text, float):
        return text

    # consecutive spaces
    text = re.sub(r'\s+', ' ', text).strip()
    # old romanian î in the middle of the word
    text = re.sub(r'(?<=\w)î(?=\w)', 'â', text)

    if lang == 'ron':
        text = text.replace('Ş', 'Ș')
        text = text.replace('ş', 'ș')
        text = text.replace('Ţ', 'Ț')
        text = text.replace('ţ', 'ț')
    else:
        text = text.replace('ş', 'sh')
        text = text.replace('ș', 'sh')
        text = text.replace('ţ', 'ts')
        text = text.replace('ț', 'ts')
        text = text.replace('Ş', 'Sh')
        text = text.replace('Ș', 'Sh')
        text = text.replace('Ţ', 'Ts')
        text = text.replace('Ț', 'Ts')

        text = text.replace('ľ', 'lj')
        text = text.replace('Ľ', 'Lj')

        text = text.replace("l'", "lj")
        text = text.replace("l’", "lj")
        text = text.replace("L'", "Lj")
        text = text.replace("L’", "Lj")

        text = text.replace('ḑ', 'dz')
        text = text.replace('Ḑ', 'Dz')
        text = text.replace('ḍ', 'dz')
        text = text.replace('Ḍ', 'Dz')

        # TODO: add n'
        text = text.replace('ń', 'nj')
        text = text.replace('Ń', 'Nj')
        text = text.replace('ñ', 'nj')
        text = text.replace('Ñ', 'Nj')

        text = text.replace('ă', 'ã')
        text = text.replace('Â', 'Ã')
        text = text.replace('â', 'ã')
        text = text.replace('Ă', 'Ã')
        text = text.replace('á', 'ã')
        text = text.replace('à', 'ã')
        text = text.replace('Á', 'Ã')
        text = text.replace('À', 'Ã')

        text = text.replace('Î', 'Ã')
        text = text.replace('î', 'ã')

        # weird foreign characters
        text = text.replace('ŭ', 'u')
        text = text.replace('ς', 'c')
        text = text.replace('é', 'e')
        text = text.replace('í', 'i')
        text = text.replace('ū', 'u')
        text = text.replace('ì', 'i')
        text = text.replace('ā', 'a')
        text = text.replace('ĭ', 'i')
        text = text.replace('γ', 'y')
        text = text.replace('ï', 'i')
        text = text.replace('ó', 'o')
        text = text.replace('θ', 'O')

    # for both languages:
    text = text.replace('—', '-')
    text = text.replace('–', '-')
    text = text.replace('…', '...')
    text = text.replace('*', '')
    text = text.replace('<', '')
    text = text.replace('>', '')

    text = text.replace('„', '"')
    text = text.replace('”', '"')
    text = text.replace('“', '"')
    text = text.replace('”', '"')

    text = text.replace('\xa0', '')
    text = text.replace('\ufeff', '')
    text = text.replace('\n', '') 

    return text

# Aromanian to Romanian:
t = '''Trã atsea cãdzu pri mare cripare, shi tutã dzua stãtea ãnvirinat.'''
t = clean_text(t, 'rup')
print(translate(t, 'rup_Latn', 'ron_Latn'))

# Romanian to Aromanian:
t = '''Apoi se opri puțin, o sorbi din ochi, o sărută și - când începu să scâncească, îi cântă iar:'''
t = clean_text(t, 'ron')
print(translate(t, 'ron_Latn', 'rup_Latn'))
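The `translate` function above passes its input straight to the tokenizer with `padding=True`, so it should also accept a list of sentences. For longer inputs it can help to send such a list in fixed-size batches to limit memory use. The sketch below is one way to do that; `translate_in_batches` and its `batch_size` default are hypothetical names and values chosen for illustration, and the translation callable is passed in so the helper does not depend on the model being loaded.

```python
def translate_in_batches(texts, translate_fn, batch_size=8):
    """Call `translate_fn` on consecutive slices of `texts` and
    concatenate the results in the original order.

    `translate_fn` is assumed to take a list of strings and return a
    list of strings of the same length.
    """
    if batch_size < 1:
        raise ValueError('batch_size must be at least 1')
    outputs = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        outputs.extend(translate_fn(batch))
    return outputs


# Example wiring with the functions defined above (not executed here):
# sentences = [clean_text(s, 'rup') for s in raw_sentences]
# results = translate_in_batches(
#     sentences, lambda batch: translate(batch, 'rup_Latn', 'ron_Latn')
# )
```

Smaller batches trade throughput for lower peak memory; the right size depends on the hardware and sentence lengths.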

License

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License. When using this work, please mention its name as "AroTranslate" and the author.

Model size: 615M params (Safetensors), tensor type F32.