---
library_name: transformers
license: cc-by-nc-4.0
datasets:
- tahrirchi/dilmash
tags:
- nllb
- karakalpak
language:
- en
- ru
- uz
- kaa
base_model: facebook/nllb-200-distilled-600M
pipeline_tag: translation
---

# Dilmash: Karakalpak Machine Translation Models

This repository contains a collection of machine translation models for the Karakalpak language, developed as part of the research paper "Open Language Data Initiative: Advancing Low-Resource Machine Translation for Karakalpak".

## Model variations

We provide three variants of our Karakalpak translation model:

| Model | Vocabulary Size | Parameter Count | Unique Features |
|-------|-----------------|-----------------|-----------------|
| [`dilmash-raw`](https://huggingface.co/tahrirchi/dilmash-raw) | 256,204 | 615M | Original NLLB tokenizer |
| [`dilmash`](https://huggingface.co/tahrirchi/dilmash) | 269,399 | 629M | Expanded tokenizer |
| [**`dilmash-TIL`**](https://huggingface.co/tahrirchi/dilmash-TIL) | **269,399** | **629M** | **Additional TIL corpus** |

**Common attributes:**

- **Base Model:** [nllb-200-distilled-600M](https://huggingface.co/facebook/nllb-200-distilled-600M)
- **Primary Dataset:** [Dilmash corpus](https://huggingface.co/datasets/tahrirchi/dilmash)
- **Languages:** Karakalpak, Uzbek, Russian, English

## Intended uses & limitations

These models are designed for machine translation tasks involving the Karakalpak language. They can be used for translation among Karakalpak, Uzbek, Russian, and English.

### How to use

You can use these models with the Transformers library. Here's a quick example:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_ckpt = "tahrirchi/dilmash-TIL"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = AutoModelForSeq2SeqLM.from_pretrained(model_ckpt)

# Example translation from English to Karakalpak
input_text = "Here is dilmash translation model."
tokenizer.src_lang = "eng_Latn"
tokenizer.tgt_lang = "kaa_Latn"

inputs = tokenizer(input_text, return_tensors="pt")

# NLLB-style models need the target-language token forced as the first
# decoder token; setting tokenizer.tgt_lang alone does not steer generation.
outputs = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids(tokenizer.tgt_lang),
)

translated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(translated_text)  # Dilmash awdarması modeli.
```

## Training data

The models were trained on a parallel corpus of 300,000 sentence pairs, comprising:

- Uzbek-Karakalpak (100,000 pairs)
- Russian-Karakalpak (100,000 pairs)
- English-Karakalpak (100,000 pairs)

The dataset is available [here](https://huggingface.co/datasets/tahrirchi/dilmash); a minimal loading sketch follows the training details below.

## Training procedure

For full details of the training procedure, please refer to [our paper](https://arxiv.org/abs/2409.04269).
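If you would like to inspect the training data directly, the corpus can be loaded with the `datasets` library. This is a minimal sketch that makes no assumptions about the split or column names; it simply prints whatever schema the dataset ships with:

```python
from datasets import load_dataset

# Load the Dilmash parallel corpus from the Hugging Face Hub.
dataset = load_dataset("tahrirchi/dilmash")

# Inspect the splits and schema before building a pipeline on top of it;
# the split and column names are whatever the dataset itself defines.
print(dataset)
first_split = next(iter(dataset.values()))
print(first_split[0])  # one example, showing the actual column layout
```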
## Citation

If you use these models in your research, please cite our paper:

```bibtex
@misc{mamasaidov2024openlanguagedatainitiative,
      title={Open Language Data Initiative: Advancing Low-Resource Machine Translation for Karakalpak},
      author={Mukhammadsaid Mamasaidov and Abror Shopulatov},
      year={2024},
      eprint={2409.04269},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2409.04269},
}
```

## Gratitude

We are thankful to these awesome organizations and people for helping to make this happen:

- [David Dalé](https://daviddale.ru): for advice throughout the process
- Perizad Najimova: for expertise and assistance with the Karakalpak language
- [Nurlan Pirjanov](https://www.linkedin.com/in/nurlan-pirjanov/): for expertise and assistance with the Karakalpak language
- [Atabek Murtazaev](https://www.linkedin.com/in/atabek/): for advice throughout the process
- Ajiniyaz Nurniyazov: for advice throughout the process

We would also like to express our sincere appreciation to [Google for Startups](https://cloud.google.com/startup) for generously sponsoring the compute resources necessary for our experiments. Their support has been instrumental in advancing our research in low-resource machine translation.

## Contacts

We believe that this work will enable and inspire enthusiasts around the world to uncover the hidden beauty of low-resource languages, in particular Karakalpak.

For questions about further development or issues with the dataset, please contact m.mamasaidov@tahrirchi.uz or a.shopolatov@tahrirchi.uz.