---
library_name: transformers
license: cc-by-nc-4.0
datasets:
- tahrirchi/dilmash
tags:
- nllb
- karakalpak
language:
- en
- ru
- uz
- kaa
base_model: facebook/nllb-200-distilled-600M
pipeline_tag: translation
---
# Dilmash: Karakalpak Machine Translation Models
This repository contains a collection of machine translation models for the Karakalpak language, developed as part of the research presented in the paper "Open Language Data Initiative: Advancing Low-Resource Machine Translation for Karakalpak".
## Model variations
We provide three variants of our Karakalpak translation model:
| Model | Vocabulary Size | Parameter Count | Unique Features |
|-------|------------|-------------------|-----------------|
| [`dilmash-raw`](https://huggingface.co/tahrirchi/dilmash-raw) | 256,204 | 615M | Original NLLB tokenizer |
| **[`dilmash`](https://huggingface.co/tahrirchi/dilmash)** | **269,399** | **629M** | **Expanded tokenizer** |
| [`dilmash-TIL`](https://huggingface.co/tahrirchi/dilmash-TIL) | 269,399 | 629M | Additional TIL corpus |
**Common attributes:**
- **Base Model:** [nllb-200-600M](https://huggingface.co/facebook/nllb-200-distilled-600M)
- **Primary Dataset:** [Dilmash corpus](https://huggingface.co/datasets/tahrirchi/dilmash)
- **Languages:** Karakalpak, Uzbek, Russian, English
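To sanity-check the vocabulary sizes listed in the table above, you can load each tokenizer and print its length (a minimal sketch; the checkpoint names are those from the table):

```python
from transformers import AutoTokenizer

# Compare vocabulary sizes across the three Dilmash variants
for ckpt in ["tahrirchi/dilmash-raw", "tahrirchi/dilmash", "tahrirchi/dilmash-TIL"]:
    tok = AutoTokenizer.from_pretrained(ckpt)
    print(f"{ckpt}: {len(tok):,} tokens")
```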
## Intended uses & limitations
These models are designed for machine translation tasks involving the Karakalpak language. They support translation between Karakalpak, Uzbek, Russian, and English.
### How to use
You can use these models with the Transformers library. Here's a quick example:
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
model_ckpt = "tahrirchi/dilmash"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)
model = AutoModelForSeq2SeqLM.from_pretrained(model_ckpt)
# Example translation: English -> Karakalpak
input_text = "Here is dilmash translation model."
tokenizer.src_lang = "eng_Latn"
inputs = tokenizer(input_text, return_tensors="pt")

# NLLB-style models need the target language forced as the first decoder token;
# setting tokenizer.tgt_lang alone does not steer generate()
outputs = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("kaa_Latn"),
)
translated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(translated_text)  # Dilmash awdarması modeli.
```
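The same pattern works for the other supported directions by swapping the language codes. Below is a minimal sketch for Uzbek → Karakalpak that reuses the model and tokenizer loaded above; it assumes the model keeps the standard NLLB-200 codes `uzn_Latn` (Uzbek, Latin script) and `rus_Cyrl` (Russian) alongside `kaa_Latn`:

```python
# Uzbek -> Karakalpak, reusing `model` and `tokenizer` from the example above
tokenizer.src_lang = "uzn_Latn"  # assumed NLLB-200 code for Uzbek (Latin script)
inputs = tokenizer("Salom, dunyo!", return_tensors="pt")  # "Hello, world!" in Uzbek
outputs = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("kaa_Latn"),
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```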
## Training data
The models were trained on a parallel corpus of 300,000 sentence pairs, including:
- Uzbek-Karakalpak (100,000 pairs)
- Russian-Karakalpak (100,000 pairs)
- English-Karakalpak (100,000 pairs)
The dataset is available [here](https://huggingface.co/datasets/tahrirchi/dilmash).
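To inspect the training data directly, the corpus can be loaded with the `datasets` library (a minimal sketch; the exact configuration and split names on the Hub may differ):

```python
from datasets import load_dataset

# Load the Dilmash parallel corpus from the Hugging Face Hub
ds = load_dataset("tahrirchi/dilmash")
print(ds)  # inspect the available splits and columns
```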
## Training procedure
For full details of the training procedure, please refer to [our paper](https://arxiv.org/abs/2409.04269).
## Citation
If you use these models in your research, please cite our paper:
```bibtex
@misc{mamasaidov2024openlanguagedatainitiative,
title={Open Language Data Initiative: Advancing Low-Resource Machine Translation for Karakalpak},
author={Mukhammadsaid Mamasaidov and Abror Shopulatov},
year={2024},
eprint={2409.04269},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2409.04269},
}
```
## Gratitude
We are thankful to these awesome organizations and people for helping to make it happen:
- [David Dalé](https://daviddale.ru): for advice throughout the process
- Perizad Najimova: for expertise and assistance with the Karakalpak language
- [Nurlan Pirjanov](https://www.linkedin.com/in/nurlan-pirjanov/): for expertise and assistance with the Karakalpak language
- [Atabek Murtazaev](https://www.linkedin.com/in/atabek/): for advice throughout the process
- Ajiniyaz Nurniyazov: for advice throughout the process
We would also like to express our sincere appreciation to [Google for Startups](https://cloud.google.com/startup) for generously sponsoring the compute resources necessary for our experiments. Their support has been instrumental in advancing our research in low-resource language machine translation.
## Contacts
We believe that this work will enable and inspire enthusiasts around the world to uncover the hidden beauty of low-resource languages, in particular Karakalpak.
For questions about further development or issues with the dataset, please contact [email protected] or [email protected].