---
language:
- vi
- lo
tags:
- translation
license: mit
widget:
- text: "Tôi muốn mua một cuốn sách"
inference:
  parameters:
    max_length: 200
pipeline_tag: translation
library_name: transformers
---

# Vietnamese to Lao Translation Model

Translation models for low-resource languages help make cross-cultural communication and knowledge exchange possible. This model addresses one such language pair: it translates Vietnamese text into Lao.

Lao, spoken primarily in Laos and parts of Thailand, is a low-resource language for machine translation: parallel corpora and other linguistic resources are limited. Vietnamese is spoken by millions of people worldwide and shares some linguistic similarities with Lao, which makes it a natural source language for this pair.

The model is a fine-tuned T5, a Transformer-based model that performs well across a wide range of NLP tasks. It was fine-tuned on a curated dataset of Lao-Vietnamese parallel texts with the goal of improving translation accuracy and fluency.

By applying current NLP techniques to the specific linguistic characteristics of the Lao-Vietnamese pair, we aim to provide a useful resource for communication and cultural exchange between speakers of the two languages.

## How to use

### On GPU

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the tokenizer and model, then move the model to the GPU
tokenizer = AutoTokenizer.from_pretrained("minhtoan/t5-translate-vietnamese-lao")
model = AutoModelForSeq2SeqLM.from_pretrained("minhtoan/t5-translate-vietnamese-lao")
model.cuda()
model.eval()  # inference mode: disables dropout

src = "Tôi muốn mua một cuốn sách"

# Tokenize the Vietnamese source sentence and move the input IDs to the GPU
tokenized_text = tokenizer.encode(src, return_tensors="pt").cuda()

# Generate at most 200 tokens and decode them into Lao text
translate_ids = model.generate(tokenized_text, max_length=200)
output = tokenizer.decode(translate_ids[0], skip_special_tokens=True)
output
```

'ຂ້ອຍຢາກຊື້ປຶ້ມ'

### On CPU

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the tokenizer and model; the model stays on the CPU by default
tokenizer = AutoTokenizer.from_pretrained("minhtoan/t5-translate-vietnamese-lao")
model = AutoModelForSeq2SeqLM.from_pretrained("minhtoan/t5-translate-vietnamese-lao")

src = "Tôi muốn mua một cuốn sách"

# Tokenize the source sentence, truncating inputs longer than 200 tokens
inputs = tokenizer(src, max_length=200, truncation=True, return_tensors="pt")

# Pass the attention mask along with the input IDs, then decode the result
outputs = model.generate(**inputs, max_new_tokens=200)
output = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
output
```

'ຂ້ອຍຢາກຊື້ປຶ້ມ'
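
To translate several sentences in one go, the single-sentence snippets can be wrapped in a small batching helper. The sketch below is one possible way to do this and is not part of the model's own API: `batched` and `translate_batch` are hypothetical helper names, and the default `batch_size=8` is an arbitrary assumption. It expects a `tokenizer` and `model` loaded as shown earlier, and it works on either CPU or GPU because the inputs are moved to whatever device the model is on.

```python
from typing import Iterator, List, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive chunks of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def translate_batch(
    sentences: Sequence[str],
    tokenizer,
    model,
    batch_size: int = 8,   # assumption: tune to the available memory
    max_length: int = 200,  # same limit as the single-sentence examples
) -> List[str]:
    """Translate Vietnamese sentences to Lao, one padded batch at a time."""
    translations: List[str] = []
    for batch in batched(sentences, batch_size):
        # Pad to the longest sentence in the batch. The attention mask returned
        # by the tokenizer marks which positions are padding.
        inputs = tokenizer(
            batch,
            return_tensors="pt",
            padding=True,
            truncation=True,
            max_length=max_length,
        ).to(model.device)
        output_ids = model.generate(**inputs, max_new_tokens=max_length)
        translations.extend(
            tokenizer.batch_decode(output_ids, skip_special_tokens=True)
        )
    return translations
```

Calling `translate_batch(["Tôi muốn mua một cuốn sách", "Xin chào"], tokenizer, model)` should return one Lao string per input sentence, in the same order. Sentences longer than `max_length` tokens are truncated, so very long text is better split into sentences first.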

## Author

Phan Minh Toan