---
license: apache-2.0
language:
- bn
metrics:
- wer
- cer
tags:
- seq2seq
- ipa
- bengali
- byt5
widget:
- text: <Narail> আমি সে বাবুর মামু বাড়ি গিছিলাম।
  example_title: Narail Text
- text: <Rangpur> এখন এই কুলো তার শেষ অই কুলো তার শেষ।
  example_title: Rangpur Text
- text: <Chittagong> খয়দে সিআরের এইল্লা কি অবস্থা!
  example_title: Chittagong Text
- text: <Kishoreganj> আটাইশ করছিলাম দের কানি ক্ষেত, ইবার মাইর কাইছি।
  example_title: Kishoreganj Text
- text: <Narsingdi> তারা তো ওই খারাপ খেইলাই আসে না।
  example_title: Narsingdi Text
- text: <Tangail> আর সব থেকে ফানি কথা হইতেছে দেখ?
  example_title: Tangail Text
---

# Regional Bengali text to IPA transcription - umt5-base

This is a fine-tuned version of [google/umt5-base](https://huggingface.co/google/mt5-base) for generating IPA transcriptions from regional Bengali text.

It was fine-tuned on the dataset from the competition [“ভাষামূল: মুখের ভাষার খোঁজে“](https://www.kaggle.com/competitions/regipa/overview) by Bengali.AI.

Test scores achieved so far:

- **Word error rate (WER)**: 0.27792885899543700
- **Character error rate (CER)**: 0.05638457089662550
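
For readers unfamiliar with these metrics, the sketch below shows how WER and CER are commonly computed: the edit distance between reference and prediction, divided by the reference length in words or characters. It assumes the standard Levenshtein-based definitions; the competition's exact scoring script may differ (for example in text normalization), and the function names here are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (lists or strings)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate; assumes a non-empty reference."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate over Unicode code points; assumes a non-empty reference."""
    return edit_distance(reference, hypothesis) / len(reference)


print(wer("a b c d", "a x c d"))  # one substituted word out of four
print(cer("abcd", "abxd"))        # one substituted character out of four
```

Lower is better for both metrics. Because `cer` counts Unicode code points, IPA symbols written with combining diacritics count as more than one character in this sketch.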

Supported district tokens:

- Kishoreganj
- Narail
- Narsingdi
- Chittagong
- Rangpur
- Tangail
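
The widget examples prefix each sentence with its district token in angle brackets, for example `<Narail> ...`. A small helper like the one below can build inputs in that shape and reject unsupported districts early. This is a minimal sketch: `SUPPORTED_DISTRICTS` and `build_input` are hypothetical names, not part of the model or of `transformers`.

```python
# District tokens listed in this model card.
SUPPORTED_DISTRICTS = (
    "Kishoreganj", "Narail", "Narsingdi", "Chittagong", "Rangpur", "Tangail",
)


def build_input(district, text):
    """Return the model input in the form '<district> <bengali_text>'."""
    if district not in SUPPORTED_DISTRICTS:
        raise ValueError(
            f"Unsupported district {district!r}; expected one of {SUPPORTED_DISTRICTS}"
        )
    return f"<{district}> {text.strip()}"


print(build_input("Rangpur", "bengali_text_here"))
```

Validating the district up front avoids silently sending the model a prefix it was not trained on.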

---

## Loading & using the model

```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("teamapocalypseml/ben2ipa-mt5base")
model = AutoModelForSeq2SeqLM.from_pretrained("teamapocalypseml/ben2ipa-mt5base")

# The format of the input text MUST BE: <district> <bengali_text>
text = "<district> bengali_text_here"
text_ids = tokenizer(text, return_tensors="pt").input_ids

# Generate the output token ids, then decode them into the IPA string
output_ids = model.generate(text_ids, max_length=512)
ipa = tokenizer.decode(output_ids[0], skip_special_tokens=True)
print(ipa)
```

## Using the pipeline

```python
# Use a pipeline as a high-level helper
import torch
from transformers import pipeline

device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = pipeline("text2text-generation", model="teamapocalypseml/ben2ipa-mt5base", device=device)

# `texts` must be a list of strings in the format: <district> <contents>
texts = ["<district> bengali_text_here"]
batch_size = 8  # example value; adjust to your hardware

outputs = pipe(texts, max_length=512, batch_size=batch_size)
```
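
The `text2text-generation` pipeline returns one dictionary per input, with the transcription stored under the `generated_text` key. The sketch below pairs each input with its output; `pair_with_ipa` is a hypothetical helper name, and the sample result is a stand-in rather than real model output.

```python
def pair_with_ipa(texts, outputs):
    """Pair each input string with its generated IPA transcription."""
    pairs = []
    for text, out in zip(texts, outputs):
        # If several sequences are returned per input, each item is a list
        # of dicts; this sketch keeps only the first candidate.
        if isinstance(out, list):
            out = out[0]
        pairs.append((text, out["generated_text"]))
    return pairs


# Stand-in for a real pipeline result (placeholder string, not model output).
example_outputs = [{"generated_text": "ipa_placeholder"}]
print(pair_with_ipa(["<Narail> bengali_text_here"], example_outputs))
```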

## Credits

Developed by [S M Jishanul Islam](https://huggingface.co/smji), [Sadia Ahmmed](https://huggingface.co/sadiaahmmed), and [Sahid Hossain Mustakim](https://huggingface.co/rhsm15).