# bengali-t5-base

**bengali-t5-base** is a T5 model pretrained from scratch on the Bengali portion of the mC4 dataset (the corpus used to train mT5). It uses the `T5-base` architecture and configuration.

This model was trained as part of the [Flax/JAX Community Week](https://discuss.huggingface.co/t/open-to-the-community-community-week-using-jax-flax-for-nlp-cv/7104), organized by [HuggingFace](https://huggingface.co/), with TPU usage sponsored by Google.

The model was trained on roughly 11B tokens (batch size 64, sequence length 512, 350k steps).
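
As a quick sanity check, the token count follows directly from that training configuration:

```
>>> 64 * 512 * 350_000  # batch size × sequence length × training steps
11468800000
```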

## Load tokenizer

```
>>> import transformers
>>> tokenizer = transformers.AutoTokenizer.from_pretrained("flax-community/bengali-t5-base")
>>> tokenizer.encode("আমি বাংলার গান গাই")
[93, 1912, 814, 5995, 3, 1]
>>> tokenizer.decode([93, 1912, 814, 5995, 3, 1])
'আমি বাংলার গান গাই </s>'
```

## Load model

```
>>> from transformers import T5Config, FlaxT5ForConditionalGeneration
>>> config = T5Config.from_pretrained("flax-community/bengali-t5-base")
>>> model = FlaxT5ForConditionalGeneration.from_pretrained("flax-community/bengali-t5-base", config=config)
```
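
A minimal sketch of running a forward pass with the loaded model, reusing the tokenizer and the Bengali sentence from above (the single decoder start token comes from the loaded config; T5 uses the pad token as decoder start):

```
>>> import jax.numpy as jnp
>>> inputs = tokenizer("আমি বাংলার গান গাই", return_tensors="np")
>>> decoder_input_ids = jnp.array([[config.decoder_start_token_id]])
>>> outputs = model(input_ids=inputs["input_ids"], decoder_input_ids=decoder_input_ids)
>>> outputs.logits.shape  # (batch, decoder sequence length, vocab size)
```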

The model was trained with a T5-style `de-noising` (span-corruption) objective, following the script [here](https://github.com/huggingface/transformers/blob/master/examples/flax/language-modeling/run_t5_mlm_flax.py). Currently this model doesn't have any generation capability. If you want it to generate text, please fine-tune it on the `prefix-LM` objective described in the [paper](https://arxiv.org/abs/1910.10683).
Please note that we haven't fine-tuned the model on any downstream task. If you fine-tune it on a downstream task, please let us know about it. Shoot us an email (sbmaruf at gmail dot com).
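
For intuition, the denoising objective replaces random spans of the input with sentinel tokens and trains the model to reconstruct them. A schematic example (the `<extra_id_*>` sentinel names follow the standard T5 convention and are an assumption about this tokenizer's vocabulary):

```
# Hypothetical span-corruption training pair (T5-style denoising):
# original: আমি বাংলার গান গাই
# input:    আমি <extra_id_0> গাই
# target:   <extra_id_0> বাংলার গান <extra_id_1>
```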

## Proposal
- [Project Proposal](https://discuss.huggingface.co/t/pretrain-t5-from-scratch-in-bengali/7121)

## Participants
- [Ibraheem Muhammad Moosa](https://huggingface.co/ibraheemmoosa)
- [Tasnim Mohiuddin](https://huggingface.co/tasnim)
- [Khalid Saifullah](https://huggingface.co/khalidsaifullaah)
- [Ritobrata Ghosh](https://huggingface.co/ghosh-r)
- [Tahsin Mayeesha](https://tahsin-mayeesha.github.io/)
- [M Saiful Bari](https://huggingface.co/sbmaruf)

## Useful links
- [Community Week timeline](https://discuss.huggingface.co/t/open-to-the-community-community-week-using-jax-flax-for-nlp-cv/7104#summary-timeline-calendar-6)
- [Community Week README](https://github.com/huggingface/transformers/blob/master/examples/research_projects/jax-projects/README.md)
- [Masked Language Modelling example scripts](https://github.com/huggingface/transformers/tree/master/examples/flax/language-modeling)
- [Model Repository](https://huggingface.co/flax-community/bengali-t5-base)