---
language:
- vie
pipeline_tag: text-generation

Trained: Fine-tuning
Config file: 2.7B
---
# Model Card for GPT-NeoX-2.7B-Vietnamese-finetune

This model was pretrained and fine-tuned on Vietnamese-language data. It is based on GPT-NeoX, a large language model developed by EleutherAI.


## Model Details

### Training Data
- **Pre-training:** CulturaX Vietnamese dataset (450 GB) + AI-Hub Vietnamese dataset (1.3 GB) + crawled Vietnamese Wikipedia dataset (630 MB) + viwik18 dataset (1.27 GB)
- **Fine-tuning:** 12 MB of Vietnamese question-and-answer data: Vietnamese Alpaca (16,412 rows) + Vietnamese QA dataset based on viwik18 (14,293 rows); a prompt-formatting sketch follows below.
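The card does not specify how the QA pairs were serialized into training prompts. Below is a minimal sketch of the common Alpaca-style formatting, assuming the standard `instruction`/`input`/`output` column names; the actual columns in these Vietnamese datasets may differ.

```python
def format_qa_example(example: dict) -> str:
    """Flatten one Alpaca-style QA record into a single training prompt.

    Assumes the standard Alpaca fields ("instruction", "input", "output");
    these field names are an assumption, not confirmed by the model card.
    """
    if example.get("input"):
        return (
            f"### Instruction:\n{example['instruction']}\n\n"
            f"### Input:\n{example['input']}\n\n"
            f"### Response:\n{example['output']}"
        )
    return (
        f"### Instruction:\n{example['instruction']}\n\n"
        f"### Response:\n{example['output']}"
    )
```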

### Training Hardware
Trained on an NVIDIA A100 40 GB GPU and a 48-core CPU. Training took 18 hours to reach 10 epochs.

### Hyperparameters
<figure style="width:30em">

| Hyperparameter         | Value       |
| ---------------------- | ----------- |
| num_train_epochs       | 10          |
| train_batch_size       | 2           |
| learning_rate          | 0.0001      |
| warmup_steps           | 1000        |
| weight_decay           | 0           |
</figure>
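For concreteness, here is a minimal sketch of how these values could map onto Hugging Face `TrainingArguments`, assuming the standard `Trainer` API was used for fine-tuning (the output directory is a hypothetical placeholder):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="gpt-neox-vi-finetune",  # hypothetical output path
    num_train_epochs=10,                # matches the 10 epochs noted above
    per_device_train_batch_size=2,      # train_batch_size
    learning_rate=1e-4,                 # 0.0001
    warmup_steps=1000,
    weight_decay=0.0,
)
```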

### How to use
The model can be loaded with the `AutoModelForCausalLM` class:
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("eunyounglee/GPT-NeoX-2.7B-Vietnamese-finetune")
model = AutoModelForCausalLM.from_pretrained("eunyounglee/GPT-NeoX-2.7B-Vietnamese-finetune")
```
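Once loaded, text can be generated with the standard `generate` API. The prompt and decoding settings below are illustrative choices, not recommendations from the model authors:

```python
prompt = "Việt Nam là"  # illustrative Vietnamese prompt ("Vietnam is")
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=50,
    do_sample=True,
    top_p=0.9,
    temperature=0.8,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```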