Update README.md
README.md
CHANGED
@@ -1,5 +1,5 @@
 ---
-license:
+license: mit
 ---
 # SEA-LION-7B-Instruct-C
 
@@ -24,18 +24,40 @@ SEA-LION is built on the robust MPT architecture and has a vocabulary size of 256,000.
 For tokenization, the model employs our custom SEABPETokenizer, which is specially tailored for SEA languages, ensuring optimal model performance.
 
 The pre-training data for the base SEA-LION model encompasses 980B tokens.
-The model was then further instruction-tuned on <b>commercially-permissive Indonesian data</b>.
+The model was then further instruction-tuned on a mixture of <b>commercially-permissive English and Indonesian data</b>.
 
 - **Developed by:** Products Pillar, AI Singapore
 - **Funded by:** Singapore NRF
 - **Model type:** Decoder
 - **Languages:** English, Chinese, Indonesian, Malay, Thai, Vietnamese, Filipino, Tamil, Burmese, Khmer, Lao
-- **License:**
+- **License:** MIT License
 
 ### Benchmark Performance
 
 Coming soon.
 
+### Usage and Limitations
+SEA-LION can be run using the 🤗 Transformers library.
+```python
+# Please use transformers==4.37.2
+
+from transformers import AutoModelForCausalLM, AutoTokenizer
+
+tokenizer = AutoTokenizer.from_pretrained("aisingapore/sealion7b-instruct-c", trust_remote_code=True)
+model = AutoModelForCausalLM.from_pretrained("aisingapore/sealion7b-instruct-c", trust_remote_code=True)
+
+prompt_template = "### USER:\n{human_prompt}\n\n### RESPONSE:\n"
+prompt = """Apa sentimen dari kalimat berikut ini?
+Kalimat: Buku ini sangat membosankan.
+Jawaban: """
+full_prompt = prompt_template.format(human_prompt=prompt)
+
+tokens = tokenizer(full_prompt, return_tensors="pt")
+output = model.generate(tokens["input_ids"], max_new_tokens=20, eos_token_id=tokenizer.eos_token_id)
+print(tokenizer.decode(output[0], skip_special_tokens=True))
+
+```
+
 ## Technical Specifications
 
 ### Model Architecture and Objective
@@ -50,35 +72,15 @@ SEA-LION is a decoder model using the MPT architecture.
 | Vocabulary | 256000 |
 | Sequence Length | 2048 |
 
-
 ### Tokenizer Details
 
 We sample 20M lines from the training data to train the tokenizer.<br>
 The framework for training is [SentencePiece](https://github.com/google/sentencepiece).<br>
 The tokenizer type is Byte-Pair Encoding (BPE).
 
-###
-
-```python
-# Please use transformers==4.34.1
-
-from transformers import AutoModelForCausalLM, AutoTokenizer
-
-tokenizer = AutoTokenizer.from_pretrained("aisingapore/sealion7b-instruct-c", trust_remote_code=True)
-model = AutoModelForCausalLM.from_pretrained("aisingapore/sealion7b-instruct-c", trust_remote_code=True)
-
-prompt_template = "### USER:\n{human_prompt}\n\n### RESPONSE:\n"
-prompt = """Apa sentimen dari kalimat berikut ini?
-Kalimat: Buku ini sangat membosankan.
-Jawaban: """
-full_prompt = prompt_template.format(human_prompt=prompt)
-
-tokens = tokenizer(full_prompt, return_tensors="pt")
-output = model.generate(tokens["input_ids"], max_new_tokens=20, eos_token_id=tokenizer.eos_token_id)
-print(tokenizer.decode(output[0], skip_special_tokens=True))
-
-```
+### Training Details
 
+Coming soon.
 
 ## The Team
 
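A note on the usage snippet above: the model expects plain-text prompts in the `### USER:` / `### RESPONSE:` format, so it is convenient to wrap prompt formatting, tokenization, and generation in one helper. The sketch below is illustrative only; it assumes the same model ID and `transformers` version as the snippet, and the helper name, the prompt-stripping step, and the default `max_new_tokens` value are not part of the model card.

```python
# A minimal sketch wrapping the card's usage snippet in a reusable helper.
# Assumptions: transformers==4.37.2 as noted above; the helper name and the
# generation settings below are illustrative, not part of the model card.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "aisingapore/sealion7b-instruct-c"
PROMPT_TEMPLATE = "### USER:\n{human_prompt}\n\n### RESPONSE:\n"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, trust_remote_code=True)

def generate_response(human_prompt: str, max_new_tokens: int = 20) -> str:
    """Format the prompt, generate, and return only the model's continuation."""
    full_prompt = PROMPT_TEMPLATE.format(human_prompt=human_prompt)
    tokens = tokenizer(full_prompt, return_tensors="pt")
    output = model.generate(
        tokens["input_ids"],
        max_new_tokens=max_new_tokens,
        eos_token_id=tokenizer.eos_token_id,
    )
    # Strip the prompt tokens so only newly generated text is returned.
    generated = output[0][tokens["input_ids"].shape[1]:]
    return tokenizer.decode(generated, skip_special_tokens=True)

print(generate_response(
    "Apa sentimen dari kalimat berikut ini?\nKalimat: Buku ini sangat membosankan.\nJawaban: "
))
```

`generate` decodes greedily with these arguments; sampling options such as `do_sample=True` and `temperature` can be passed through if more varied answers are wanted.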
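The architecture table lists a 256,000-token vocabulary and a 2,048-token sequence length. A minimal sketch for sanity-checking those numbers from the released artifacts, assuming the standard 🤗 `AutoConfig`/`AutoTokenizer` interfaces; the exact config attribute names for this MPT-based model (`max_seq_len` vs. `max_position_embeddings`) are an assumption, so the lookups below are defensive.

```python
# Sketch: cross-check the documented specs (vocabulary 256000, sequence length 2048).
# Assumption: the config exposes MPT-style attribute names; getattr() is used
# defensively in case they differ in the remote code.
from transformers import AutoConfig, AutoTokenizer

MODEL_ID = "aisingapore/sealion7b-instruct-c"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
config = AutoConfig.from_pretrained(MODEL_ID, trust_remote_code=True)

print("tokenizer vocab size:", len(tokenizer))  # expected: 256000
print("config vocab size:  ", getattr(config, "vocab_size", "n/a"))
print("max sequence length:", getattr(config, "max_seq_len",
                                      getattr(config, "max_position_embeddings", "n/a")))
```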
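The Tokenizer Details section states that the SEABPETokenizer was trained with SentencePiece as a BPE model on 20M sampled lines, but the full training recipe is not published here. The snippet below is only an illustrative SentencePiece BPE training call shaped like that description; the input file name, output prefix, and every option other than `model_type` and `vocab_size` are placeholders.

```python
# Illustrative only: a SentencePiece BPE training call matching the card's
# description (BPE model type, 256000 vocabulary, trained on sampled lines).
# "sampled_20M_lines.txt" and all options besides model_type/vocab_size are
# placeholders, not the actual SEABPETokenizer training recipe.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="sampled_20M_lines.txt",   # placeholder path to the sampled training lines
    model_prefix="seabpe_sketch",    # placeholder output prefix
    model_type="bpe",                # BPE, as stated in Tokenizer Details
    vocab_size=256000,               # matches the documented vocabulary size
    character_coverage=1.0,          # placeholder; the actual setting is not documented
)

# The resulting model can then be loaded and inspected:
sp = spm.SentencePieceProcessor(model_file="seabpe_sketch.model")
print(sp.encode("Buku ini sangat membosankan.", out_type=str))
```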