|
--- |
|
license: cc-by-sa-4.0 |
|
--- |
|
# SEA-LION-7B-Instruct-C |
|
|
|
SEA-LION is a collection of Large Language Models (LLMs) that have been pretrained and instruct-tuned for the Southeast Asia (SEA) region.
|
The models range in size from 3 billion to 7 billion parameters.
|
This is the card for the SEA-LION 7B Instruct (Commercial) model. |
|
|
|
For more details on the base model, please refer to the [base model's model card](https://huggingface.co/aisingapore/sealion7b). |
|
|
|
SEA-LION stands for <i>Southeast Asian Languages In One Network</i>. |
|
|
|
|
|
## Model Details |
|
|
|
### Model Description |
|
|
|
The SEA-LION model is a significant leap forward in the field of Natural Language Processing, specifically trained to understand the SEA regional context.
|
|
|
SEA-LION is built on the robust MPT architecture and has a vocabulary size of 256K. |
|
|
|
For tokenization, the model employs our custom SEABPETokenizer, which is specially tailored for SEA languages, ensuring optimal model performance. |
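
To see the tokenizer in action, it can be loaded on its own to confirm the vocabulary size and inspect how it segments SEA-language text (a minimal sketch; the sample sentence is only illustrative):

```python
from transformers import AutoTokenizer

# Load the custom SEABPETokenizer bundled with the model repository
tokenizer = AutoTokenizer.from_pretrained("aisingapore/sealion7b-instruct-c", trust_remote_code=True)

print(tokenizer.vocab_size)  # expected to be 256000
# Segment a sample Indonesian sentence ("Good morning, how are you?")
print(tokenizer.tokenize("Selamat pagi, apa kabar?"))
```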
|
|
|
The pre-training data for the base SEA-LION model encompasses 980B tokens. |
|
The model was then further instruction-tuned on <b>commercially permissive Indonesian data only</b>.
|
|
|
- **Developed by:** Products Pillar, AI Singapore |
|
- **Funded by:** Singapore NRF |
|
- **Model type:** Decoder |
|
- **Languages:** English, Chinese, Indonesian, Malay, Thai, Vietnamese, Filipino, Tamil, Burmese, Khmer, Lao |
|
- **License:** CC BY-SA 4.0
|
|
|
### Benchmark Performance |
|
|
|
Coming soon. |
|
|
|
## Technical Specifications |
|
|
|
### Model Architecture and Objective |
|
|
|
SEA-LION is a decoder model using the MPT architecture. |
|
|
|
| Parameter | SEA-LION 7B | |
|
|-----------------|:-----------:| |
|
| Layers | 32 | |
|
| d_model | 4096 | |
|
| head_dim | 32 | |
|
| Vocabulary | 256000 | |
|
| Sequence Length | 2048 | |
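
Because SEA-LION uses the MPT architecture, these values can be cross-checked against the model configuration. The sketch below assumes the standard MPT config attribute names (`n_layers`, `d_model`, `vocab_size`, `max_seq_len`):

```python
from transformers import AutoConfig

# Fetch only the configuration, not the full weights
config = AutoConfig.from_pretrained("aisingapore/sealion7b-instruct-c", trust_remote_code=True)

# Attribute names follow the MPT reference implementation (an assumption)
print(config.n_layers)     # 32
print(config.d_model)      # 4096
print(config.vocab_size)   # 256000
print(config.max_seq_len)  # 2048
```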
|
|
|
|
|
### Tokenizer Details |
|
|
|
We sample 20M lines from the training data to train the tokenizer.<br> |
|
The framework for training is [SentencePiece](https://github.com/google/sentencepiece).<br> |
|
The tokenizer type is Byte-Pair Encoding (BPE). |
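
For illustration only, a BPE tokenizer following this recipe could be trained with SentencePiece roughly as below; the input filename and settings are assumptions, not the exact training command used for the SEABPETokenizer:

```python
import sentencepiece as spm

# Train a BPE model on the sampled lines (filename is hypothetical)
spm.SentencePieceTrainer.train(
    input="sampled_20m_lines.txt",  # the 20M sampled lines described above
    model_prefix="seabpe",
    model_type="bpe",               # Byte-Pair Encoding, as stated above
    vocab_size=256000,              # matches the model's 256K vocabulary
)

# Load the trained model and tokenize a sample sentence
sp = spm.SentencePieceProcessor(model_file="seabpe.model")
print(sp.encode("Selamat pagi", out_type=str))
```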
|
|
|
### Example Usage |
|
|
|
```python
# Please use transformers==4.34.1

from transformers import AutoModelForCausalLM, AutoTokenizer

# trust_remote_code is required because the model uses custom MPT model code
tokenizer = AutoTokenizer.from_pretrained("aisingapore/sealion7b-instruct-c", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("aisingapore/sealion7b-instruct-c", trust_remote_code=True)

# Prompts must follow the USER/RESPONSE template the model was tuned on
prompt_template = "### USER:\n{human_prompt}\n\n### RESPONSE:\n"
# Indonesian for: "What is the sentiment of the following sentence?
# Sentence: This book is very boring. Answer:"
prompt = """Apa sentimen dari kalimat berikut ini?
Kalimat: Buku ini sangat membosankan.
Jawaban: """
full_prompt = prompt_template.format(human_prompt=prompt)

tokens = tokenizer(full_prompt, return_tensors="pt")
output = model.generate(tokens["input_ids"], max_new_tokens=20, eos_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
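
Note that `generate` returns the prompt tokens followed by the continuation, so the decoded string repeats the prompt. To print only the model's answer, the prompt portion can be sliced off first (a small optional variation on the example above):

```python
# Keep only the tokens produced after the prompt
generated = output[0][tokens["input_ids"].shape[1]:]
print(tokenizer.decode(generated, skip_special_tokens=True))
```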
|
|
|
|
|
## The Team |
|
|
|
Lam Wen Zhi Clarence<br> |
|
Leong Wei Qi<br> |
|
Li Yier<br> |
|
Liu Bing Jie Darius<br> |
|
Lovenia Holy<br> |
|
Montalan Jann Railey<br> |
|
Ng Boon Cheong Raymond<br> |
|
Ngui Jian Gang<br> |
|
Nguyen Thanh Ngan<br> |
|
Ong Tat-Wee David<br> |
|
Rengarajan Hamsawardhini<br> |
|
Susanto Yosephine<br> |
|
Tai Ngee Chia<br> |
|
Tan Choon Meng<br> |
|
Teo Jin Howe<br> |
|
Teo Eng Sipp Leslie<br> |
|
Teo Wei Yi<br> |
|
Tjhi William<br> |
|
Yeo Yeow Tong<br> |
|
Yong Xianbin<br> |
|
|
|
## Acknowledgements |
|
|
|
AI Singapore is a national programme supported by the National Research Foundation, Singapore and hosted by the National University of Singapore. |
|
Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not reflect the views of the National Research Foundation, Singapore.
|
|
|
## Contact |
|
|
|
For more information, please contact us using this [SEA-LION Inquiry Form](https://forms.gle/sLCUVb95wmGf43hi6).
|
|
|
[Link to SEA-LION's GitHub repository](https://github.com/aisingapore/sealion) |
|
|
|
## Disclaimer |
|
|
|
This is the repository for the commercial instruction-tuned model.
|
The model has _not_ been aligned for safety. |
|
Developers and users should perform their own safety fine-tuning and related security measures. |
|
In no event shall the authors be held liable for any claim, damages, or other liability arising from the use of the released weights and code.