xianbin committed
Commit 10f60d3
1 Parent(s): 3f96a16

Update README.md

Browse files
Files changed (1) hide show
  1. README.md +27 -25
README.md CHANGED
@@ -1,5 +1,5 @@
  ---
- license: cc-by-sa-4.0
+ license: mit
  ---
  # SEA-LION-7B-Instruct-C
 
@@ -24,18 +24,40 @@ SEA-LION is built on the robust MPT architecture and has a vocabulary size of 256,000.
  For tokenization, the model employs our custom SEABPETokenizer, which is specially tailored for SEA languages, ensuring optimal model performance.

  The pre-training data for the base SEA-LION model encompasses 980B tokens.
- The model was then further instruction-tuned on <b>commercially-permissive Indonesian data only</b>.
+ The model was then further instruction-tuned on a mixture of <b>commercially-permissive English and Indonesian data</b>.

  - **Developed by:** Products Pillar, AI Singapore
  - **Funded by:** Singapore NRF
  - **Model type:** Decoder
  - **Languages:** English, Chinese, Indonesian, Malay, Thai, Vietnamese, Filipino, Tamil, Burmese, Khmer, Lao
- - **License:** CC BY-SA 4.0 License
+ - **License:** MIT License

  ### Benchmark Performance

  Coming soon.

+ ### Usage and Limitations
+ SEA-LION can be run using the 🤗 Transformers library:
+ ```python
+ # Please use transformers==4.37.2
+
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ tokenizer = AutoTokenizer.from_pretrained("aisingapore/sealion7b-instruct-c", trust_remote_code=True)
+ model = AutoModelForCausalLM.from_pretrained("aisingapore/sealion7b-instruct-c", trust_remote_code=True)
+
+ prompt_template = "### USER:\n{human_prompt}\n\n### RESPONSE:\n"
+ prompt = """Apa sentimen dari kalimat berikut ini?
+ Kalimat: Buku ini sangat membosankan.
+ Jawaban: """
+ full_prompt = prompt_template.format(human_prompt=prompt)
+
+ tokens = tokenizer(full_prompt, return_tensors="pt")
+ output = model.generate(tokens["input_ids"], max_new_tokens=20, eos_token_id=tokenizer.eos_token_id)
+ print(tokenizer.decode(output[0], skip_special_tokens=True))
+ ```
+
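In the snippet above, `model.generate` returns the prompt tokens together with the newly generated tokens, so the decoded string begins with the full prompt (the Indonesian example asks for the sentiment of the sentence "This book is very boring."). A minimal sketch for keeping only the model's reply, using a hypothetical `extract_response` helper keyed on the template's `### RESPONSE:` marker:

```python
# Hypothetical helper, not part of the model card: the decoded string
# echoes the whole prompt, so split on the response marker and keep
# only what follows its last occurrence.
RESPONSE_MARKER = "### RESPONSE:\n"

def extract_response(decoded: str) -> str:
    return decoded.split(RESPONSE_MARKER)[-1].strip()

# Example with a made-up decoded output:
decoded = "### USER:\nApa sentimen dari kalimat berikut ini?\n\n### RESPONSE:\nNegatif."
print(extract_response(decoded))  # -> Negatif.
```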
  ## Technical Specifications

  ### Model Architecture and Objective
@@ -50,35 +72,15 @@ SEA-LION is a decoder model using the MPT architecture.
  | Vocabulary | 256000 |
  | Sequence Length | 2048 |

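A quick way to check that a downloaded checkpoint matches these numbers is to read its configuration without loading the weights; a minimal sketch, assuming the remote MPT-style config exposes `vocab_size` and `max_seq_len` under those names:

```python
from transformers import AutoConfig

# Fetch only the configuration; no model weights are downloaded.
config = AutoConfig.from_pretrained(
    "aisingapore/sealion7b-instruct-c", trust_remote_code=True
)
print(config.vocab_size)   # expected: 256000
print(config.max_seq_len)  # expected: 2048 (assumed MPT-style field name)
```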
-
  ### Tokenizer Details

  We sample 20M lines from the training data to train the tokenizer.<br>
  The framework for training is [SentencePiece](https://github.com/google/sentencepiece).<br>
  The tokenizer type is Byte-Pair Encoding (BPE).

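For illustration, a tokenizer of this kind could be trained with the SentencePiece Python API roughly as follows; this is a minimal sketch in which the file names are hypothetical and options such as `character_coverage` and `byte_fallback` are assumptions, not published SEABPETokenizer settings:

```python
import sentencepiece as spm

# Train a BPE model on the sampled lines (hypothetical input file,
# one line of text per row).
spm.SentencePieceTrainer.train(
    input="sampled_20m_lines.txt",
    model_prefix="seabpe",        # hypothetical output name
    model_type="bpe",             # BPE, as stated above
    vocab_size=256000,            # matches the vocabulary size in the table
    character_coverage=0.9995,    # assumption: typical for multilingual text
    byte_fallback=True,           # assumption: byte-level fallback for rare scripts
)

# Load the trained model and tokenize an Indonesian sample ("How are you?").
sp = spm.SentencePieceProcessor(model_file="seabpe.model")
print(sp.encode("Apa kabar?", out_type=str))
```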
- ### Example Usage
-
- ```python
- # Please use transformers==4.34.1
-
- from transformers import AutoModelForCausalLM, AutoTokenizer
-
- tokenizer = AutoTokenizer.from_pretrained("aisingapore/sealion7b-instruct-c", trust_remote_code=True)
- model = AutoModelForCausalLM.from_pretrained("aisingapore/sealion7b-instruct-c", trust_remote_code=True)
-
- prompt_template = "### USER:\n{human_prompt}\n\n### RESPONSE:\n"
- prompt = """Apa sentimen dari kalimat berikut ini?
- Kalimat: Buku ini sangat membosankan.
- Jawaban: """
- full_prompt = prompt_template.format(human_prompt=prompt)
-
- tokens = tokenizer(full_prompt, return_tensors="pt")
- output = model.generate(tokens["input_ids"], max_new_tokens=20, eos_token_id=tokenizer.eos_token_id)
- print(tokenizer.decode(output[0], skip_special_tokens=True))
-
- ```
+ ### Training Details
+
+ Coming soon.

  ## The Team
 