xianbin committed
Commit 10f60d3
1 Parent(s): 3f96a16

Update README.md

Browse files
Files changed (1) hide show
  1. README.md +27 -25
README.md CHANGED
@@ -1,5 +1,5 @@
  ---
- license: cc-by-sa-4.0
+ license: mit
  ---
  # SEA-LION-7B-Instruct-C
 
@@ -24,18 +24,40 @@ SEA-LION is built on the robust MPT architecture and has a vocabulary size of 256,000.
  For tokenization, the model employs our custom SEABPETokenizer, which is specially tailored for SEA languages, ensuring optimal model performance.

  The pre-training data for the base SEA-LION model encompasses 980B tokens.
- The model was then further instruction-tuned on <b>commercially-permissive Indonesian data only</b>.
+ The model was then further instruction-tuned on a mixture of <b>commercially-permissive English and Indonesian data</b>.

  - **Developed by:** Products Pillar, AI Singapore
  - **Funded by:** Singapore NRF
  - **Model type:** Decoder
  - **Languages:** English, Chinese, Indonesian, Malay, Thai, Vietnamese, Filipino, Tamil, Burmese, Khmer, Lao
- - **License:** CC BY-SA 4.0 License
+ - **License:** MIT License

  ### Benchmark Performance

  Coming soon.

+ ### Usage and Limitations
+ SEA-LION can be run using the 🤗 Transformers library:
+ ```python
+ # Please use transformers==4.37.2
+
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ tokenizer = AutoTokenizer.from_pretrained("aisingapore/sealion7b-instruct-c", trust_remote_code=True)
+ model = AutoModelForCausalLM.from_pretrained("aisingapore/sealion7b-instruct-c", trust_remote_code=True)
+
+ prompt_template = "### USER:\n{human_prompt}\n\n### RESPONSE:\n"
+ prompt = """Apa sentimen dari kalimat berikut ini?
+ Kalimat: Buku ini sangat membosankan.
+ Jawaban: """
+ full_prompt = prompt_template.format(human_prompt=prompt)
+
+ tokens = tokenizer(full_prompt, return_tensors="pt")
+ output = model.generate(tokens["input_ids"], max_new_tokens=20, eos_token_id=tokenizer.eos_token_id)
+ print(tokenizer.decode(output[0], skip_special_tokens=True))
+ ```
+
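In the snippet above, `model.generate` returns the prompt tokens together with the newly generated tokens, so the decoded string begins with the full prompt (the Indonesian example asks for the sentiment of the sentence "This book is very boring."). A minimal sketch for keeping only the model's reply, using a hypothetical `extract_response` helper keyed on the template's `### RESPONSE:` marker:

```python
# Hypothetical helper, not part of the model card: the decoded string
# echoes the whole prompt, so split on the response marker and keep
# only what follows its last occurrence.
RESPONSE_MARKER = "### RESPONSE:\n"

def extract_response(decoded: str) -> str:
    return decoded.split(RESPONSE_MARKER)[-1].strip()

# Example with a made-up decoded output:
decoded = "### USER:\nApa sentimen dari kalimat berikut ini?\n\n### RESPONSE:\nNegatif."
print(extract_response(decoded))  # -> Negatif.
```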
  ## Technical Specifications

  ### Model Architecture and Objective
@@ -50,35 +72,15 @@ SEA-LION is a decoder model using the MPT architecture.
  | Vocabulary | 256000 |
  | Sequence Length | 2048 |

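A quick way to check that a downloaded checkpoint matches these numbers is to read its configuration without loading the weights; a minimal sketch, assuming the remote MPT-style config exposes `vocab_size` and `max_seq_len` under those names:

```python
from transformers import AutoConfig

# Fetch only the configuration; no model weights are downloaded.
config = AutoConfig.from_pretrained(
    "aisingapore/sealion7b-instruct-c", trust_remote_code=True
)
print(config.vocab_size)   # expected: 256000
print(config.max_seq_len)  # expected: 2048 (assumed MPT-style field name)
```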
-
  ### Tokenizer Details

  We sample 20M lines from the training data to train the tokenizer.<br>
  The framework for training is [SentencePiece](https://github.com/google/sentencepiece).<br>
  The tokenizer type is Byte-Pair Encoding (BPE).

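For illustration, a tokenizer of this kind could be trained with the SentencePiece Python API roughly as follows; this is a minimal sketch in which the file names are hypothetical and options such as `character_coverage` and `byte_fallback` are assumptions, not published SEABPETokenizer settings:

```python
import sentencepiece as spm

# Train a BPE model on the sampled lines (hypothetical input file,
# one line of text per row).
spm.SentencePieceTrainer.train(
    input="sampled_20m_lines.txt",
    model_prefix="seabpe",        # hypothetical output name
    model_type="bpe",             # BPE, as stated above
    vocab_size=256000,            # matches the vocabulary size in the table
    character_coverage=0.9995,    # assumption: typical for multilingual text
    byte_fallback=True,           # assumption: byte-level fallback for rare scripts
)

# Load the trained model and tokenize an Indonesian sample ("How are you?").
sp = spm.SentencePieceProcessor(model_file="seabpe.model")
print(sp.encode("Apa kabar?", out_type=str))
```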
- ### Example Usage
-
- ```python
- # Please use transformers==4.34.1
-
- from transformers import AutoModelForCausalLM, AutoTokenizer
-
- tokenizer = AutoTokenizer.from_pretrained("aisingapore/sealion7b-instruct-c", trust_remote_code=True)
- model = AutoModelForCausalLM.from_pretrained("aisingapore/sealion7b-instruct-c", trust_remote_code=True)
-
- prompt_template = "### USER:\n{human_prompt}\n\n### RESPONSE:\n"
- prompt = """Apa sentimen dari kalimat berikut ini?
- Kalimat: Buku ini sangat membosankan.
- Jawaban: """
- full_prompt = prompt_template.format(human_prompt=prompt)
-
- tokens = tokenizer(full_prompt, return_tensors="pt")
- output = model.generate(tokens["input_ids"], max_new_tokens=20, eos_token_id=tokenizer.eos_token_id)
- print(tokenizer.decode(output[0], skip_special_tokens=True))
-
- ```
+ ### Training Details
+
+ Coming soon.

  ## The Team
 