dotw committed on
Commit b426412
1 Parent(s): 6c9ccc1

fix data distribution and names

Files changed (1)
  1. README.md +24 -25
README.md CHANGED
@@ -39,25 +39,24 @@ SEA-LION was trained on 980B tokens of the following data:
 
  | Data Source | Tokens | Percentage |
  |---------------------------|-------:|:----------:|
- | RefinedWeb - English | 571.3B | 62.80% |
- | mC4 - Chinese | 91.2B | 10.03% |
- | mC4 - Indonesian | 3.6B | 0.40% |
- | mC4 - Malay | 0.7B | 0.08% |
- | mC4 - Filipino | 1.3B | 0.15% |
- | mC4 - Burmese | 1.2B | 0.13% |
- | mC4 - Vietnamese | 63.4B | 6.97% |
- | mC4 - Thai | 10.8B | 1.19% |
- | mC4 - Lao | 0.3B | 0.03% |
- | mC4 - Khmer | 0.9B | 0.11% |
- | mC4 - Tamil | 2.5B | 0.28% |
- | the Stack - Python | 20.9B | 2.30% |
- | the Stack - Javascript | 55.6B | 6.11% |
- | the Stack - Shell | 1.3B | 0.14% |
- | the Stack - SQL | 6.4B | 0.70% |
- | the Stack - Markdown | 26.6B | 2.91% |
- | RedPajama - StackExchange | 21.2B | 2.33% |
- | RedPajama - ArXiv | 30.6B | 3.35% |
-
+ | RefinedWeb - English | 571.3B | 58.20% |
+ | mC4 - Chinese | 91.2B | 9.29% |
+ | mC4 - Indonesian | 14.7B | 1.50% |
+ | mC4 - Malay | 2.9B | 0.29% |
+ | mC4 - Filipino | 5.3B | 0.54% |
+ | mC4 - Burmese | 1.2B | 0.49% |
+ | mC4 - Vietnamese | 63.4B | 6.46% |
+ | mC4 - Thai | 21.6B | 2.20% |
+ | mC4 - Lao | 1.1B | 0.12% |
+ | mC4 - Khmer | 3.9B | 0.40% |
+ | mC4 - Tamil | 10.2B | 1.04% |
+ | the Stack - Python | 41.8B | 4.26% |
+ | the Stack - Javascript | 55.6B | 5.66% |
+ | the Stack - Shell | 2.5B | 0.26% |
+ | the Stack - SQL | 12.8B | 1.31% |
+ | the Stack - Markdown | 26.6B | 2.71% |
+ | RedPajama - StackExchange | 21.2B | 2.16% |
+ | RedPajama - ArXiv | 30.6B | 3.12% |
 
  ### Infrastructure
 
@@ -108,26 +107,26 @@ The tokenizer type is Byte-Pair Encoding (BPE).
 
  ## The Team
 
- Hamsawardhini Rengarajan<br>
  Lam Zhiwen Clarence<br>
  Leong Weiqi<br>
  Li Yier<br>
  Liu Darius<br>
  Lovenia Holy<br>
+ Montalan Jann Railey<br>
  Ng Raymond<br>
  Ngui Jian Gang<br>
+ Nguyen Ngan Thanh<br>
  Ong Tat-Wee David<br>
- Railey Montalan<br>
+ Rengarajan Hamsawardhini<br>
+ Susanto Yosephine<br>
  Tai Ngee Chia<br>
  Tan Choon Meng<br>
- Thanh Ngan Nguyen<br>
  Teo Jin Howe<br>
+ Teo Leslie<br>
  Teo Wei Yi<br>
- William Tjhi<br>
+ Tjhi William<br>
  Yeo Yeow Tong<br>
  Yong Xianbin<br>
- Yosephine<br>
- Leslie Teo<br>
 
  ## Contact
 