Sailor2
community
AI & ML interests
Open language models for South-East Asia
Organization Card
Sailor2
The Sailor2 community is to build open large language models optimized for multiple South-East Asian languages, such as Cebuano, Indonesian, Khmer, Lao, Minangkabau, Malay, Burmese, Sundanese, Javanese, Thai, and Vietnamese. The model will be continually pre-trained on a base model proficient in both Chinese and English, and its performance is expected to be comparable to the most advanced business models for the above South-East Asian languages.
Read more details about Sailor2 at https://sea-sailor.github.io/blog/sailor2/.
🔱 Sailor2 Models
- GitHub: All you need to know about using or fine-tuning Sailor2.
- Sailor2-1B: 1B base model continually pre-trained on 500B tokens from Qwen2.5-0.5B with model expansion.
- Sailor2-8B: 8B base model continually pre-trained on 500B tokens from Qwen2.5-7B with model expansion.
- Sailor2-20B: 20B base model continually pre-trained on 500B tokens from Qwen2.5-14B with model expansion.
- Sailor2-1B-Chat: 1B chat model after post-training on the 1B base model.
- Sailor2-8B-Chat: 8B chat model after post-training on the 8B base model.
- Sailor2-20B-Chat: 20B chat model after post-training on the 20B base model.
📚 Sailor2 Pre-training Dataset
- Sailor2-pretrain-data-stage1: 500B high quality data for model training
- Sailor2-pretrain-data-stage2: 50B extra high quality data for model annealing
- sea-commoncrawl: Cleaned and deduplicated commoncrawl
- sea-internet: Cleaned multilingual data from Internet Archive
- sea-pdf-text: Cleaned pdf data
- sea-synthetic: Translation dataset from Cosmopedia across multiple languages
- sea-commoncrawl-high-quality: extra cleaned and deduplicated commoncrawl
📑 Sailor2 Post-training Dataset
- sailor2-sft-stage1: Medium-Quality Instruction tuning dataset, supports English, Chinese and 16 SEA languages.
- sailor2-sft-stage2: High-Quality Instruction tuning dataset, supports English, Chinese and 16 SEA languages.
- sea-ultrafeedback: Preference optimization dataset, supports English, Chinese and 17 SEA languages.
🧐 Sailor2 Evaluation Dataset
- sea-wildbench: Chat model evaluation, supports 8 SEA languages.
💻 Sailor2 Codebase
- SailCraft Code: Data cleaning
- Regmix Code: Data mixture
- SailCompass Code: Few-shot evaluation
- Megatron Code: Pretraining-training (Coming Soon)
- OAT Code: Post-training
models
None public yet
datasets
14
sailor2/sailor2-pretrain-data-stage2
Viewer
•
Updated
•
51.7M
sailor2/sailor2-pretrain-data-stage1
Viewer
•
Updated
•
295M
sailor2/sailor2-sft-stage2
Viewer
•
Updated
•
531k
sailor2/sailor2-sft-stage1
Viewer
•
Updated
•
2.73M
sailor2/sea-wildbench
Viewer
•
Updated
•
1.02k
•
68
sailor2/sea-ultrafeedback
Viewer
•
Updated
•
58.5k
•
3
sailor2/sea-commoncrawl-high-quality
Viewer
•
Updated
•
17.4M
•
5
sailor2/community-dataset
Viewer
•
Updated
•
5.17M
•
1
sailor2/sea-pdf-text
Viewer
•
Updated
•
32.4M
•
3
sailor2/sea-commoncrawl
Viewer
•
Updated
•
494M
•
119