Sailor2


AI & ML interests

Open language models for South-East Asia

Sailor2

The Sailor2 community builds open large language models optimized for multiple South-East Asian languages, such as Cebuano, Indonesian, Khmer, Lao, Minangkabau, Malay, Burmese, Sundanese, Javanese, Thai, and Vietnamese. The models are continually pre-trained from a base model proficient in both Chinese and English, and their performance is expected to be comparable to the most advanced commercial models for the above South-East Asian languages.

Read more details about Sailor2 at https://sea-sailor.github.io/blog/sailor2/.


🔱 Sailor2 Models
  • GitHub: All you need to know about using or fine-tuning Sailor2.
  • Sailor2-1B: 1B base model continually pre-trained on 500B tokens from Qwen2.5-0.5B with model expansion.
  • Sailor2-8B: 8B base model continually pre-trained on 500B tokens from Qwen2.5-7B with model expansion.
  • Sailor2-20B: 20B base model continually pre-trained on 500B tokens from Qwen2.5-14B with model expansion.
  • Sailor2-1B-Chat: 1B chat model after post-training on the 1B base model.
  • Sailor2-8B-Chat: 8B chat model after post-training on the 8B base model.
  • Sailor2-20B-Chat: 20B chat model after post-training on the 20B base model.
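As a quick usage sketch, the chat models follow the standard 🤗 Transformers chat workflow. The repo id below (sail/Sailor2-1B-Chat) is an assumption; check the model cards for the exact names.

```python
# Minimal inference sketch for a Sailor2 chat model.
# The repo id "sail/Sailor2-1B-Chat" is assumed; adjust it to the model card you use.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "sail/Sailor2-1B-Chat"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Build the prompt with the model's chat template and generate a reply.
messages = [
    {"role": "user", "content": "Terjemahkan ke Bahasa Indonesia: Good morning!"},
]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=128)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```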

📚 Sailor2 Pre-training Dataset

📑 Sailor2 Post-training Dataset
  • sailor2-sft-stage1: Medium-quality instruction-tuning dataset; supports English, Chinese, and 16 SEA languages.
  • sailor2-sft-stage2: High-quality instruction-tuning dataset; supports English, Chinese, and 16 SEA languages.
  • sea-ultrafeedback: Preference-optimization dataset; supports English, Chinese, and 17 SEA languages.
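The post-training data can be pulled with the 🤗 Datasets library. The repo ids below (under a sailor2 organization) are assumptions; check the dataset cards for the exact names.

```python
# Sketch: loading the SFT and preference datasets with 🤗 Datasets.
# Repo ids are assumed ("sailor2/..."); verify them on the dataset cards.
from datasets import load_dataset

sft_stage1 = load_dataset("sailor2/sailor2-sft-stage1", split="train")
preference = load_dataset("sailor2/sea-ultrafeedback", split="train")

print(sft_stage1)     # column names and number of rows
print(preference[0])  # first preference example
```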

🧐 Sailor2 Evaluation Dataset
  • sea-wildbench: Chat model evaluation, supports 8 SEA languages.

💻 Sailor2 Codebase
