Huiyu (Yvette Chen)

upvoted a collection about 5 hours ago

🔱 Sailor2 Language Models

Collection

Sailing in South-East Asia with Inclusive Multilingual LLMs • 9 items • Updated 9 days ago • 18

upvoted a paper 21 days ago

Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models

Paper • 2405.01535 • Published May 2 • 119

liked 3 models 5 months ago

updated a dataset 5 months ago

sailor2/xcopa

Preview • Updated Jul 2 • 74

reacted to SivilTaram's post with 👍 6 months ago

Post

✨ Today, we're excited to share the full data processing script used in developing our Sailor models. The repo provides an end-to-end data processing pipeline for LLM training. 🚀

💻Code: https://github.com/sail-sg/sailcraft
🤗Model: sail/sailor-language-models-65e19a749f978976f1959825
📜Paper: Sailor: Open Language Models for South-East Asia (2404.03608)
🌐Homepage: https://sailorllm.github.io

# Overview 🔍

The pipeline consists of 4 stages🧹:
1️⃣ Initial data cleaning
2️⃣ Near deduplication
3️⃣ Exact deduplication
4️⃣ Second round of data cleaning

We paid special attention to data cleaning for South-East Asian (SEA) languages. 🌍
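
To make the stage ordering concrete, here is a minimal toy sketch of a four-stage pipeline in Python. It is not the sailcraft code: every function name, the word-bigram Jaccard similarity used for near deduplication, the line-level hashing used for exact deduplication, and the thresholds are assumptions chosen for illustration only.

```python
import hashlib


def clean(docs, min_words=3):
    """Collapse whitespace, drop empty lines, and drop documents that are too short."""
    cleaned = []
    for doc in docs:
        lines = [" ".join(line.split()) for line in doc.split("\n")]
        text = "\n".join(line for line in lines if line)
        if len(text.split()) >= min_words:
            cleaned.append(text)
    return cleaned


def shingles(text, n=2):
    """Return the set of lowercase word n-grams in a text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def near_dedup(docs, threshold=0.7):
    """Greedily keep a document unless it is too similar to one already kept."""
    kept, kept_shingles = [], []
    for doc in docs:
        s = shingles(doc)
        if all(jaccard(s, t) < threshold for t in kept_shingles):
            kept.append(doc)
            kept_shingles.append(s)
    return kept


def exact_dedup(docs):
    """Remove any line that already appeared verbatim earlier in the corpus."""
    seen, result = set(), []
    for doc in docs:
        fresh = []
        for line in doc.split("\n"):
            digest = hashlib.sha256(line.encode("utf-8")).hexdigest()
            if digest not in seen:
                seen.add(digest)
                fresh.append(line)
        result.append("\n".join(fresh))
    return result


def run_pipeline(docs):
    """Apply the four stages in order and record the document count after each."""
    stages = [
        ("initial_cleaning", clean),
        ("near_dedup", near_dedup),
        ("exact_dedup", exact_dedup),
        ("second_cleaning", clean),
    ]
    counts = {"input": len(docs)}
    for name, stage in stages:
        docs = stage(docs)
        counts[name] = len(docs)
    return docs, counts
```

In this toy version the second cleaning pass earns its place: removing repeated lines can leave a document empty or too short, and the final pass drops it. A production pipeline would typically swap the quadratic Jaccard comparison for something scalable such as MinHash.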

# Use Case ✨

With this codebase, you can clean your own dataset and:

✅ Get filtered data counts after each processing stage
✅ Easily configure language-specific cleaning rules (we support Arabic, Bengali, Catalan, Spanish, Basque, French, Hindi, Portuguese, Urdu, and optimize for English, Indonesian, Vietnamese, Chinese, Thai, Lao, Malay)
✅ Investigate what data was removed at each processing stage
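
As a rough illustration of how language-specific rules and removal auditing can fit together, here is a small Python sketch. The `RULES` table, the rule names, the thresholds, and the banned phrase are all hypothetical and do not reflect sailcraft's actual configuration format.

```python
from collections import Counter

# Hypothetical per-language rules; names and thresholds are illustrative only.
# Thai is written without spaces between words, so a character-based
# minimum is used there instead of a word count.
RULES = {
    "default": {"min_words": 5, "banned_phrases": []},
    "id": {"min_words": 5, "banned_phrases": ["klik di sini"]},
    "th": {"min_chars": 20, "banned_phrases": []},
}


def removal_reason(text, rules):
    """Return a short reason string if the text should be removed, else None."""
    if "min_words" in rules and len(text.split()) < rules["min_words"]:
        return "too_few_words"
    if "min_chars" in rules and len(text) < rules["min_chars"]:
        return "too_few_chars"
    lowered = text.lower()
    for phrase in rules.get("banned_phrases", []):
        if phrase in lowered:
            return f"banned_phrase:{phrase}"
    return None


def filter_with_audit(records):
    """Split (lang, text) records into kept items and an audit log of removals."""
    kept, removed = [], []
    for lang, text in records:
        rules = RULES.get(lang, RULES["default"])
        reason = removal_reason(text, rules)
        if reason is None:
            kept.append((lang, text))
        else:
            removed.append({"lang": lang, "text": text, "reason": reason})
    return kept, removed


def summarize(removed):
    """Count removals per reason, which gives per-rule filtered counts."""
    return Counter(item["reason"] for item in removed)
```

Keeping the removed records alongside a reason makes it easy to spot-check whether a rule is too aggressive for a given language before running it over a full corpus.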

# Acknowledgement 🙏

The main credit goes to @dreamerdeo, the first author of our Sailor paper ❤️! He put tremendous effort into the data processing pipeline, which underpins the model's strong performance. We believe this mini repo will be a valuable resource for researchers working on dataset curation for large language models. 🎉

Sharing the recipe openly aligns with our commitment to open language model development. 💪 This repo would not have been possible without contributions from the open community, including the BigScience data cleaning tool, the all-in-one deduplication tool by @chenghao, and the deduplication project from Google. 🧠

# What's Next 🚀

Share your thoughts or leave a comment on what you'd like the Sailor models to do! We also have some exciting news coming soon, so please stay tuned. 🚄