✨ Today, we're excited to share the full data processing script used to develop our Sailor models. The repo provides an end-to-end data processing pipeline for LLM training. 🚀
The pipeline consists of 4 stages 🧹:
1️⃣ Initial data cleaning
2️⃣ Near deduplication
3️⃣ Exact deduplication
4️⃣ Second round of data cleaning
A special focus was given to data cleaning for South-East Asian (SEA) languages 🌍
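To make the flow concrete, here is a minimal, self-contained Python sketch of how the four stages could be chained. Everything in it is our own illustration, not the repo's actual API: the function names, thresholds, and the `{"text": ...}` record format are hypothetical, and the toy pairwise near-dedup stands in for the MinHash-LSH approach a real pipeline would use at scale.

```python
import hashlib

def initial_clean(docs, min_chars=200, max_symbol_ratio=0.3):
    """Stage 1 (and 4): simple rule-based filters; thresholds are illustrative."""
    kept = []
    for doc in docs:
        text = doc["text"].strip()
        symbols = sum(not (c.isalnum() or c.isspace()) for c in text)
        if len(text) >= min_chars and symbols / max(len(text), 1) <= max_symbol_ratio:
            kept.append(doc)
    return kept

def near_dedup(docs, threshold=0.8, n=5):
    """Stage 2: toy near-dedup via Jaccard similarity over word n-gram shingles.
    A real pipeline would use MinHash LSH to avoid the O(n^2) comparisons."""
    def shingles(text):
        words = text.lower().split()
        return {" ".join(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))}
    kept, kept_sets = [], []
    for doc in docs:
        s = shingles(doc["text"])
        if all(len(s & t) / max(len(s | t), 1) < threshold for t in kept_sets):
            kept.append(doc)
            kept_sets.append(s)
    return kept

def exact_dedup(docs):
    """Stage 3: drop documents whose normalized text hashes identically."""
    seen, kept = set(), []
    for doc in docs:
        digest = hashlib.sha256(doc["text"].strip().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept

def run_pipeline(docs):
    stages = [
        ("1_initial_clean", initial_clean),
        ("2_near_dedup", near_dedup),
        ("3_exact_dedup", exact_dedup),
        ("4_second_clean", initial_clean),  # second pass reuses the same filters here
    ]
    for name, stage in stages:
        before = len(docs)
        docs = stage(docs)
        print(f"{name}: kept {len(docs)} of {before} docs")  # per-stage counts
    return docs
```

Running the pipeline prints how many documents survive each stage, which is the same kind of per-stage accounting the repo exposes.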
# Use Cases ✨
With this codebase, you can clean your own dataset and:
✅ Get filtered data counts after each processing stage
✅ Easily configure language-specific cleaning rules (we support Arabic, Bengali, Catalan, Spanish, Basque, French, Hindi, Portuguese, and Urdu, and optimize for English, Indonesian, Vietnamese, Chinese, Thai, Lao, and Malay); a config sketch follows this list
✅ Investigate what data was removed at each processing stage
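As a rough sketch of what language-specific rules could look like, here is a hypothetical rule table; the field names, thresholds, and language codes are invented for illustration and do not reflect the repo's actual configuration schema. The point it illustrates: unsegmented scripts such as Thai or Chinese need character-based rather than whitespace-word-based length rules.

```python
# Hypothetical per-language rule table (names and values are illustrative only).
CLEANING_RULES = {
    "default": {
        "count_unit": "word",    # whitespace-tokenized languages
        "min_length": 50,        # minimum words per document
        "max_digit_ratio": 0.3,  # drop number-heavy boilerplate
    },
    "th": {                      # Thai: no whitespace word boundaries,
        "count_unit": "char",    # so measure length in characters
        "min_length": 200,
        "max_digit_ratio": 0.3,
    },
    "zh": {                      # Chinese: character-level length check
        "count_unit": "char",
        "min_length": 100,
        "max_digit_ratio": 0.3,
    },
}

def rules_for(lang: str) -> dict:
    """Look up a language's rules, falling back to the default set."""
    return CLEANING_RULES.get(lang, CLEANING_RULES["default"])

print(rules_for("th")["count_unit"])  # -> "char"
```

Keeping rules in a plain per-language table like this makes it easy to add a new language without touching the pipeline code itself.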
# Acknowledgement 🙏
The main credit goes to @dreamerdeo, the first author of our Sailor paper ❤️! He put tremendous effort into the data processing pipeline, enabling the model's great performance. We believe this mini repo will be a valuable resource for researchers working on dataset curation for large language models. 🎉
Sharing the recipe openly aligns with our commitment to open language model development. 💪 This repo would not have been possible without contributions from the open community, including the BigScience data cleaning tool, the all-in-one deduplication tool by @chenghao, and the deduplication project from Google. 🧠
# What's Next 🚀
Share your thoughts or leave a comment on what you'd like the Sailor models to do! We also have some exciting news coming soon, so please stay tuned. 🚄