---
license: apache-2.0
---

# **csg-wukong-ablation-chinese-random** [[中文]](#chinese) [[English]](#english)

<a id="english"></a>

<p align="center">
<img width="600px" alt="OpenCSG" src="./Chinese Fineweb Edu Dataset logo.webp">
</p>

<p align="center"><a href="https://portal.opencsg.com/models">[OpenCSG Community]</a> <a href="https://github.com/OpenCSGs/Awesome-SLMs">[github]</a> <a href="https://cdn-uploads.huggingface.co/production/uploads/64c71b27d43e4dee51a8b31a/HU6vz21qKTEmUBCWqCFh9.jpeg">[wechat]</a> <a href="https://twitter.com/OpenCsg">[Twitter]</a></p>

The **Chinese Fineweb Edu** dataset is a meticulously constructed, high-quality Chinese pre-training corpus designed for natural-language-processing tasks in the education domain. It was built through a rigorous selection and deduplication pipeline that uses a scoring model, trained on a small amount of labeled data, to extract high-value education-related content from vast amounts of raw data, ensuring both quality and diversity. The final dataset contains approximately 90 million high-quality Chinese text entries, about 300 GB in total.

## Selection Method

During data selection, the **Chinese Fineweb Edu** dataset adopted a strategy similar to that of Fineweb-Edu, focusing on the educational value and content quality of the data. The specific steps are as follows:

1. **Educational Value Assessment**: Initially, the csg-wukong-enterprise scoring model was used to evaluate the educational value of the samples, assigning each a score from 0 to 5 based on the relevance and quality of its content. In this preliminary phase, we selected approximately 100,000 high-scoring samples.
2. **Scoring Model Training**: These 100,000 samples were then used to train a BERT model for scoring the larger pre-training dataset. This step ensured the model could reliably identify content with high educational value.
3. **Data Selection**: Next, the trained BERT model scored the raw data in full, and only samples scoring above 4 were retained. This selection significantly enhanced the quality and relevance of the dataset, ensuring its applicability in the educational domain.
4. **MinHash Deduplication**: To avoid the negative impact of duplicate content on model training, the dataset was deduplicated with the MinHash algorithm. This ensured the uniqueness of the data while preserving a diverse range of educational content.
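
The MinHash step above can be sketched in miniature. The pure-Python version below is an illustration only — the pipeline's actual implementation is not published, and production systems typically use a library such as `datasketch` with LSH bucketing rather than pairwise comparison:

```python
import hashlib

def minhash_signature(text: str, num_perm: int = 64, shingle_size: int = 3) -> list[int]:
    """Compute a MinHash signature over character shingles of `text`."""
    shingles = {text[i:i + shingle_size] for i in range(max(1, len(text) - shingle_size + 1))}
    signature = []
    for seed in range(num_perm):
        # One hash function per "permutation", derived from a seed prefix.
        signature.append(min(
            int.from_bytes(hashlib.md5(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingles
        ))
    return signature

def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of matching signature slots approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

def deduplicate(texts: list[str], threshold: float = 0.8) -> list[str]:
    """Keep a text only if no already-kept text is a near-duplicate of it."""
    kept, signatures = [], []
    for text in texts:
        sig = minhash_signature(text)
        if all(estimated_jaccard(sig, s) < threshold for s in signatures):
            kept.append(text)
            signatures.append(sig)
    return kept
```

Fixed-size signatures make near-duplicate detection cheap: two documents are compared via 64 integers instead of their full shingle sets.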

<p align="center">
<img width="900px" alt="OpenCSG" src="./Selection Method.png">
</p>

## Original Data Sources

The **Chinese Fineweb Edu** dataset is built upon a wide range of original data sources, encompassing several mainstream Chinese pre-training datasets. While these datasets vary in scale and coverage, meticulous selection and processing have turned them into a solid foundation for the **Chinese Fineweb Edu** dataset. The main data sources include:

- [CCI2-Data](https://huggingface.co/datasets/BAAI/CCI2-Data): a high-quality, reliable Chinese safety dataset that has undergone rigorous cleaning, deduplication, and quality filtering.
- [SkyPile-150B](https://huggingface.co/datasets/Skywork/SkyPile-150B): a large-scale, 150-billion-token dataset sourced from the Chinese internet, processed with sophisticated filtering and deduplication techniques.
- [IndustryCorpus](https://huggingface.co/datasets/BAAI/IndustryCorpus): a Chinese pre-training dataset covering multiple industries, containing 1TB of Chinese data, particularly suited to industry-specific model training.
- [Tele-AI](https://huggingface.co/datasets/Tele-AI/TeleChat-PTD): a high-quality, large-scale Chinese dataset extracted from the pre-training corpus of the TeleChat telecom large language model, containing approximately 270 million strictly filtered and deduplicated pure-Chinese texts.
- [MAP-CC](https://huggingface.co/datasets/m-a-p/MAP-CC): a massive Chinese pre-training corpus combining high-quality data from multiple sources, specifically optimized for training Chinese language models.

<p align="center">
<img width="900px" alt="OpenCSG" src="./Data Sources.png">
</p>

These diverse data sources not only provide a rich content foundation for the **Chinese Fineweb Edu** dataset but also enhance its broad applicability and comprehensiveness by integrating data from different fields and sources. This integration ensures that the model can maintain excellent performance and high-quality output when faced with diverse educational scenarios.

<p align="center">
<img width="600px" alt="OpenCSG" src="./data.png">
</p>

## Scoring Model

We used OpenCSG's enterprise-grade large language model, csg-wukong-enterprise, as the scoring model. Through prompt design, the model scored each pre-training sample on a 0-5 scale across six levels:

0 points: the webpage provides no educational value whatsoever and consists entirely of irrelevant information (e.g., advertisements or promotional material).

1 point: the webpage offers some basic information related to educational topics, even if it includes some unrelated or non-academic content (e.g., advertisements and promotional material).

2 points: the webpage contains certain education-related elements but does not align well with educational standards. It might mix educational content with non-educational material, provide a shallow overview of potentially useful topics, or present information in an incoherent writing style.

3 points: the webpage is suitable for educational use and introduces key concepts related to school curricula. The content is coherent but may not be comprehensive, or might include some irrelevant information. It could resemble the introductory section of a textbook or a basic tutorial, suitable for learning but with notable limitations, such as covering concepts too complex for middle-school students.

4 points: the webpage is highly relevant and beneficial for educational purposes at or below the high-school level, exhibiting a clear and consistent writing style. It might resemble a textbook chapter or tutorial, providing substantial educational content, including exercises and solutions, with minimal irrelevant information and concepts that are not overly complex for middle-school students. The content is coherent and well focused, and valuable for structured learning.

5 points: the excerpt demonstrates excellent educational value and is entirely suitable for elementary or middle-school instruction. It follows a detailed reasoning process, is written in an easy-to-understand style, and provides deep, comprehensive insight into the subject without any non-educational or overly complex content.

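The six-level rubric lends itself to a prompt template. The sketch below is purely illustrative — the actual prompt used with csg-wukong-enterprise has not been published, and the condensed criteria strings are paraphrases of the levels above:

```python
# Condensed paraphrases of the six rubric levels (illustrative, not the real prompt).
RUBRIC = {
    0: "No educational value; entirely irrelevant content such as ads or promotions.",
    1: "Some basic information on educational topics, mixed with unrelated content.",
    2: "Education-related elements, but poorly aligned with educational standards.",
    3: "Suitable for educational use; coherent but not comprehensive.",
    4: "Highly relevant for education up to high-school level; clear and consistent.",
    5: "Excellent educational value; suitable for elementary or middle-school teaching.",
}

def build_scoring_prompt(sample: str) -> str:
    """Assemble a 0-5 rubric prompt for a single pre-training sample."""
    criteria = "\n".join(f"{score}: {desc}" for score, desc in sorted(RUBRIC.items()))
    return (
        "Rate the educational value of the following text on a 0-5 scale.\n"
        f"Scoring criteria:\n{criteria}\n\n"
        f"Text:\n{sample}\n\n"
        "Answer with a single integer from 0 to 5."
    )
```

Constraining the answer to a single integer makes the model's output trivially parseable when labeling samples at scale.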
We recorded 100,000 samples together with their scores to create the `fineweb_edu_classifier_chinese_data` dataset. Using these scores as labels, we trained a Chinese BERT model, `fineweb_edu_classifier_chinese`, which assigns each input text a score of 0-5. We plan to further optimize this scoring model, and in the future the OpenCSG algorithm team will open-source the `fineweb_edu_classifier_chinese_data` dataset and the `fineweb_edu_classifier_chinese` scoring model to further promote community development and collaboration. The dataset contains meticulously annotated and scored educational text, providing high-quality training data for researchers and developers.

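With such a classifier in hand, the retention rule from the selection pipeline (keep only samples scoring above 4) reduces to a simple filter. In this sketch, `score_fn` is a hypothetical stand-in for whatever inference wrapper exposes the trained model; its real API is not described here:

```python
from typing import Callable, Iterable, Iterator

def filter_by_score(
    texts: Iterable[str],
    score_fn: Callable[[str], float],  # stand-in for the trained classifier
    threshold: float = 4.0,
) -> Iterator[str]:
    """Yield only the texts whose predicted educational score exceeds `threshold`."""
    for text in texts:
        if score_fn(text) > threshold:
            yield text
```

Returning a generator keeps memory flat when filtering corpus-scale data in a streaming fashion.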
**We warmly invite developers and researchers interested in this field to follow and engage with the community, working together to advance the technology. Stay tuned for the open-source release of the dataset!**

## License Agreement

Use of the Chinese Fineweb Edu dataset requires adherence to the OpenCSG Community License. The dataset supports commercial use. If you plan to use the OpenCSG model or its derivatives for commercial purposes, you must comply with the terms and conditions of the OpenCSG Community License as well as the Apache 2.0 License. For commercial use, please email [email protected] to obtain permission.

<a id="chinese"></a>

<p>

</p>

# csg-wukong-ablation-chinese-random

<p align="center">
<img width="600px" alt="OpenCSG" src="./Chinese Fineweb Edu Dataset logo.webp">
</p>

<p align="center"><a href="https://opencsg.com/models">[OpenCSG Community]</a> <a href="https://github.com/OpenCSGs/Awesome-SLMs">[github]</a> <a href="https://cdn-uploads.huggingface.co/production/uploads/64c71b27d43e4dee51a8b31a/HU6vz21qKTEmUBCWqCFh9.jpeg">[wechat]</a> <a href="https://twitter.com/OpenCsg">[Twitter]</a></p>

The **Chinese Fineweb Edu** dataset is a meticulously constructed, high-quality Chinese pre-training corpus designed for natural-language-processing tasks in the education domain. Through a rigorous selection and deduplication pipeline that uses a scoring model trained on a small amount of labeled data, it extracts high-value education-related content from vast amounts of raw data, ensuring both quality and diversity. The final dataset contains approximately 90 million high-quality Chinese text entries, about 300 GB in total.

## Selection Method

During data selection, the Chinese Fineweb Edu dataset adopted a selection strategy similar to that of Fineweb-Edu, focusing on the educational value and content quality of the data. The specific steps are as follows:

1. **Educational Value Assessment**: First, OpenCSG's enterprise-grade csg-wukong-enterprise model was used to assess the educational value of the samples, assigning each a score from 0 to 5 based on the relevance and quality of its content. In this preliminary phase, we selected roughly 100,000 high-scoring samples.

2. **Scoring Model Training**: These 100,000 samples were then used to train a BERT model for scoring the larger pre-training dataset. This step ensured the model could reliably identify content with high educational value.

3. **Data Selection**: Next, the trained BERT model scored the raw data in full, retaining only samples scoring above 4. This selection greatly improved the quality and relevance of the dataset, ensuring its value for educational applications.

4. **MinHash Deduplication**: To avoid the negative impact of duplicate content on model training, all data was deduplicated with the MinHash algorithm. This preserved the uniqueness of the data while retaining diverse educational content.

<p align="center">
<img width="900px" alt="OpenCSG" src="./Selection Method.png">
</p>

## Original Data Sources

The Chinese Fineweb Edu dataset draws on a wide range of original data sources, covering several mainstream Chinese pre-training datasets. Although these datasets differ in scale and domain coverage, careful selection and processing turned them into a solid foundation for the Chinese Fineweb Edu dataset. The main sources include:

- [CCI2-Data](https://huggingface.co/datasets/BAAI/CCI2-Data): a high-quality, reliable Chinese safety dataset that has undergone rigorous cleaning, deduplication, and quality filtering.
- [SkyPile-150B](https://huggingface.co/datasets/Skywork/SkyPile-150B): a large-scale, 150-billion-token dataset sourced from the Chinese internet, processed with sophisticated filtering and deduplication techniques.
- [IndustryCorpus](https://huggingface.co/datasets/BAAI/IndustryCorpus): a Chinese pre-training dataset covering multiple industries, containing 1TB of Chinese data, particularly suited to industry-specific model training.
- [Tele-AI](https://huggingface.co/datasets/Tele-AI/TeleChat-PTD): a high-quality, large-scale Chinese dataset extracted from the pre-training corpus of the TeleChat telecom large language model, containing approximately 270 million strictly filtered and deduplicated pure-Chinese texts.
- [MAP-CC](https://huggingface.co/datasets/m-a-p/MAP-CC): a massive Chinese pre-training corpus combining high-quality data from multiple sources, specifically optimized for training Chinese language models.

<p align="center">
<img width="900px" alt="OpenCSG" src="./Data Sources.png">
</p>

These diverse data sources not only provide a rich content foundation for the **Chinese Fineweb Edu** dataset but also enhance its broad applicability and comprehensiveness by integrating data from different fields and sources. This integration ensures that the model can maintain excellent performance and high-quality output when faced with diverse educational scenarios.

<p align="center">
<img width="600px" alt="OpenCSG" src="./data.png">
</p>

## Scoring Model

We used OpenCSG's enterprise-grade csg-wukong-enterprise model as the scoring model. Through prompt design, the model scored each pre-training sample on a 0-5 scale across six levels:

0 points: the webpage provides no educational value whatsoever and consists entirely of irrelevant information (e.g., advertisements or promotional material).

1 point: the webpage offers some basic information related to educational topics, even if it includes some unrelated or non-academic content (e.g., advertisements and promotional material).

2 points: the webpage contains certain education-related elements but does not align well with educational standards. It might mix educational content with non-educational material, provide a shallow overview of potentially useful topics, or present information in an incoherent writing style.

3 points: the webpage is suitable for educational use and introduces key concepts related to school curricula. The content is coherent but may not be comprehensive, or might include some irrelevant information. It could resemble the introductory section of a textbook or a basic tutorial, suitable for learning but with notable limitations, such as covering concepts too complex for middle-school students.

4 points: the webpage is highly relevant and beneficial for educational purposes at or below the high-school level, exhibiting a clear and consistent writing style. It might resemble a textbook chapter or tutorial, providing substantial educational content, including exercises and solutions, with minimal irrelevant information and concepts that are not overly complex for middle-school students. The content is coherent and well focused, and valuable for structured learning.

5 points: the excerpt demonstrates excellent educational value and is entirely suitable for elementary or middle-school instruction. It follows a detailed reasoning process, is written in an easy-to-understand style, and provides deep, comprehensive insight into the subject without any non-educational or overly complex content.

We recorded 100,000 samples together with their scores to form `fineweb_edu_classifier_chinese_data`. Using these scores as labels, we trained a Chinese BERT model, `fineweb_edu_classifier_chinese`, which assigns each input text a score of 0-5. We will continue to optimize this scoring model, and in the future the OpenCSG algorithm team will open-source the `fineweb_edu_classifier_chinese_data` dataset and the `fineweb_edu_classifier_chinese` scoring model to further promote community development and collaboration. The dataset contains meticulously annotated and scored educational text, providing high-quality training data for researchers and developers.

**We warmly invite developers and researchers interested in this field to follow and engage with the community, working together to advance the technology. Stay tuned for the open-source release of the dataset!**

## License Agreement

Use of the Chinese Fineweb Edu dataset requires adherence to the OpenCSG Community License. The dataset supports commercial use. If you plan to use the OpenCSG model or its derivatives for commercial purposes, you must comply with the terms and conditions of the OpenCSG Community License as well as the Apache 2.0 License. For commercial use, please email [email protected] to obtain permission.