Update README.md
README.md (CHANGED)

````diff
@@ -13,7 +13,7 @@ license_link: >-
 <div align="center"><img src="misc/skywork_logo.jpeg" width="550"/></div>
 
 <p align="center">
-👨💻 <a href="https://github.com/SkyworkAI/Skywork" target="_blank">Github</a> • 🤗 <a href="https://huggingface.co/Skywork" target="_blank">Hugging Face</a>• 🤖 <a href="https://modelscope.cn/organization/Skywork" target="_blank">ModelScope</a> • 💬 <a href="https://github.com/SkyworkAI/Skywork/blob/main/misc/wechat.png?raw=true" target="_blank">WeChat</a>• 📜<a href="
+👨‍💻 <a href="https://github.com/SkyworkAI/Skywork" target="_blank">Github</a> • 🤗 <a href="https://huggingface.co/Skywork" target="_blank">Hugging Face</a> • 🤖 <a href="https://modelscope.cn/organization/Skywork" target="_blank">ModelScope</a> • 💬 <a href="https://github.com/SkyworkAI/Skywork/blob/main/misc/wechat.png?raw=true" target="_blank">WeChat</a> • 📜 <a href="http://arxiv.org/abs/2310.19341" target="_blank">Tech Report</a>
 
 </p>
 
@@ -44,9 +44,9 @@ license_link: >-
 **Skywork-13B-Base**: The model was trained on a high-quality cleaned dataset consisting of 3.2 trillion tokens of multilingual data (mainly Chinese and English) and code. It has demonstrated the best performance among models of similar scale in various evaluations and benchmark tests.
 
 
-如果您希望了解更多的信息,如训练方案,评估方法,请参考我们的[技术报告](
+如果您希望了解更多的信息,如训练方案,评估方法,请参考我们的[技术报告](http://arxiv.org/abs/2310.19341),[Skymath](https://arxiv.org/abs/2310.16713)论文,[SkyworkMM](https://github.com/will-singularity/Skywork-MM/blob/main/skywork_mm.pdf)论文。
 
-If you are interested in more training and evaluation details, please refer to our [technical report](
+If you are interested in more training and evaluation details, please refer to our [technical report](http://arxiv.org/abs/2310.19341), [Skymath](https://arxiv.org/abs/2310.16713) paper and [SkyworkMM](https://github.com/will-singularity/Skywork-MM/blob/main/skywork_mm.pdf) paper.
 
 ## 训练数据(Training Data)
 我们精心搭建了数据清洗流程对文本中的低质量数据、有害信息、敏感信息进行清洗过滤。我们的Skywork-13B-Base模型是在清洗后的3.2TB高质量中、英、代码数据上进行训练,其中英文占比52.2%,中文占比39.6%,代码占比8%,在兼顾中文和英文上的表现的同时,代码能力也能有保证。
@@ -325,11 +325,13 @@ The community usage of Skywork model requires [Skywork Community License](https:
 
 If you find our work helpful, please feel free to cite our paper~
 ```
-@
-
-
-
-
+@misc{wei2023skywork,
+      title={Skywork: A More Open Bilingual Foundation Model},
+      author={Tianwen Wei and Liang Zhao and Lichang Zhang and Bo Zhu and Lijie Wang and Haihua Yang and Biye Li and Cheng Cheng and Weiwei Lü and Rui Hu and Chenxia Li and Liu Yang and Xilin Luo and Xuejie Wu and Lunan Liu and Wenjun Cheng and Peng Cheng and Jianhao Zhang and Xiaoyu Zhang and Lei Lin and Xiaokun Wang and Yutuan Ma and Chuanhai Dong and Yanqi Sun and Yifu Chen and Yongyi Peng and Xiaojuan Liang and Shuicheng Yan and Han Fang and Yahui Zhou},
+      year={2023},
+      eprint={2310.19341},
+      archivePrefix={arXiv},
+      primaryClass={cs.CL}
 }
 ```
 
````
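The corpus mix quoted in the training-data section (52.2% English, 39.6% Chinese, 8% code out of a 3.2T cleaned corpus) can be turned into absolute sizes with a quick back-of-envelope sketch. The figures below are taken straight from the README; treating the remaining ~0.2% as an unlabelled "other" bucket is an assumption, since the README does not break it down:

```python
# Back-of-envelope breakdown of the cleaned training corpus described above:
# 3.2T total, of which 52.2% English, 39.6% Chinese, and 8% code.
total = 3.2  # in trillions (the Chinese text states 3.2TB)

shares = {"English": 0.522, "Chinese": 0.396, "code": 0.080}

breakdown = {name: round(total * share, 3) for name, share in shares.items()}
# ~0.2% of the corpus is not covered by the stated split (assumption: "other")
other = round(total * (1 - sum(shares.values())), 3)

print(breakdown)  # {'English': 1.67, 'Chinese': 1.267, 'code': 0.256}
print(other)      # 0.006
```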