File size: 6,859 Bytes
ff465fa 3a01a1b ff465fa 3a01a1b b1c9ecf 3a01a1b 1b0edc1 3a01a1b b1c9ecf 3a01a1b 1b0edc1 3a01a1b 1b0edc1 3a01a1b 1b0edc1 3a01a1b 491c121 3a01a1b b1c9ecf 3a01a1b b1c9ecf 3a01a1b 1b0edc1 f9c4fd4 3a01a1b 491c121 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 |
---
license: bigscience-bloom-rail-1.0
language:
- en
- zht
pipeline_tag: text-generation
---
<h1 style='text-align: center '>BLOOM-zh</h1>
<h2 style='text-align: center '><em>Open-access Multilingual Language Model based on BLOOM</em> </h2>
<h3 style='text-align: center '>Model Card</h3>
Version 1.0 / 20.Feb.2023
This model is a joint collaboration between CKIP lab at Acedemia Sinica ([website](https://ckip.iis.sinica.edu.tw/)), MediaTek Research ([website](https://www.mtkresearch.com/)), and National Academy for Educational Research ([website](https://www.naer.edu.tw/)).
## Table of Contents
1. [Model Details](#model-details)
2. [Uses](#uses)
3. [Training Data](#training-data)
4. [Risks and Limitations](#risks-and-limitations)
5. [Evaluation](#evaluation)
6. [Recommendations](#recommendations)
7. [Glossary and Calculations](#glossary-and-calculations)
8. [More Information](#more-information)
9. [Model Card Authors](#model-card-authors)
## Model Details
BLOOM-zh is a modification from [BLOOMZ](https://huggingface.co/bigscience/bloomz).
BLOOM-zh is trained extendedly on larger amounts of Traditional Chinese text data while it still maintains its pretrained English ability.
### Basics
*This section provides information for anyone who wants to know about the model.*
<details>
<summary>Click to expand</summary> <br/>
**Developed by:** MediaTek Research
**Model Type:** Transformer-based Language Model
**Version:** 1.0.0
**Languages:** Multiple; see [training data](#training-data)
**License:** MEDIATEK RESEARCH License ([link](https://huggingface.co/ckip-joint/bloom-1b1-zh/blob/main/LICENSE_MR.md)) and RAIL License v1.0 ([link](https://huggingface.co/spaces/bigscience/license))
**Release Date Estimate:** Wednesday, 22.February.2023
**Send Questions to:** [email protected]
**Cite as:** MediaTek Research, MediaTek Research Open-access Multilingual Language Model based on BLOOM. International, February 2023.
**Organizations of contributors:**
* MediaTek Research
* Academia Sinica
* National Academy for Educational Research
</details>
### Technical Specifications
*This section provides information for people who work on model development.*
<details>
<summary>Click to expand</summary><br/>
**Model Architecture:** Modified from Megatron-LM GPT2 (see [paper](https://arxiv.org/abs/1909.08053), [BLOOM Megatron code](https://github.com/bigscience-workshop/Megatron-DeepSpeed)):
* Decoder-only architecture
* Layer normalization applied to word embeddings layer (`StableEmbedding`; see [code](https://github.com/facebookresearch/bitsandbytes), [paper](https://arxiv.org/pdf/2110.02861.pdf))
* ALiBI positional encodings (see [paper](https://arxiv.org/pdf/2108.12409.pdf)), with GeLU activation functions
* 1,065,314,304 parameters:
* 385,351,680 embedding parameters
* 24 layers, 16 attention heads
* Hidden layers are 1536-dimensional
* Sequence length of 2048 tokens used (see [BLOOM tokenizer](https://huggingface.co/bigscience/tokenizer), [tokenizer description](#tokenization))
**Objective Function:** Cross Entropy with mean reduction (see [API documentation](https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html#torch.nn.CrossEntropyLoss)).
**Compute infrastructure:**
* Hardware: 2 A6000 48GB GPUs (1 node):
* Software:
* Bigscience Megatron-DeepSpeed ([Github link](https://github.com/bigscience-workshop/Megatron-DeepSpeed))
* Megatron-DeepSpeed ([Github link](https://github.com/bigscience-workshop/Megatron-DeepSpeed))
* DeepSpeed ([Github link](https://github.com/microsoft/DeepSpeed))
* PyTorch (pytorch-1.12 w/ CUDA-11.3; see [Github link](https://github.com/pytorch/pytorch))
* apex ([Github link](https://github.com/NVIDIA/apex))
#### **Training**
Details are provided in the [paper](https://arxiv.org/).
- Dates: Feb. 2023
#### **Tokenization**
The BLOOM tokenizer ([link](https://huggingface.co/bigscience/tokenizer)) is a learned subword tokenizer trained using:
- A byte-level Byte Pair Encoding (BPE) algorithm
- A simple pre-tokenization rule, no normalization
- A vocabulary size of 250,680
It was trained on a subset of a preliminary version of the corpus using alpha-weighting per language.
</details>
### Environmental Impact
<details>
<summary>Click to expand</summary><br/>
Please refer to [Model card](https://huggingface.co/bigscience/bloom-1b1#model-details).
</details>
<p> </p>
## Uses
*This section addresses questions around how the model is intended to be used, discusses the foreseeable users of the model (including those affected by the model), and describes uses that are considered out of scope or misuse of the model.
It provides information for anyone considering using the model or who is affected by the model.*
<details>
<summary>Click to expand</summary><br/>
Please refer to [Model card](https://huggingface.co/bigscience/bloom-1b1#uses).
</details>
<p> </p>
## Training Data
*This section provides a high-level overview of the training data. It is relevant for anyone who wants to know the basics of what the model is learning.*
<details>
<summary>Click to expand</summary><br/>
We trained the 1B1 parameter model on a total of 6 Billion tokens mainly crawled from the internet and provided from National Academy for Educational Research. 75% of the training data is Traditional Chinese, 25% is English.
Details are provided in the [paper](https://arxiv.org/).
</details>
</details>
<p> </p>
## Risks and Limitations
*This section identifies foreseeable harms and misunderstandings.*
<details>
<summary>Click to expand</summary><br/>
Please refer to [Model card](https://huggingface.co/bigscience/bloom-1b1#risks-and-limitations).
</details>
<p> </p>
### Factors
*This section lists some different aspects of BLOOM models. Its focus is on those aspects that are likely to give rise to high variance in model behavior.*
- The model is trained on Traditional Chinese and English. However, the pretrained weights capture more than 40 different languages.
- The model is trained on web crawled data, news articles, novels, knowledge sources (encyclopedia, education sector) and instructions
</details>
<p> </p>
## Recommendations
*This section provides information on warnings and potential mitigations.*
<details>
<summary>Click to expand</summary><br/>
Please refer to [Model card](https://huggingface.co/bigscience/bloom-1b1#recommendations).
</details>
<p> </p>
## Model Card Authors
*Ordered roughly chronologically and by amount of time spent.*
Philipp Ennen, Po-Chun Hsu, Chan-Jan Hsu, Chang-Le Liu, Yin-Hsiang Liao, Chin-Tung Lin, Jezabel Rodriguez Garcia, Federica Freddi, Da-Shan Shiu, Wei-Yun Ma
<!-- # Bloom_eval -->
|