metadata

license: bigscience-bloom-rail-1.0
language:
  - en
  - zht
pipeline_tag: text-generation

BLOOM-zh

Open-access Multilingual Language Model based on BLOOM

Model Card

Version 1.0 / 20.Feb.2023

This model is a collaboration between the CKIP lab at Academia Sinica (website), MediaTek Research (website), and the National Academy for Educational Research (website).

Model Details
Uses
Training Data
Risks and Limitations
Evaluation
Recommendations
Glossary and Calculations
More Information
Model Card Authors

Model Details

BLOOM-zh is derived from BLOOMZ. It is further trained on a larger amount of Traditional Chinese text data while retaining its pretrained English ability.

Basics

This section provides information for anyone who wants to know about the model.

Developed by: MediaTek Research

Model Type: Transformer-based Language Model

Version: 1.0.0

Languages: Multiple; see training data

License: MEDIATEK RESEARCH License (link) and RAIL License v1.0 (link)

Release Date Estimate: Wednesday, 22.February.2023

Send Questions to: [email protected]

Cite as: MediaTek Research, MediaTek Research Open-access Multilingual Language Model based on BLOOM. International, February 2023.

Organizations of contributors:

MediaTek Research
Academia Sinica
National Academy for Educational Research

Technical Specifications

This section provides information for people who work on model development.

Model Architecture: Modified from Megatron-LM GPT2 (see paper, BLOOM Megatron code):

Decoder-only architecture
Layer normalization applied to word embeddings layer (StableEmbedding; see code, paper)
ALiBi positional encodings (see paper), with GELU activation functions
1,065,314,304 parameters:
- 385,351,680 embedding parameters
- 24 layers, 16 attention heads
- Hidden layers are 1536-dimensional
- Sequence length of 2048 tokens used (see BLOOM tokenizer, tokenizer description)
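
The totals above can be cross-checked with a little arithmetic. The sketch below assumes the usual BLOOM block layout (a fused query/key/value projection, an attention output projection, a 4x-wide MLP, and two layer norms per block, all with biases), plus one layer norm after the embeddings and one final layer norm. That breakdown is an assumption about the architecture, not something this card states. The embedding row count is likewise inferred by dividing the listed embedding parameters by the hidden size.

```python
HIDDEN = 1536                    # hidden size, from the card
LAYERS = 24                      # number of layers, from the card
EMBEDDING_PARAMS = 385_351_680   # embedding parameters, from the card

# Inferred, not quoted: rows of the embedding matrix.
embedding_rows = EMBEDDING_PARAMS // HIDDEN


def block_params(h: int) -> int:
    """Parameters in one transformer block under the assumed layout."""
    qkv = 3 * h * h + 3 * h        # fused query/key/value projection + bias
    attn_out = h * h + h           # attention output projection + bias
    mlp_up = 4 * h * h + 4 * h     # h -> 4h projection + bias
    mlp_down = 4 * h * h + h       # 4h -> h projection + bias
    layer_norms = 2 * (2 * h)      # two layer norms, weight + bias each
    return qkv + attn_out + mlp_up + mlp_down + layer_norms


def total_params(h: int, layers: int, embedding_params: int) -> int:
    """Embeddings + embedding layer norm + all blocks + final layer norm."""
    embedding_norm = 2 * h
    final_norm = 2 * h
    return embedding_params + embedding_norm + layers * block_params(h) + final_norm


TOTAL = total_params(HIDDEN, LAYERS, EMBEDDING_PARAMS)
print(f"embedding rows: {embedding_rows:,}")
print(f"total parameters: {TOTAL:,}")
```

Under these assumptions the total agrees with the 1,065,314,304 figure listed above. The inferred embedding matrix has 250,880 rows, slightly more than the 250,680-entry tokenizer vocabulary, which would be consistent with a padded embedding matrix.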

Objective Function: Cross Entropy with mean reduction (see API documentation).
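
As an illustration of what "mean reduction" means here, the following is a minimal pure-Python sketch, not the training code. It computes the negative log-probability of the target token at each position and averages over positions. The function name and the toy logits are hypothetical.

```python
import math


def cross_entropy_mean(logits, targets):
    """Average of -log softmax(logits)[target] over all positions."""
    losses = []
    for row, target in zip(logits, targets):
        # Log-sum-exp with the max subtracted for numerical stability.
        row_max = max(row)
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        losses.append(log_sum_exp - row[target])
    return sum(losses) / len(losses)


# Toy example: two positions over a three-token vocabulary.
toy_logits = [[2.0, 0.5, -1.0], [0.0, 0.0, 0.0]]
toy_targets = [0, 2]
loss = cross_entropy_mean(toy_logits, toy_targets)
print(f"mean cross entropy: {loss:.4f}")
```

With mean reduction the loss does not grow with the number of tokens in a batch, so values stay comparable across batch sizes and sequence lengths.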

Compute infrastructure:

Hardware: 2 A6000 48GB GPUs (1 node)
Software:
- BigScience Megatron-DeepSpeed (GitHub link)
- Megatron-DeepSpeed (GitHub link)
- DeepSpeed (GitHub link)
- PyTorch (pytorch-1.12 w/ CUDA-11.3; see GitHub link)
- apex (GitHub link)

Training

Details are provided in the paper.

Dates: Feb. 2023

Tokenization

The BLOOM tokenizer (link) is a learned subword tokenizer trained using:

A byte-level Byte Pair Encoding (BPE) algorithm
A simple pre-tokenization rule, no normalization
A vocabulary size of 250,680

It was trained on a subset of a preliminary version of the corpus using alpha-weighting per language.
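
To make the byte-level BPE idea concrete, here is a toy sketch, not the BLOOM tokenizer itself. It starts from raw UTF-8 bytes, so any text (including Traditional Chinese) is covered by at most 256 base symbols, and then repeatedly merges the most frequent adjacent pair. The pre-tokenization rule is omitted for brevity, and all function names are hypothetical.

```python
from collections import Counter


def to_byte_tokens(text):
    """Split text into single-byte tokens; no normalization is applied."""
    return [bytes([b]) for b in text.encode("utf-8")]


def most_frequent_pair(tokens):
    """Return the most common adjacent token pair, or None if there is none."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0] if pairs else None


def merge_pair(tokens, pair):
    """Replace every non-overlapping occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merges on `text`; return tokens and merge list."""
    tokens = to_byte_tokens(text)
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(tokens)
        if pair is None:
            break
        tokens = merge_pair(tokens, pair)
        merges.append(pair)
    return tokens, merges


sample = "台灣台灣"
base_tokens = to_byte_tokens(sample)
bpe_tokens, learned_merges = train_bpe(sample, num_merges=2)
print(len(base_tokens), "byte tokens ->", len(bpe_tokens), "tokens after merging")
```

Because the tokens are byte strings and nothing is normalized, concatenating them always reproduces the original UTF-8 input exactly.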

Environmental Impact

Please refer to Model card.

Uses

This section addresses questions around how the model is intended to be used, discusses the foreseeable users of the model (including those affected by the model), and describes uses that are considered out of scope or misuse of the model. It provides information for anyone considering using the model or who is affected by the model.

Please refer to Model card.

Training Data

This section provides a high-level overview of the training data. It is relevant for anyone who wants to know the basics of what the model is learning.

We trained the 1B1-parameter model on a total of 6 billion tokens, mainly crawled from the internet or provided by the National Academy for Educational Research. 75% of the training data is Traditional Chinese and 25% is English. Details are provided in the paper.
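
For a rough sense of scale, the split can be worked out as below. This is a back-of-the-envelope sketch that assumes the 75%/25% shares are measured in tokens, that "6 billion" is exact rather than rounded, and that training sequences are packed to the full 2048-token length; none of these is confirmed by the card.

```python
TOTAL_TOKENS = 6_000_000_000   # from the card, treated as exact here
SEQUENCE_LENGTH = 2048         # from the technical specifications
SHARE_PERCENT = {"Traditional Chinese": 75, "English": 25}

# Integer arithmetic avoids floating-point rounding.
tokens_per_language = {
    language: TOTAL_TOKENS * percent // 100
    for language, percent in SHARE_PERCENT.items()
}
full_sequences = TOTAL_TOKENS // SEQUENCE_LENGTH

for language, count in tokens_per_language.items():
    print(f"{language}: {count:,} tokens")
print(f"full {SEQUENCE_LENGTH}-token sequences: {full_sequences:,}")
```

Under these assumptions the data amounts to about 4.5 billion Traditional Chinese tokens, 1.5 billion English tokens, and roughly 2.9 million full-length sequences.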

Risks and Limitations

This section identifies foreseeable harms and misunderstandings.

Please refer to Model card.

Factors

This section lists some different aspects of BLOOM models. Its focus is on those aspects that are likely to give rise to high variance in model behavior.

The model is trained on Traditional Chinese and English. However, the pretrained weights capture more than 40 different languages.
The model is trained on web-crawled data, news articles, novels, knowledge sources (encyclopedia, education sector), and instructions.

Recommendations

This section provides information on warnings and potential mitigations.

Please refer to Model card.

Model Card Authors

Ordered roughly chronologically and by amount of time spent.

Philipp Ennen, Po-Chun Hsu, Chan-Jan Hsu, Chang-Le Liu, Yin-Hsiang Liao, Chin-Tung Lin, Jezabel Rodriguez Garcia, Federica Freddi, Da-Shan Shiu, Wei-Yun Ma
