|
--- |
|
language: |
|
- en |
|
- ko |
|
library_name: transformers |
|
tags: |
|
- pytorch |
|
- Yi-Ko |
|
- 01-ai |
|
- Yi |
|
extra_gated_heading: Access beomi/Yi-Ko-6B on Hugging Face |
|
extra_gated_button_content: Submit |
|
extra_gated_fields: |
|
I agree to share my name, email address and username: checkbox |
|
? I confirm that I understand this project is for research purposes only, and confirm |
|
that I agree to follow the LICENSE of this model |
|
: checkbox |
|
pipeline_tag: text-generation |
|
inference: false |
|
model-index: |
|
- name: Yi-Ko-6B |
|
results: |
|
- task: |
|
type: text-generation |
|
name: Text Generation |
|
dataset: |
|
name: AI2 Reasoning Challenge (25-Shot) |
|
type: ai2_arc |
|
config: ARC-Challenge |
|
split: test |
|
args: |
|
num_few_shot: 25 |
|
metrics: |
|
- type: acc_norm |
|
value: 48.89 |
|
name: normalized accuracy |
|
source: |
|
url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=beomi/Yi-Ko-6B |
|
name: Open LLM Leaderboard |
|
- task: |
|
type: text-generation |
|
name: Text Generation |
|
dataset: |
|
name: HellaSwag (10-Shot) |
|
type: hellaswag |
|
split: validation |
|
args: |
|
num_few_shot: 10 |
|
metrics: |
|
- type: acc_norm |
|
value: 74.48 |
|
name: normalized accuracy |
|
source: |
|
url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=beomi/Yi-Ko-6B |
|
name: Open LLM Leaderboard |
|
- task: |
|
type: text-generation |
|
name: Text Generation |
|
dataset: |
|
name: MMLU (5-Shot) |
|
type: cais/mmlu |
|
config: all |
|
split: test |
|
args: |
|
num_few_shot: 5 |
|
metrics: |
|
- type: acc |
|
value: 55.72 |
|
name: accuracy |
|
source: |
|
url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=beomi/Yi-Ko-6B |
|
name: Open LLM Leaderboard |
|
- task: |
|
type: text-generation |
|
name: Text Generation |
|
dataset: |
|
name: TruthfulQA (0-shot) |
|
type: truthful_qa |
|
config: multiple_choice |
|
split: validation |
|
args: |
|
num_few_shot: 0 |
|
metrics: |
|
- type: mc2 |
|
value: 37.09 |
|
source: |
|
url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=beomi/Yi-Ko-6B |
|
name: Open LLM Leaderboard |
|
- task: |
|
type: text-generation |
|
name: Text Generation |
|
dataset: |
|
name: Winogrande (5-shot) |
|
type: winogrande |
|
config: winogrande_xl |
|
split: validation |
|
args: |
|
num_few_shot: 5 |
|
metrics: |
|
- type: acc |
|
value: 72.93 |
|
name: accuracy |
|
source: |
|
url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=beomi/Yi-Ko-6B |
|
name: Open LLM Leaderboard |
|
- task: |
|
type: text-generation |
|
name: Text Generation |
|
dataset: |
|
name: GSM8k (5-shot) |
|
type: gsm8k |
|
config: main |
|
split: test |
|
args: |
|
num_few_shot: 5 |
|
metrics: |
|
- type: acc |
|
value: 12.51 |
|
name: accuracy |
|
source: |
|
url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=beomi/Yi-Ko-6B |
|
name: Open LLM Leaderboard |
|
license: apache-2.0 |
|
--- |
|
|
|
> Update @ 2024.01.29 New Model [beomi/Yi-Ko-DUS-9B](https://huggingface.co/beomi/Yi-Ko-DUS-9B) Released! 🎉 |
|
|
|
> Update @ 2023.12.03 Yi-Ko(KoEN)-6B Achieved #1🥇 Pretrained Models at [Open Korean LLM Leaderboard](https://huggingface.co/spaces/upstage/open-ko-llm-leaderboard)! 🎉 |
|
|
|
> Update @ 2023.12.01 Alpha Release of Yi-Ko(KoEN)-6B model 🎉 |
|
|
|
# **beomi/Yi-Ko-6B** |
|
|
|
Yi-Ko series models serve as advanced iterations of the 01-ai/Yi models,
benefiting from an expanded vocabulary and further pretraining on a Korean/English corpus.
Just like their predecessors, the Yi-Ko series models span the range of generative text models from 6 billion to 34 billion parameters.
This repository focuses on the **6B** pretrained version,
which is tailored to fit the Hugging Face Transformers format.
For access to the other models, feel free to consult the index provided below.
|
|
|
## Model Details |
|
|
|
**Model Developers** Junbum Lee (Beomi) |
|
|
|
**Variations** The Yi-Ko series will come in a range of parameter sizes: 6B and 34B variants.
|
|
|
**Input** Models input text only. |
|
|
|
**Output** Models generate text only. |
|
|
|
**Model Architecture** |
|
|
|
Yi-Ko series models are auto-regressive language models that use an optimized transformer architecture based on Llama-2*.
|
|
|
<small>*The Yi model architecture is based on Llama 2, so it can be loaded via the `LlamaForCausalLM` class in Hugging Face Transformers.</small>
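
A minimal loading sketch (assuming `transformers`, `torch`, and `accelerate` are installed and that access to the gated repository has already been granted; the bf16 dtype and sampling parameters are illustrative choices, not prescribed settings):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "beomi/Yi-Ko-6B"

# The Llama-2-style weights and the expanded Yi-Ko tokenizer both load
# through the standard Auto classes (LlamaForCausalLM under the hood).
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # assumption: bf16 inference; adjust to your hardware
    device_map="auto",
)

prompt = "안녕하세요, 오늘은"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```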
|
|
|
|Model Name|Training Data|Params|Context Length|GQA|Trained Tokens|LR|Batch Size(per step)| |
|
|---|---|---|---|---|---|---|---| |
|
|Yi-Ko-6B|*A mix of Korean + English online data*|6B|4k|O|>60B|5e-5|2048|
|
|
|
**Vocab Expansion** |
|
|
|
| Model Name | Vocabulary Size | Description | |
|
| --- | --- | --- | |
|
| Original Yi-Series | 64000 | SentencePiece BPE |
| **Expanded Yi-Ko Series** | 78464 | SentencePiece BPE, with added Korean vocabulary and merges |
|
|
|
**Tokenizing "안녕하세요, 오늘은 날씨가 좋네요.ㅎㅎ"** |
|
|
|
| Model | # of tokens | Tokens | |
|
| --- | --- | --- | |
|
| Original Yi-Series | 47 | `['<0xEC>', '<0x95>', '<0x88>', '<0xEB>', '<0x85>', '<0x95>', '하', '<0xEC>', '<0x84>', '<0xB8>', '<0xEC>', '<0x9A>', '<0x94>', ',', '▁', '<0xEC>', '<0x98>', '<0xA4>', '<0xEB>', '<0x8A>', '<0x98>', '은', '▁', '<0xEB>', '<0x82>', '<0xA0>', '<0xEC>', '<0x94>', '<0xA8>', '가', '▁', '<0xEC>', '<0xA2>', '<0x8B>', '<0xEB>', '<0x84>', '<0xA4>', '<0xEC>', '<0x9A>', '<0x94>', '.', '<0xE3>', '<0x85>', '<0x8E>', '<0xE3>', '<0x85>', '<0x8E>']` | |
|
| **Expanded Yi-Ko Series** | 10 | `['▁안녕', '하세요', ',', '▁오늘은', '▁날', '씨가', '▁좋네요', '.', 'ㅎ', 'ㅎ']` | |
|
|<small>*Shares the same Korean vocabulary as the Llama-2-Ko series.</small>| | |
|
|
|
**Tokenizing "Llama 2: Open Foundation and Fine-Tuned Chat Models"** |
|
|
|
| Model | # of tokens | Tokens | |
|
| --- | --- | --- | |
|
| Original Yi-Series | 21 | `['The', '▁Y', 'i', '▁series', '▁models', '▁are', '▁large', '▁language', '▁models', '▁trained', '▁from', '▁scratch', '▁by', '▁developers', '▁at', '▁', '0', '1', '.', 'AI', '.']` | |
|
| **Expanded Yi-Ko Series** | 21 | `['▁The', '▁Y', 'i', '▁series', '▁models', '▁are', '▁large', '▁language', '▁models', '▁trained', '▁from', '▁scratch', '▁by', '▁developers', '▁at', '▁', '0', '1', '.', 'AI', '.']` | |
|
|<small>*Shares the same Korean vocabulary as the Llama-2-Ko series.</small>| | <small>*Because the **Expanded Yi-Ko Series** tokenizer prepends `▁` to the text (to ensure identical tokenization of Korean sentences), only the first token differs on English text, a negligible difference.</small>|
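
A minimal sketch for reproducing the Yi-Ko rows above (assuming `transformers` is installed and access to the repository has been granted; swapping in the original Yi tokenizer, e.g. from `01-ai/Yi-6B`, reproduces the first rows in the same way):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("beomi/Yi-Ko-6B")

print(len(tokenizer))  # expanded vocabulary size (78464 expected)

for text in [
    "안녕하세요, 오늘은 날씨가 좋네요.ㅎㅎ",
    "The Yi series models are large language models trained from scratch by developers at 01.AI.",
]:
    tokens = tokenizer.tokenize(text)
    print(len(tokens), tokens)
```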
|
|
|
# **Model Benchmark** |
|
|
|
## LM Eval Harness - Korean (polyglot branch) |
|
|
|
| beomi/Yi-Ko-6B | 0 | 5 | 10 | 50 | |
|
|:---------------------------------|---------:|---------:|---------:|---------:| |
|
| kobest_boolq (macro_f1) | 0.705806 | 0.79905 | 0.814299 | 0.81704 | |
|
| kobest_copa (macro_f1) | 0.775604 | 0.808899 | 0.816866 | 0.842943 | |
|
| kobest_hellaswag (macro_f1) | 0.500876 | 0.498673 | 0.493507 | 0.492183 | |
|
| kobest_sentineg (macro_f1) | 0.404371 | 0.967254 | 0.982368 | 0.974811 | |
|
| kohatespeech (macro_f1) | 0.353428 | 0.351804 | 0.402423 | 0.503764 | |
|
| kohatespeech_apeach (macro_f1) | 0.337667 | 0.498679 | 0.471962 | 0.608401 | |
|
| kohatespeech_gen_bias (macro_f1) | 0.124535 | 0.484745 | 0.474475 | 0.461714 | |
|
| korunsmile (f1) | 0.382804 | 0.349344 | 0.391383 | 0.432875 | |
|
| nsmc (acc) | 0.55064 | 0.8801 | 0.89866 | 0.9071 | |
|
| pawsx_ko (acc) | 0.5145 | 0.54 | 0.538 | 0.5165 | |
|
|
|
## LICENSE |
|
|
|
Apache 2.0 (for research) |
|
|
|
> For commercial use, please contact [email protected] to acquire a Yi-Ko series commercial license.
|
|
|
## Citation |
|
|
|
Please use the BibTeX entry below:
|
|
|
``` |
|
@misc {lee_junbum_2024, |
|
author = { {Lee Junbum} }, |
|
title = { Yi-Ko-6B (Revision 205083a) }, |
|
year = 2024, |
|
url = { https://huggingface.co/beomi/Yi-Ko-6B }, |
|
doi = { 10.57967/hf/1708 }, |
|
publisher = { Hugging Face } |
|
} |
|
``` |
|
|
|
## Acknowledgement |
|
|
|
The training was supported by the [TPU Research Cloud](https://sites.research.google/trc/) program.
|
# [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard) |
|
Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_beomi__Yi-Ko-6B).
|
|
|
| Metric |Value| |
|
|---------------------------------|----:| |
|
|Avg. |50.27| |
|
|AI2 Reasoning Challenge (25-Shot)|48.89| |
|
|HellaSwag (10-Shot) |74.48| |
|
|MMLU (5-Shot) |55.72| |
|
|TruthfulQA (0-shot) |37.09| |
|
|Winogrande (5-shot) |72.93| |
|
|GSM8k (5-shot) |12.51| |
|
|
|
|