---
license: other
license_name: yi-license
license_link: LICENSE
language:
- en
- ko
pipeline_tag: text-generation
inference: false
tags:
- pytorch
- Yi-Ko
- 01-ai
- Yi
library_name: transformers
---
# Yi Ko 34B Instruct

## Training Process

1. Continued pretraining on a Korean corpus.
2. Supervised fine-tuning (SFT).
3. DPO on the [distilabel-capybara-dpo-7k-binarized](https://huggingface.co/datasets/argilla/distilabel-capybara-dpo-7k-binarized) dataset (a minimal sketch follows this list).

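The exact SFT/DPO setup is not published here; the following is a minimal, illustrative sketch of the DPO step using TRL's `DPOTrainer` with the dataset linked above. The checkpoint path, hyperparameters, and column handling are assumptions, not the author's configuration.

```python
# Illustrative DPO sketch only -- not the author's actual training script.
# Assumes recent versions of transformers, datasets, and trl.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base = "path/to/yi-ko-34b-sft-checkpoint"  # hypothetical SFT checkpoint
model = AutoModelForCausalLM.from_pretrained(base)
tokenizer = AutoTokenizer.from_pretrained(base)

# Preference pairs; depending on the trl version, the columns may first need to
# be mapped to plain "prompt"/"chosen"/"rejected" strings.
train_ds = load_dataset("argilla/distilabel-capybara-dpo-7k-binarized", split="train")

args = DPOConfig(
    output_dir="yi-ko-34b-dpo",
    beta=0.1,                       # assumed DPO temperature
    per_device_train_batch_size=1,  # assumed; a 34B model needs multi-GPU sharding in practice
    learning_rate=5e-7,             # assumed
)
trainer = DPOTrainer(
    model=model,
    ref_model=None,        # trl builds an implicit frozen reference copy when None
    args=args,
    train_dataset=train_ds,
    tokenizer=tokenizer,   # named processing_class in the newest trl releases
)
trainer.train()
```
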
## Model Info

| Context Length | Parameters | Prompt Template | MMLU (5-shot) |
| --- | --- | --- | --- |
| 4k (4096) | 34B | ChatML (partly) | 49.03 |

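The prompt template is ChatML, i.e. each turn is wrapped in `<|im_start|>role ... <|im_end|>`. The snippet below is a minimal inference sketch; the repository id, system prompt, and generation settings are assumptions and should be adjusted to your setup.

```python
# Minimal ChatML inference sketch (repo id and generation settings are assumed).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "maywell/Yi-Ko-34B-Instruct"  # hypothetical id; use this repository's actual path
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# ChatML format: <|im_start|>{role}\n{content}<|im_end|>
prompt = (
    "<|im_start|>system\nYou are a helpful Korean-English assistant.<|im_end|>\n"
    "<|im_start|>user\n한국의 수도는 어디인가요?<|im_end|>\n"  # "What is the capital of Korea?"
    "<|im_start|>assistant\n"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```
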
# Original Model Card by [beomi](https://huggingface.co/beomi)

Yi-Ko series models serve as advanced iterations of the 01-ai/Yi models, benefiting from an expanded vocabulary and further pretraining on a Korean/English corpus. Just like their predecessors, Yi-Ko series models span the range of generative text models from 6 billion to 34 billion parameters. This repository focuses on the **34B** pretrained version, which is tailored to fit the Hugging Face Transformers format. For access to the other models, see [beomi](https://huggingface.co/beomi)'s model index on Hugging Face.

## Model Details

**Model Developers** Junbum Lee (Beomi)

**Variations** The Yi-Ko series comes in a range of parameter sizes (6B and 34B), all trained on Korean+English (Ko) data.

**Input** Models input text only.

**Output** Models generate text only.

**Model Architecture**

Yi-Ko series models are auto-regressive language models that use an optimized transformer architecture based on Llama-2*.

<small>*The Yi model architecture is based on Llama-2, so it can be loaded via the `LlamaForCausalLM` class in Hugging Face Transformers.</small>

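As a quick illustration of the note above, the model loads with the stock Llama classes; the repository id below follows the original card and is an assumption.

```python
# The architecture is Llama-2 compatible, so the generic Llama class loads it directly.
from transformers import AutoTokenizer, LlamaForCausalLM

repo = "beomi/Yi-Ko-34B"  # assumed id of the base pretrained model described in this card
tokenizer = AutoTokenizer.from_pretrained(repo)
model = LlamaForCausalLM.from_pretrained(repo, device_map="auto")
```
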
|Model Name|Training Data|Params|Context Length|GQA|Trained Tokens|LR|Batch Size (tokens)|
|---|---|---|---|---|---|---|---|
|Yi-Ko-34B|*A mix of Korean + English online data*|34B|4k|O|40B+|5e-5|4M|

**Vocab Expansion**

| Model Name | Vocabulary Size | Description |
| --- | --- | --- |
| Original Yi-Series | 64000 | Sentencepiece BPE |
| **Expanded Yi-Ko Series** | 78464 | Sentencepiece BPE. Added Korean vocab and merges |

**Tokenizing "안녕하세요, 오늘은 날씨가 좋네요.ㅎㅎ"**

| Model | # of tokens | Tokens |
| --- | --- | --- |
| Original Yi-Series | 47 | `['<0xEC>', '<0x95>', '<0x88>', '<0xEB>', '<0x85>', '<0x95>', '하', '<0xEC>', '<0x84>', '<0xB8>', '<0xEC>', '<0x9A>', '<0x94>', ',', '▁', '<0xEC>', '<0x98>', '<0xA4>', '<0xEB>', '<0x8A>', '<0x98>', '은', '▁', '<0xEB>', '<0x82>', '<0xA0>', '<0xEC>', '<0x94>', '<0xA8>', '가', '▁', '<0xEC>', '<0xA2>', '<0x8B>', '<0xEB>', '<0x84>', '<0xA4>', '<0xEC>', '<0x9A>', '<0x94>', '.', '<0xE3>', '<0x85>', '<0x8E>', '<0xE3>', '<0x85>', '<0x8E>']` |
| **Expanded Yi-Ko Series** | 10 | `['▁안녕', '하세요', ',', '▁오늘은', '▁날', '씨가', '▁좋네요', '.', 'ㅎ', 'ㅎ']` |

<small>*Same Korean vocabulary as the Llama-2-Ko series.</small>

**Tokenizing "The Yi series models are large language models trained from scratch by developers at 01.AI."**

| Model | # of tokens | Tokens |
| --- | --- | --- |
| Original Yi-Series | 21 | `['The', '▁Y', 'i', '▁series', '▁models', '▁are', '▁large', '▁language', '▁models', '▁trained', '▁from', '▁scratch', '▁by', '▁developers', '▁at', '▁', '0', '1', '.', 'AI', '.']` |
| **Expanded Yi-Ko Series** | 21 | `['▁The', '▁Y', 'i', '▁series', '▁models', '▁are', '▁large', '▁language', '▁models', '▁trained', '▁from', '▁scratch', '▁by', '▁developers', '▁at', '▁', '0', '1', '.', 'AI', '.']` |

<small>*Same Korean vocabulary as the Llama-2-Ko series. Since the **Expanded Yi-Ko Series** prepends `▁` at the beginning of the text (to ensure identical tokenization of Korean sentences), the difference in English tokenization is limited to a negligible change in the first token.</small>

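The token counts in the tables above can be reproduced with a few lines of `transformers` code; `01-ai/Yi-34B` is the original Yi checkpoint, while `beomi/Yi-Ko-34B` is assumed to be the expanded-vocabulary checkpoint described in this card.

```python
# Reproduce the tokenization comparison above.
from transformers import AutoTokenizer

text_ko = "안녕하세요, 오늘은 날씨가 좋네요.ㅎㅎ"  # "Hello, the weather is nice today, haha"
text_en = "The Yi series models are large language models trained from scratch by developers at 01.AI."

for name in ("01-ai/Yi-34B", "beomi/Yi-Ko-34B"):  # original vs. expanded vocabulary
    tok = AutoTokenizer.from_pretrained(name)
    for text in (text_ko, text_en):
        pieces = tok.tokenize(text)
        print(f"{name}: {len(pieces):2d} tokens -> {pieces}")
```
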
# **Model Benchmark**

## LM Eval Harness - Korean Benchmarks

| Tasks |Version|Filter|n-shot| Metric |Value | |Stderr|
|----------------|------:|------|-----:|--------|-----:|---|------|
|**kmmlu_direct**|N/A |none | 5|exact_match|**0.5027**|± |0.1019|
|kobest_boolq | 1|none | 5|acc |0.9202|± |0.0072|
| | |none | 5|f1 |0.9202|± |N/A |
|kobest_copa | 1|none | 5|acc |0.8480|± |0.0114|
| | |none | 5|f1 |0.8479|± |N/A |
|kobest_hellaswag| 1|none | 5|acc |0.5320|± |0.0223|
| | |none | 5|f1 |0.5281|± |N/A |
| | |none | 5|acc_norm|0.6340|± |0.0216|
|kobest_sentineg | 1|none | 5|acc |0.9874|± |0.0056|
| | |none | 5|f1 |0.9874|± |N/A |
|haerae |N/A |none | 5|acc |0.7965|± |0.0116|
| | |none | 5|acc_norm|0.7965|± |0.0116|
| - haerae_general_knowledge | 1|none | 5|acc |0.5114|± |0.0378|
| | |none | 5|acc_norm|0.5114|± |0.0378|
| - haerae_history | 1|none | 5|acc |0.8511|± |0.0260|
| | |none | 5|acc_norm|0.8511|± |0.0260|
| - haerae_loan_word | 1|none | 5|acc |0.8402|± |0.0283|
| | |none | 5|acc_norm|0.8402|± |0.0283|
| - haerae_rare_word | 1|none | 5|acc |0.8642|± |0.0170|
| | |none | 5|acc_norm|0.8642|± |0.0170|
| - haerae_standard_nomenclature| 1|none | 5|acc |0.8301|± |0.0305|
| | |none | 5|acc_norm|0.8301|± |0.0305|

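The numbers above come from the LM Evaluation Harness. Below is a sketch of how a comparable 5-shot run could be launched via the harness's Python API; only the task names and shot count are taken from the table, while the model id, dtype, and batch size are assumptions.

```python
# Sketch of a 5-shot Korean-benchmark run with lm-evaluation-harness (lm_eval >= 0.4).
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=maywell/Yi-Ko-34B-Instruct,dtype=bfloat16",  # hypothetical repo id
    tasks=["kmmlu_direct", "kobest_boolq", "kobest_copa",
           "kobest_hellaswag", "kobest_sentineg", "haerae"],
    num_fewshot=5,
    batch_size=8,
)
print(results["results"])
```
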
## LICENSE

Follows the Yi License.

## Citation



## Acknowledgement

The training was supported by the [TPU Research Cloud](https://sites.research.google/trc/) program.