Merge branch 'main' of https://huggingface.co/beomi/open-llama-2-ko
- LICENSE +21 -0
- README.md +126 -0
- corpus/AI_HUB +50 -0
- corpus/MODU_CORPUS +6 -0
LICENSE
ADDED
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2023 Junbum Lee(Beomi)

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
README.md
ADDED
@@ -0,0 +1,126 @@
---
language:
- ko
- en
pipeline_tag: text-generation
inference: false
tags:
- facebook
- meta
- pytorch
- llama
- llama-2
- kollama
- llama-2-ko
license: mit
library_name: transformers
---

**Update Log**

- 2023.12.14: First Release of Open-Llama-2-Ko

# **Open-Llama-2-Ko** 🦙🇰🇷

Open-Llama-2-Ko serves as an advanced iteration of Llama 2, benefiting from an expanded vocabulary and the inclusion of a Korean corpus in its further pretraining.
Just like its predecessor, Llama-2-Ko operates within the broad range of generative text models that stretch from 7 billion to 70 billion parameters.
This repository focuses on the 7B pretrained version, which is tailored to fit the Hugging Face Transformers format.
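
A minimal generation sketch (the repo id is assumed from this repository's URL; adjust the dtype and device for your hardware):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "beomi/open-llama-2-ko"  # assumed from the repository URL

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    torch_dtype=torch.bfloat16,  # BF16 needs an NVIDIA GPU; see the note near the end
    device_map="auto",           # requires the `accelerate` package
)

prompt = "안녕하세요, 오늘은"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```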

The main difference between the Llama-2-Ko series and Open-Llama-2-Ko is the dataset: the Open-Llama-2-Ko series uses only publicly accessible Korean corpora,
including [AI Hub](https://www.aihub.or.kr), [Modu Corpus (모두의 말뭉치)](https://corpus.korean.go.kr/) and [Korean Wikipedia](https://dumps.wikimedia.org/kowiki/).

Since training was done using only publicly available corpora, this model is open to everyone without any restrictions. (*This model follows the MIT License.)

## Model Details

**Model Developers** Junbum Lee (Beomi)

**Variations** Open-Llama-2-Ko will come in a range of parameter sizes (7B and 13B) as well as pretrained variations.

**Input** Models input text only.

**Output** Models generate text only.

**Model Architecture**

Open-Llama-2-Ko is an auto-regressive language model that uses an optimized transformer architecture based on Llama-2.

||Training Data|Params|Content Length|GQA|Tokens|LR|
|---|---|---|---|---|---|---|
|Open-Llama-2-Ko|*A new mix of publicly accessible Korean corpora*|7B|2k|✗|>15B*|5e-5|

**Train Corpus**

Trained with a selected corpus from AI Hub and Modu Corpus. The detailed list of datasets used to train this model is available below:

- AI Hub: [corpus/AI_HUB](./corpus/AI_HUB)
  - Used only the `Training` split of the data.
  - Explicitly dropped the `Validation`/`Test` splits of the data.
- Modu Corpus: [corpus/MODU_CORPUS](./corpus/MODU_CORPUS)

The final JSONL dataset used to train this model is 61GB.

Total number of tokens: approx. 15B (*counted with the expanded tokenizer; with the original Llama tokenizer, >60B tokens.)

**Vocab Expansion**

| Model Name | Vocabulary Size | Description |
| --- | --- | --- |
| Original Llama-2 | 32000 | Sentencepiece BPE |
| **Expanded Llama-2-Ko** | 46336 | Sentencepiece BPE, added Korean vocab and merges |

**Tokenizing "안녕하세요, 오늘은 날씨가 좋네요."**

| Model | Tokens |
| --- | --- |
| Llama-2 | `['▁', '안', '<0xEB>', '<0x85>', '<0x95>', '하', '세', '요', ',', '▁', '오', '<0xEB>', '<0x8A>', '<0x98>', '은', '▁', '<0xEB>', '<0x82>', '<0xA0>', '씨', '가', '▁', '<0xEC>', '<0xA2>', '<0x8B>', '<0xEB>', '<0x84>', '<0xA4>', '요']` |
| Llama-2-Ko | `['▁안녕', '하세요', ',', '▁오늘은', '▁날', '씨가', '▁좋네요']` |

**Tokenizing "Llama 2: Open Foundation and Fine-Tuned Chat Models"**
|
80 |
+
|
81 |
+
| Model | Tokens |
|
82 |
+
| --- | --- |
|
83 |
+
| Llama-2 | `['โL', 'l', 'ama', 'โ', '2', ':', 'โOpen', 'โFoundation', 'โand', 'โFine', '-', 'T', 'un', 'ed', 'โCh', 'at', 'โMod', 'els']` |
|
84 |
+
| Llama-2-Ko | `['โL', 'l', 'ama', 'โ', '2', ':', 'โOpen', 'โFoundation', 'โand', 'โFine', '-', 'T', 'un', 'ed', 'โCh', 'at', 'โMod', 'els']` |
|
85 |
+
|
86 |
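
To reproduce a comparison like the ones above, a short sketch (repo ids are illustrative; the upstream Llama-2 checkpoint is gated):

```python
from transformers import AutoTokenizer

base = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")  # original Llama-2 tokenizer
ko = AutoTokenizer.from_pretrained("beomi/open-llama-2-ko")       # expanded tokenizer

text = "안녕하세요, 오늘은 날씨가 좋네요."
print(base.tokenize(text))  # byte-fallback pieces such as '<0xEB>'
print(ko.tokenize(text))    # mostly whole Korean subwords
```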

# **Model Benchmark**

## LM Eval Harness - Korean (polyglot branch)

- Used EleutherAI's lm-evaluation-harness: https://github.com/EleutherAI/lm-evaluation-harness/tree/polyglot

TBD

## Note for oobabooga/text-generation-webui

Remove the `ValueError` catch in the `load_tokenizer` function (at or near line 109) of `modules/models.py`:

```diff
diff --git a/modules/models.py b/modules/models.py
index 232d5fa..de5b7a0 100644
--- a/modules/models.py
+++ b/modules/models.py
@@ -106,7 +106,7 @@ def load_tokenizer(model_name, model):
             trust_remote_code=shared.args.trust_remote_code,
             use_fast=False
         )
-    except ValueError:
+    except:
         tokenizer = AutoTokenizer.from_pretrained(
             path_to_model,
             trust_remote_code=shared.args.trust_remote_code,
```

Since Llama-2-Ko uses the fast tokenizer provided by the HF tokenizers library, NOT the sentencepiece package,
it is required to pass the `use_fast=True` option when initializing the tokenizer.
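
For example, a minimal sketch (the repo id is assumed from this repository's URL):

```python
from transformers import AutoTokenizer

# use_fast=True selects the Hugging Face fast tokenizer instead of sentencepiece.
tokenizer = AutoTokenizer.from_pretrained("beomi/open-llama-2-ko", use_fast=True)
```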

Apple Silicon does not support BF16 computing; use CPU instead. (BF16 is supported when using NVIDIA GPUs.)
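
A small sketch of that device/dtype choice (assuming PyTorch):

```python
import torch

# Prefer BF16 on NVIDIA GPUs that support it; otherwise fall back to CPU/float32.
if torch.cuda.is_available() and torch.cuda.is_bf16_supported():
    device, dtype = "cuda", torch.bfloat16
else:
    device, dtype = "cpu", torch.float32
```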

## Citation

TBD

## Acknowledgement

- The training is supported by the [TPU Research Cloud](https://sites.research.google/trc/) program.
- The training corpus is from [AI Hub](https://www.aihub.or.kr/), [Modu Corpus](https://corpus.korean.go.kr/) and [Korean Wikipedia](https://dumps.wikimedia.org/kowiki/).
corpus/AI_HUB
ADDED
@@ -0,0 +1,50 @@
754M ./001.문서요약.jsonl
397M ./006.전문분야한영.jsonl
486M ./016.행정_문서_대상_기계독해_데이터.jsonl
563M ./017.뉴스_기사_기계독해_데이터.jsonl
1.2G ./018.논문자료_요약_데이터.jsonl
88M ./019.법률,_규정_(판결서,_약관_등)_텍스트_분석_데이터.jsonl
75M ./020.주제별_텍스트_일상_대화_데이터.jsonl
265M ./021.도서자료_기계독해.jsonl
30M ./021.용도별_목적대화_데이터.jsonl
566M ./022.요약문_및_레포트_생성_데이터.jsonl
19G ./023.전문분야_말뭉치_데이터(분야별_개체명_인식_포함).jsonl
253M ./023.방송_콘텐츠_대본_요약_데이터.jsonl
918M ./025.일상생활_및_구어체_한-영_번역_병렬_말뭉치_데이터.jsonl
307M ./026.한국어-영어_번역_말뭉치_1.jsonl
1.3G ./026.기술과학_분야_한-영_번역_병렬_말뭉치_데이터.jsonl
309M ./027.한국어-중국어_번역_말뭉치_1.jsonl
347M ./027.한국어-영어_번역_말뭉치_2.jsonl
538M ./027.일상생활_및_구어체_한-중,_한-일_번역_병렬_말뭉치_데이터.jsonl
276M ./028.한국어-중국어_번역_말뭉치_2.jsonl
300M ./028.다국어_구어체_번역_병렬_말뭉치_데이터.jsonl
410M ./029.한국어-일본어_번역_말뭉치.jsonl
542K ./029.대규모_구매도서_기반_한국어_말뭉치_데이터.jsonl
9.9G ./030.웹데이터_기반_한국어_말뭉치_데이터.jsonl
1.4G ./031.온라인_구어체_말뭉치_데이터.jsonl
258M ./032.방송콘텐츠_한국어-영어_번역_말뭉치.jsonl
84M ./032.특허_분야_자동분류_데이터.jsonl
239M ./034.방송콘텐츠_한국어-유럽어_번역_말뭉치.jsonl
65M ./044.페르소나_대화.jsonl
56M ./045.지식검색_대화.jsonl
67M ./046.공감형_대화.jsonl
85M ./049.일반상식_문장_생성_평가_데이터.jsonl
13M ./050.발화유형(문어,구어,채팅)별_기계번역_병렬_말뭉치.jsonl
193K ./052.기계번역_품질_검증_데이터.jsonl
118M ./053.한국어-다국어(영어_제외)_번역_말뭉치(기술과학).jsonl
127M ./054.한국어-다국어_번역_말뭉치(기초과학).jsonl
67M ./055.한국어-다국어_번역_말뭉치(인문학).jsonl
205M ./11.기계독해.jsonl
259M ./141.한국어_멀티세션_대화.jsonl
248M ./142.한국어_지식기반_관계_데이터.jsonl
108M ./143.민원_업무_효율,_자동화를_위한_언어_AI_학습데이터.jsonl
2.4G ./146.낚시성_기사_탐지_데이터.jsonl
23M ./147.텍스트_윤리검증_데이터.jsonl
632M ./153.기술과학_요약_데이터.jsonl
962M ./155.산업정보_연계_주요국_특허_영-한_데이터.jsonl
1.1G ./156.전문분야_영-한,_중-한_번역_말뭉치(식품).jsonl
236M ./157.방송_콘텐츠_한-중,_한-일_번역_병렬_말뭉치_데이터.jsonl
418M ./157.추상_요약_사실성_검증_데이터.jsonl
12M ./158.시각_표현_탐지_데이터.jsonl
17M ./159.문장_유형(추론,_예측_등)_판단_데이터.jsonl
1.4G ./297.SNS_데이터_고도화.jsonl
corpus/MODU_CORPUS
ADDED
@@ -0,0 +1,6 @@
일상대화말뭉치 2020, 2021
신문 말뭉치 2020, 2021, 2022
유사 문장 말뭉치
문서 요약 말뭉치
문어 말뭉치
의미역 분석 말뭉치