Update README.md
README.md
**Update Log**

- 2023.12.14: Initial Release of Open-Llama-2-Ko

# **Open-Llama-2-Ko** 🦙🇰🇷

Open-Llama-2-Ko is an advanced iteration of the Llama 2 model, featuring an expanded vocabulary and additional pretraining on a Korean corpus. Like its predecessor, Llama-2-Ko, it belongs to the family of generative text models spanning 7 billion to 70 billion parameters. This repository focuses on the 7B pretrained version, designed to integrate seamlessly with the Hugging Face Transformers format.
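Since the checkpoint ships in the Hugging Face Transformers format, it can be loaded with the standard auto classes. A minimal sketch; the repository ID below is illustrative and should be replaced with the actual checkpoint name:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative repository ID -- substitute the actual checkpoint name.
model_id = "beomi/open-llama-2-ko-7b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision keeps the 7B weights near 14GB
    device_map="auto",          # requires `accelerate` to be installed
)

prompt = "대한민국의 수도는"  # "The capital of South Korea is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```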
The primary distinction between the Llama-2-Ko series and Open-Llama-2-Ko lies in the dataset: Open-Llama-2-Ko exclusively utilizes publicly accessible Korean corpora, including [AI Hub](https://www.aihub.or.kr), [Modu Corpus, 모두의 말뭉치](https://corpus.korean.go.kr/), and [Korean Wikipedia](https://dumps.wikimedia.org/kowiki/).
As training was conducted solely with publicly available corpora, this model is open for unrestricted use by everyone, adhering to the MIT License.
## Model Details

**Model Developers:** Junbum Lee (Beomi)

**Variations:** Open-Llama-2-Ko will be available in different parameter sizes, 7B and 13B, along with various pretrained options.
**Input:** The model accepts only text input.

**Output:** The model produces text output exclusively.

**Model Architecture:**

Open-Llama-2-Ko is an auto-regressive language model that leverages an optimized transformer architecture derived from Llama-2.
| |Training Data|Parameters|Content Length|GQA|Tokens|Learning Rate|
|---|---|---|---|---|---|---|
|Llama 2|*A curated mix of publicly accessible Korean corpora*|7B|2k|✗|>15B*|5e<sup>-5</sup>|
**Training Corpus**

The model was trained using selected datasets from AI Hub and Modu Corpus. Detailed information about the training datasets is available below:

- AI Hub: [corpus/AI_HUB](./corpus/AI_HUB)
  - Only the `Training` segment of the data was used.
  - The `Validation` and `Test` segments were deliberately excluded.
- Modu Corpus: [corpus/MODU_CORPUS](./corpus/MODU_CORPUS)
The final JSONL dataset used to train this model is approximately 61GB in size.
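For context, JSONL means one JSON object per line, so a corpus of this size can be streamed record by record rather than loaded whole. A minimal inspection sketch, assuming a hypothetical file name and a `text` field, neither of which is specified by this card:

```python
import json

# Hypothetical file name and schema: one JSON object per line, with the
# document text under a "text" key (the card does not specify the layout).
n_docs = 0
n_chars = 0
with open("corpus.jsonl", encoding="utf-8") as f:
    for line in f:  # streams line by line instead of loading all 61GB
        record = json.loads(line)
        n_docs += 1
        n_chars += len(record["text"])

print(f"{n_docs:,} documents, {n_chars:,} characters")
```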
Total token count: approximately 15 billion tokens (*using the expanded tokenizer; with the original Llama tokenizer, >60 billion tokens).
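The gap between the two counts reflects how much more compactly the expanded tokenizer represents Korean than the original Llama tokenizer, which falls back to byte-level pieces for most Korean syllables. A quick way to verify this is to tokenize the same sentence with both; the repository IDs below are illustrative:

```python
from transformers import AutoTokenizer

# Illustrative repository IDs -- substitute the tokenizers you actually use.
base_tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
ko_tok = AutoTokenizer.from_pretrained("beomi/open-llama-2-ko-7b")

text = "안녕하세요, 오늘은 날씨가 좋네요."  # "Hello, the weather is nice today."

# Byte-fallback makes the base tokenizer spend roughly 4x more tokens
# on the same Korean text than the expanded-vocabulary tokenizer does.
print(len(base_tok.tokenize(text)))
print(len(ko_tok.tokenize(text)))
```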
**Vocab Expansion**

...

TBD
## Acknowledgements

- Training support was provided by the [TPU Research Cloud](https://sites.research.google/trc/) program.
- The training corpus includes data from [AI Hub](https://www.aihub.or.kr/), [Modu Corpus](https://corpus.korean.go.kr/), and [Korean Wikipedia](https://dumps.wikimedia.org/kowiki/).