beomi committed on
Commit 84ae877 • 1 Parent(s): b10288c

Update README.md

Files changed (1)
  1. README.md +21 -24
README.md CHANGED
@@ -18,49 +18,46 @@ library_name: transformers
 
 **Update Log**
 
- - 2023.12.14: First Release of Open-Llama-2-Ko
 
 # **Open-Llama-2-Ko** 🦙🇰🇷
 
- Open-Llama-2-Ko serves as an advanced iteration of Llama 2, benefiting from an expanded vocabulary and the inclusion of a Korean corpus in its further pretraining.
- Just like its predecessor, Llama-2-Ko operates within the broad range of generative text models that stretch from 7 billion to 70 billion parameters.
- This repository focuses on the 7B pretrained version, which is tailored to fit the Hugging Face Transformers format.
 
- The main difference between Llama-2-Ko Series and Open-Llama-2-Ko is the dataset, Open-Llama-2-Ko series only used publicly accessable Korean corpus,
- including [AI Hub](https://www.aihub.or.kr), [Modu Corpus, 모두의 말뭉치](https://corpus.korean.go.kr/) and [Korean Wikipedia](https://dumps.wikimedia.org/kowiki/).
 
- Since the train is done using only publicly available corpus, this model is opened to everyone without any restrictions. (*This model follows MIT License)
 
 ## Model Details
 
- **Model Developers** Junbum Lee (Beomi)
 
- **Variations** Open-Llama-2-Ko will come in a range of parameter sizes — 7B and 13B — as well as pretrained variations.
 
- **Input** Models input text only.
 
- **Output** Models generate text only.
 
- **Model Architecture**
 
- Open-Llama-2-Ko is an auto-regressive language model that uses an optimized transformer architecture based on Llama-2.
 
- ||Training Data|Params|Content Length|GQA|Tokens|LR|
 |---|---|---|---|---|---|---|
- |Llama 2|*A new mix of Publicly Accessable Korean Corpus*|7B|2k|&#10007;|>15B*|5e<sup>-5</sup>|
 
- **Train Corpus**
 
- Trained with selected corpus within AIHub/Modu Corpus. The detailed dataset list to train this model is available below:
 
 - AI Hub: [corpus/AI_HUB](./corpus/AI_HUB)
- - Used only `Training` part of the data.
- - Explicitly dropped `Validation`/`Test` part of the data.
 - Modu Corpus: [corpus/MODU_CORPUS](./corpus/MODU_CORPUS)
 
- Final JSONL dataset to trian this model is: 61GB.
 
- Total amount of tokens: (Approx.) 15B Tokens (*using expanded tokenizer. with original Llama tokenizer, >60B tokens.)
 
 **Vocab Expansion**
 
@@ -95,7 +92,7 @@ TBD
 
 TBD
 
- ## Acknowledgement
 
- - The training is supported by [TPU Research Cloud](https://sites.research.google/trc/) program.
- - The training corpus is from [AI Hub](https://www.aihub.or.kr/), [Modu Corpus](https://corpus.korean.go.kr/) and [Korean Wikipedia](https://dumps.wikimedia.org/kowiki/).
 
 
 **Update Log**
 
+ - 2023.12.14: Initial Release of Open-Llama-2-Ko
 
 # **Open-Llama-2-Ko** 🦙🇰🇷
 
+ Open-Llama-2-Ko represents an advanced iteration of the Llama 2 model, featuring an expanded vocabulary and the inclusion of a Korean corpus in its further pretraining. Like its predecessor, Llama-2-Ko, it belongs to the family of generative text models that range from 7 billion to 70 billion parameters. The focus of this repository is the 7B pretrained version, designed to integrate seamlessly with the Hugging Face Transformers format.
 
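For reference, a checkpoint published in the Transformers format can typically be loaded along the lines sketched below; the repository ID `beomi/open-llama-2-ko-7b`, the prompt, and the generation settings are illustrative assumptions rather than details taken from this card.

```python
# Minimal sketch: load the 7B checkpoint with Hugging Face Transformers and
# generate a short Korean continuation. The repository ID is an assumption.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "beomi/open-llama-2-ko-7b"  # assumed repository ID

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision so the 7B model fits on a single GPU
    device_map="auto",
)

prompt = "대한민국의 수도는"  # "The capital of South Korea is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
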
+ The primary distinction between the Llama-2-Ko Series and Open-Llama-2-Ko lies in the dataset. Open-Llama-2-Ko exclusively utilizes publicly accessible Korean corpora, including sources such as [AI Hub](https://www.aihub.or.kr), [Modu Corpus, 모두의 말뭉치](https://corpus.korean.go.kr/), and [Korean Wikipedia](https://dumps.wikimedia.org/kowiki/).
 
+ As training was conducted solely with publicly available corpora, this model is open for unrestricted use by everyone, adhering to the MIT License.
 
 ## Model Details
 
+ **Model Developers:** Junbum Lee (Beomi)
 
+ **Variations:** Open-Llama-2-Ko will be available in different parameter sizes — 7B and 13B — along with various pretrained options.
 
+ **Input:** The model accepts only text input.
 
+ **Output:** The model produces text output exclusively.
 
+ **Model Architecture:**
 
+ Open-Llama-2-Ko is an auto-regressive language model that leverages an optimized transformer architecture derived from Llama-2.
 
+ | |Training Data|Parameters|Content Length|GQA|Tokens|Learning Rate|
 |---|---|---|---|---|---|---|
+ |Llama 2|*A curated mix of Publicly Accessible Korean Corpora*|7B|2k|✘|>15B*|5e<sup>-5</sup>|
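The table entries can be cross-checked against the checkpoint's configuration; a brief sketch, again assuming the repository ID `beomi/open-llama-2-ko-7b`:

```python
# Sketch: inspect architecture hyperparameters of the (assumed) 7B checkpoint.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("beomi/open-llama-2-ko-7b")  # assumed repository ID

print(config.model_type)               # expected to be "llama"
print(config.max_position_embeddings)  # content length (2k per the table above)
print(config.vocab_size)               # larger than Llama 2's 32,000 after vocab expansion
```
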
 
+ **Training Corpus**
 
+ The model was trained using selected datasets from AI Hub and Modu Corpus. Detailed information about the training datasets is available below:
 
 - AI Hub: [corpus/AI_HUB](./corpus/AI_HUB)
+ - Only the `Training` segment of the data was used.
+ - The `Validation` and `Test` segments were deliberately excluded.
 - Modu Corpus: [corpus/MODU_CORPUS](./corpus/MODU_CORPUS)
 
+ The final JSONL dataset used to train this model is approximately 61GB in size.
 
+ Total token count: approximately 15 billion tokens (*counted with the expanded tokenizer; the same corpus exceeds 60 billion tokens with the original Llama tokenizer).
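
The gap between the two counts can be reproduced informally by tokenizing the same Korean text with both tokenizers; in the sketch below, the repository IDs and the `text` field name of the JSONL records are assumptions.

```python
# Sketch: compare token counts for Korean text under the expanded tokenizer
# versus the original Llama 2 tokenizer. Repository IDs are assumptions, and
# the original Llama 2 repository is gated on the Hugging Face Hub.
import json
from transformers import AutoTokenizer

expanded = AutoTokenizer.from_pretrained("beomi/open-llama-2-ko-7b")  # assumed repo ID
original = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")  # gated repo

sample = "안녕하세요, 오늘은 날씨가 참 좋네요."
print(len(expanded.tokenize(sample)), "tokens with the expanded tokenizer")
print(len(original.tokenize(sample)), "tokens with the original Llama tokenizer")

# Rough corpus-level count over one JSONL shard; a "text" field is assumed.
total = 0
with open("corpus_shard.jsonl", encoding="utf-8") as f:
    for line in f:
        total += len(expanded.tokenize(json.loads(line)["text"]))
print(total, "tokens in corpus_shard.jsonl")
```
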
 
 **Vocab Expansion**
 
 TBD
 
+ ## Acknowledgements
 
+ - Training support was provided by the [TPU Research Cloud](https://sites.research.google/trc/) program.
+ - The training corpus includes data from [AI Hub](https://www.aihub.or.kr/), [Modu Corpus](https://corpus.korean.go.kr/), and [Korean Wikipedia](https://dumps.wikimedia.org/kowiki/).