beomi committed on
Commit 84ae877 • 1 Parent(s): b10288c

Update README.md

Files changed (1)
  1. README.md +21 -24
README.md CHANGED
@@ -18,49 +18,46 @@ library_name: transformers
 
 **Update Log**
 
- - 2023.12.14: First Release of Open-Llama-2-Ko
 
 # **Open-Llama-2-Ko** 🦙🇰🇷
 
- Open-Llama-2-Ko serves as an advanced iteration of Llama 2, benefiting from an expanded vocabulary and the inclusion of a Korean corpus in its further pretraining.
- Just like its predecessor, Llama-2-Ko operates within the broad range of generative text models that stretch from 7 billion to 70 billion parameters.
- This repository focuses on the 7B pretrained version, which is tailored to fit the Hugging Face Transformers format.
 
- The main difference between Llama-2-Ko Series and Open-Llama-2-Ko is the dataset, Open-Llama-2-Ko series only used publicly accessable Korean corpus,
- including [AI Hub](https://www.aihub.or.kr), [Modu Corpus, 모두의 말뭉치](https://corpus.korean.go.kr/) and [Korean Wikipedia](https://dumps.wikimedia.org/kowiki/).
 
- Since the train is done using only publicly available corpus, this model is opened to everyone without any restrictions. (*This model follows MIT License)
 
 ## Model Details
 
- **Model Developers** Junbum Lee (Beomi)
 
- **Variations** Open-Llama-2-Ko will come in a range of parameter sizes — 7B and 13B — as well as pretrained variations.
 
- **Input** Models input text only.
 
- **Output** Models generate text only.
 
- **Model Architecture**
 
- Open-Llama-2-Ko is an auto-regressive language model that uses an optimized transformer architecture based on Llama-2.
 
- ||Training Data|Params|Content Length|GQA|Tokens|LR|
 |---|---|---|---|---|---|---|
- |Llama 2|*A new mix of Publicly Accessable Korean Corpus*|7B|2k|&#10007;|>15B*|5e<sup>-5</sup>|
 
- **Train Corpus**
 
- Trained with selected corpus within AIHub/Modu Corpus. The detailed dataset list to train this model is available below:
 
 - AI Hub: [corpus/AI_HUB](./corpus/AI_HUB)
- - Used only `Training` part of the data.
- - Explicitly dropped `Validation`/`Test` part of the data.
 - Modu Corpus: [corpus/MODU_CORPUS](./corpus/MODU_CORPUS)
 
- Final JSONL dataset to trian this model is: 61GB.
 
- Total amount of tokens: (Approx.) 15B Tokens (*using expanded tokenizer. with original Llama tokenizer, >60B tokens.)
 
 **Vocab Expansion**
 
@@ -95,7 +92,7 @@ TBD
 
 TBD
 
- ## Acknowledgement
 
- - The training is supported by [TPU Research Cloud](https://sites.research.google/trc/) program.
- - The training corpus is from [AI Hub](https://www.aihub.or.kr/), [Modu Corpus](https://corpus.korean.go.kr/) and [Korean Wikipedia](https://dumps.wikimedia.org/kowiki/).
 
 
 **Update Log**
 
+ - 2023.12.14: Initial Release of Open-Llama-2-Ko
 
 # **Open-Llama-2-Ko** 🦙🇰🇷
 
+ Open-Llama-2-Ko represents an advanced iteration of the Llama 2 model, featuring an expanded vocabulary and the inclusion of a Korean corpus in its further pretraining. Like its predecessor, Llama-2-Ko, it belongs to the family of generative text models that range from 7 billion to 70 billion parameters. The focus of this repository is the 7B pretrained version, designed to integrate seamlessly with the Hugging Face Transformers format.
 
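For reference, a checkpoint published in the Transformers format can typically be loaded along the lines sketched below; the repository ID `beomi/open-llama-2-ko-7b`, the prompt, and the generation settings are illustrative assumptions rather than details taken from this card.

```python
# Minimal sketch: load the 7B checkpoint with Hugging Face Transformers and
# generate a short Korean continuation. The repository ID is an assumption.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "beomi/open-llama-2-ko-7b"  # assumed repository ID

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision so the 7B model fits on a single GPU
    device_map="auto",
)

prompt = "대한민국의 수도는"  # "The capital of South Korea is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
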
+ The primary distinction between the Llama-2-Ko Series and Open-Llama-2-Ko lies in the dataset. Open-Llama-2-Ko exclusively utilizes publicly accessible Korean corpora, including sources such as [AI Hub](https://www.aihub.or.kr), [Modu Corpus, 모두의 말뭉치](https://corpus.korean.go.kr/), and [Korean Wikipedia](https://dumps.wikimedia.org/kowiki/).
 
+ As training was conducted solely with publicly available corpora, this model is open for unrestricted use by everyone, adhering to the MIT License.
 
 ## Model Details
 
+ **Model Developers:** Junbum Lee (Beomi)
 
+ **Variations:** Open-Llama-2-Ko will be available in different parameter sizes — 7B and 13B — along with various pretrained options.
 
+ **Input:** The model accepts only text input.
 
+ **Output:** The model produces text output exclusively.
 
+ **Model Architecture:**
 
+ Open-Llama-2-Ko is an auto-regressive language model that leverages an optimized transformer architecture derived from Llama-2.
 
+ | |Training Data|Parameters|Content Length|GQA|Tokens|Learning Rate|
 |---|---|---|---|---|---|---|
+ |Llama 2|*A curated mix of Publicly Accessible Korean Corpora*|7B|2k|✘|>15B*|5e<sup>-5</sup>|
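The table entries can be cross-checked against the checkpoint's configuration; a brief sketch, again assuming the repository ID `beomi/open-llama-2-ko-7b`:

```python
# Sketch: inspect architecture hyperparameters of the (assumed) 7B checkpoint.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("beomi/open-llama-2-ko-7b")  # assumed repository ID

print(config.model_type)               # expected to be "llama"
print(config.max_position_embeddings)  # content length (2k per the table above)
print(config.vocab_size)               # larger than Llama 2's 32,000 after vocab expansion
```
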
 
+ **Training Corpus**
 
+ The model was trained using selected datasets from AI Hub and Modu Corpus. Detailed information about the training datasets is available below:
 
 - AI Hub: [corpus/AI_HUB](./corpus/AI_HUB)
+ - Only the `Training` segment of the data was used.
+ - The `Validation` and `Test` segments were deliberately excluded.
 - Modu Corpus: [corpus/MODU_CORPUS](./corpus/MODU_CORPUS)
 
+ The final JSONL dataset used to train this model is approximately 61GB in size.
 
+ Total token count: approximately 15 billion tokens (*counted with the expanded tokenizer; the same corpus exceeds 60 billion tokens with the original Llama tokenizer).
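
The gap between the two counts can be reproduced informally by tokenizing the same Korean text with both tokenizers; in the sketch below, the repository IDs and the `text` field name of the JSONL records are assumptions.

```python
# Sketch: compare token counts for Korean text under the expanded tokenizer
# versus the original Llama 2 tokenizer. Repository IDs are assumptions, and
# the original Llama 2 repository is gated on the Hugging Face Hub.
import json
from transformers import AutoTokenizer

expanded = AutoTokenizer.from_pretrained("beomi/open-llama-2-ko-7b")  # assumed repo ID
original = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")  # gated repo

sample = "안녕하세요, 오늘은 날씨가 참 좋네요."
print(len(expanded.tokenize(sample)), "tokens with the expanded tokenizer")
print(len(original.tokenize(sample)), "tokens with the original Llama tokenizer")

# Rough corpus-level count over one JSONL shard; a "text" field is assumed.
total = 0
with open("corpus_shard.jsonl", encoding="utf-8") as f:
    for line in f:
        total += len(expanded.tokenize(json.loads(line)["text"]))
print(total, "tokens in corpus_shard.jsonl")
```
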
 
 **Vocab Expansion**
 
 TBD
 
+ ## Acknowledgements
 
+ - Training support was provided by the [TPU Research Cloud](https://sites.research.google/trc/) program.
+ - The training corpus includes data from [AI Hub](https://www.aihub.or.kr/), [Modu Corpus](https://corpus.korean.go.kr/), and [Korean Wikipedia](https://dumps.wikimedia.org/kowiki/).