Update README.md
We train on three sources of data to create this model:
* [openchat/openchat_sharegpt4_dataset](https://huggingface.co/datasets/openchat/openchat_sharegpt4_dataset/resolve/main/sharegpt_gpt4.json) - 6,206 conversations
  * Multilingual conversations of humans talking to GPT-4.
<details><summary>We prepare our data like so:</summary>

```python
import pandas as pd
from datasets import Dataset, load_dataset, concatenate_datasets

### Tagengo
gpt4_dataset = load_dataset("lightblue/tagengo-gpt4", split="train")
# Keep only conversations whose GPT-4 response ended with a "stop" finish
# reason (i.e. was not truncated or refused).
gpt4_dataset = gpt4_dataset.filter(lambda x: x["response"][1] == "stop")
####

### Megagon
megagon_df = pd.read_json(
    "https://raw.githubusercontent.com/megagonlabs/instruction_ja/main/data/data.jsonl",
    lines=True,
    orient="records"
)
# Map the Megagon speaker names onto the ShareGPT-style "human"/"gpt" roles.
role_map = {"user": "human", "agent": "gpt"}
megagon_df["conversations"] = megagon_df.utterances.apply(
    lambda x: [{"from": role_map[y["name"]], "value": y["text"]} for y in x]
)
megagon_df["language"] = "Japanese"
megagon_df = megagon_df[["conversations", "language"]]
megagon_dataset = Dataset.from_pandas(megagon_df)
###

### Openchat
openchat_df = pd.read_json("https://huggingface.co/datasets/openchat/openchat_sharegpt4_dataset/resolve/main/sharegpt_gpt4.json?download=true")
openchat_df["conversations"] = openchat_df["items"]
openchat_dataset = Dataset.from_pandas(openchat_df)
###

# The three datasets carry different extra columns, so keep only the shared
# "conversations" column before concatenating (concatenate_datasets requires
# matching features).
dataset = concatenate_datasets([
    d.select_columns(["conversations"])
    for d in (gpt4_dataset, megagon_dataset, openchat_dataset)
])
# Drop any conversation that contains an empty turn.
dataset = dataset.filter(lambda x: not any([y["value"] is None for y in x["conversations"]]))
dataset.select_columns(["conversations"]).to_json("/workspace/llm_training/axolotl/llama3-multilingual/tagengo_openchat_megagon.json")
```

</details>
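
A quick way to sanity-check the exported file (our sketch, not part of the original recipe; the path is the one used above) is to reload it with the `datasets` JSON loader and spot-check the ShareGPT-style schema:

```python
from datasets import load_dataset

# Reload the exported file (Dataset.to_json writes JSON Lines by default)
# and spot-check the ShareGPT-style schema.
check = load_dataset(
    "json",
    data_files="/workspace/llm_training/axolotl/llama3-multilingual/tagengo_openchat_megagon.json",
    split="train",
)
first = check[0]["conversations"]
# Every turn should be a {"from": ..., "value": ...} dict with a known role.
assert all(turn["from"] in {"human", "gpt", "system"} for turn in first)
print(f"{len(check):,} conversations; first example has {len(first)} turns")
```
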
# workspace/llm_training/axolotl/llama3-multilingual/output_tagengo_openchat_megagon_8B_llama3
This model is a fine-tuned version of [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) on the dataset described above.
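
The diff stops at the model description, but for completeness here is a minimal, hedged inference sketch using the standard `transformers` chat-template API; the model path is assumed to be the local axolotl output directory named in the heading above (substitute a Hub model id if the weights were pushed there):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed location of the fine-tuned weights: the axolotl output directory
# from the heading above. Swap in a Hub id if the model was pushed there.
model_path = "/workspace/llm_training/axolotl/llama3-multilingual/output_tagengo_openchat_megagon_8B_llama3"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.bfloat16, device_map="auto"
)

# Llama 3 Instruct checkpoints ship a chat template, so a multilingual
# prompt can be formatted with apply_chat_template.
messages = [{"role": "user", "content": "ボットの名前は何ですか？"}]  # "What is the bot's name?"
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```
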