ptrdvn committed Update README.md (commit b65414a, parent 02601cf)

Files changed: README.md (+39 −0)
We train on three sources of data to create this model:
* [openchat/openchat_sharegpt4_dataset](https://huggingface.co/datasets/openchat/openchat_sharegpt4_dataset/resolve/main/sharegpt_gpt4.json) - 6,206 conversations
  * Multilingual conversations of humans talking to GPT-4.

<details><summary>We prepare our data like so:</summary>

```python
import pandas as pd
from datasets import Dataset, load_dataset, concatenate_datasets

### Tagengo
# Keep only conversations where GPT-4 finished its response normally.
gpt4_dataset = load_dataset("lightblue/tagengo-gpt4", split="train")
gpt4_dataset = gpt4_dataset.filter(lambda x: x["response"][1] == "stop")
###

### Megagon
megagon_df = pd.read_json(
    "https://raw.githubusercontent.com/megagonlabs/instruction_ja/main/data/data.jsonl",
    lines=True,
    orient="records"
)
# Map Megagon role names onto the ShareGPT "human"/"gpt" convention.
role_map = {"user": "human", "agent": "gpt"}
megagon_df["conversations"] = megagon_df.utterances.apply(lambda x: [{"from": role_map[y["name"]], "value": y["text"]} for y in x])
megagon_df["language"] = "Japanese"
megagon_df = megagon_df[["conversations", "language"]]
megagon_dataset = Dataset.from_pandas(megagon_df)
###

### Openchat
openchat_df = pd.read_json("https://huggingface.co/datasets/openchat/openchat_sharegpt4_dataset/resolve/main/sharegpt_gpt4.json?download=true")
openchat_df["conversations"] = openchat_df["items"]
openchat_dataset = Dataset.from_pandas(openchat_df)
###

# Combine all three sources and drop conversations containing empty turns.
dataset = concatenate_datasets([gpt4_dataset, megagon_dataset, openchat_dataset])
dataset = dataset.filter(lambda x: not any([y["value"] is None for y in x["conversations"]]))
dataset.select_columns(["conversations"]).to_json("/workspace/llm_training/axolotl/llama3-multilingual/tagengo_openchat_megagon.json")
```

</details>

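As a quick sanity check, the Megagon role mapping above can be exercised on a toy record. The sample utterances below are made up for illustration; only the `role_map` and the transformation logic come from the preparation script:

```python
# Same role mapping as in the data-preparation script above.
role_map = {"user": "human", "agent": "gpt"}

def to_sharegpt(utterances):
    # Convert Megagon-style [{"name": ..., "text": ...}] turns into
    # ShareGPT-style [{"from": ..., "value": ...}] entries.
    return [{"from": role_map[u["name"]], "value": u["text"]} for u in utterances]

# Hypothetical sample record for illustration only.
sample = [
    {"name": "user", "text": "日本の首都はどこですか？"},
    {"name": "agent", "text": "東京です。"},
]
print(to_sharegpt(sample))
```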
# workspace/llm_training/axolotl/llama3-multilingual/output_tagengo_openchat_megagon_8B_llama3

This model is a fine-tuned version of [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) on the dataset described above.
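For reference, a fine-tune like this can be described to axolotl with a config along the following lines. Only the base model, the dataset path, and the output directory come from this README; every hyperparameter below is an illustrative assumption, not the setting actually used to train this model:

```yaml
# Illustrative axolotl config sketch -- hyperparameters are assumptions,
# not the values used for this model.
base_model: meta-llama/Meta-Llama-3-8B-Instruct

datasets:
  - path: /workspace/llm_training/axolotl/llama3-multilingual/tagengo_openchat_megagon.json
    type: sharegpt
    conversation: llama3

output_dir: /workspace/llm_training/axolotl/llama3-multilingual/output_tagengo_openchat_megagon_8B_llama3

sequence_len: 8192
sample_packing: true
micro_batch_size: 2
gradient_accumulation_steps: 2
num_epochs: 1
learning_rate: 1e-5
lr_scheduler: cosine
bf16: auto
```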