Xidong committed on
Commit
f5248ba
1 Parent(s): 485014b

Upload README.md

  1. README.md +259 -0
README.md ADDED
@@ -0,0 +1,259 @@
1
+ ---
2
+ license: apache-2.0
3
+ datasets:
4
+ - FreedomIntelligence/ApolloMoEDataset
5
+ language:
6
+ - ar
7
+ - en
8
+ - zh
9
+ - ko
10
+ - ja
11
+ - mn
12
+ - th
13
+ - vi
14
+ - lo
15
+ - mg
16
+ - de
17
+ - pt
18
+ - es
19
+ - fr
20
+ - ru
21
+ - it
22
+ - hr
23
+ - gl
24
+ - cs
25
+ - co
26
+ - la
27
+ - uk
28
+ - bs
29
+ - bg
30
+ - eo
31
+ - sq
32
+ - da
33
+ - sa
34
+ - 'no'
35
+ - gn
36
+ - sr
37
+ - sk
38
+ - gd
39
+ - lb
40
+ - hi
41
+ - ku
42
+ - mt
43
+ - he
44
+ - ln
45
+ - bm
46
+ - sw
47
+ - ig
48
+ - rw
49
+ - ha
50
+ metrics:
51
+ - accuracy
52
+ base_model:
53
+ - google/gemma-2-9b
54
+ pipeline_tag: question-answering
55
+ tags:
56
+ - biology
57
+ - medical
58
+ ---
59
+ # Democratizing Medical LLMs for Many More Languages
60
+
61
+ Covers 12 major languages (English, Chinese, French, Hindi, Spanish, Arabic, Russian, Japanese, Korean, German, Italian, and Portuguese) and 38 minor languages so far.
62
+
63
+
64
+
65
+
66
+ <p align="center">
67
+ 📃 <a href="https://arxiv.org/abs/2410.10626" target="_blank">Paper</a> • 🌐 <a href="" target="_blank">Demo</a> • 🤗 <a href="https://huggingface.co/datasets/FreedomIntelligence/ApolloMoEDataset" target="_blank">ApolloMoEDataset</a> • 🤗 <a href="https://huggingface.co/datasets/FreedomIntelligence/ApolloMoEBench" target="_blank">ApolloMoEBench</a> • 🤗 <a href="https://huggingface.co/collections/FreedomIntelligence/apollomoe-and-apollo2-670ddebe3bb1ba1aebabbf2c" target="_blank">Models</a> • 🌐 <a href="https://github.com/FreedomIntelligence/Apollo" target="_blank">Apollo</a>
68
+ </p>
69
+
70
+
71
+
72
+ ![Apollo](assets/apollo_medium_final.png)
73
+
74
+
75
+ ## 🌈 Update
76
+
77
+ * **[2024.10.15]** ApolloMoE repo is published!🎉
78
+
79
+
80
+ ## Architecture
81
+
82
+ <details>
83
+ <summary>Click to view the MoE routing image</summary>
84
+
85
+ ![ApolloMoE](assets/hybrid_routing.png)
86
+
87
+ </details>
88
+
89
+ ## Results
90
+
91
+ ### Dense
92
+ 🤗 <a href="https://huggingface.co/FreedomIntelligence/Apollo2-0.5B" target="_blank">Apollo2-0.5B</a> • 🤗 <a href="https://huggingface.co/FreedomIntelligence/Apollo2-1.5B" target="_blank">Apollo2-1.5B</a> • 🤗 <a href="https://huggingface.co/FreedomIntelligence/Apollo2-2B" target="_blank">Apollo2-2B</a> • 🤗 <a href="https://huggingface.co/FreedomIntelligence/Apollo2-3.8B" target="_blank">Apollo2-3.8B</a> • 🤗 <a href="https://huggingface.co/FreedomIntelligence/Apollo2-7B" target="_blank">Apollo2-7B</a> • 🤗 <a href="https://huggingface.co/FreedomIntelligence/Apollo2-9B" target="_blank">Apollo2-9B</a>
93
+
94
+ <details>
95
+ <summary>Click to view the Dense Models Results</summary>
96
+
97
+ ![ApolloMoE](assets/dense_results.png)
98
+
99
+ </details>
100
+
101
+ ### Post-MoE
102
+ 🤗 <a href="https://huggingface.co/FreedomIntelligence/Apollo-MoE-0.5B" target="_blank">Apollo-MoE-0.5B</a> • 🤗 <a href="https://huggingface.co/FreedomIntelligence/Apollo-MoE-1.5B" target="_blank">Apollo-MoE-1.5B</a> • 🤗 <a href="https://huggingface.co/FreedomIntelligence/Apollo-MoE-7B" target="_blank">Apollo-MoE-7B</a>
103
+
104
+ <details>
105
+ <summary>Click to view the Post-MoE Models Results</summary>
106
+
107
+ ![ApolloMoE](assets/post_moe_results.png)
108
+
109
+ </details>
110
+
111
+
112
+
113
+
114
+
115
+
116
+
117
+ ## Usage Format
118
+ #### Apollo2
119
+ - 0.5B, 1.5B, 7B: `User:{query}\nAssistant:{response}<|endoftext|>`
120
+ - 2B, 9B: `User:{query}\nAssistant:{response}<eos>`
121
+ - 3.8B: `<|user|>\n{query}<|end|><|assistant|>\n{response}<|end|>`
122
+
123
+ #### Apollo-MoE
124
+ - 0.5B, 1.5B, 7B: `User:{query}\nAssistant:{response}<|endoftext|>`
125
+
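The templates above can be applied with a small helper. The following is a minimal sketch, not the project's own code: the names `TEMPLATES`, `format_example`, and `build_prompt` are hypothetical, and the 3.8B template is assumed to use the tag `<|assistant|>`. Only the template strings themselves come from the list above.

```python
# Illustrative sketch; all names here are hypothetical, not part of the Apollo repo.
_ENDOFTEXT = "User:{query}\nAssistant:{response}<|endoftext|>"
_EOS = "User:{query}\nAssistant:{response}<eos>"
_PHI = "<|user|>\n{query}<|end|><|assistant|>\n{response}<|end|>"

# (model family, size) -> full template with query and response slots
TEMPLATES = {
    ("Apollo2", "0.5B"): _ENDOFTEXT,
    ("Apollo2", "1.5B"): _ENDOFTEXT,
    ("Apollo2", "7B"): _ENDOFTEXT,
    ("Apollo2", "2B"): _EOS,
    ("Apollo2", "9B"): _EOS,
    ("Apollo2", "3.8B"): _PHI,
    ("Apollo-MoE", "0.5B"): _ENDOFTEXT,
    ("Apollo-MoE", "1.5B"): _ENDOFTEXT,
    ("Apollo-MoE", "7B"): _ENDOFTEXT,
}


def format_example(family: str, size: str, query: str, response: str) -> str:
    """Fill the full template, e.g. for building a training example."""
    return TEMPLATES[(family, size)].format(query=query, response=response)


def build_prompt(family: str, size: str, query: str) -> str:
    """Return the template up to the response slot, for use at inference time."""
    prefix = TEMPLATES[(family, size)].split("{response}")[0]
    return prefix.format(query=query)


if __name__ == "__main__":
    print(build_prompt("Apollo2", "9B", "What causes anemia?"))
```

At inference time the model is expected to continue the text after `Assistant:` (or after the assistant tag for 3.8B) and stop at the end token shown in the template.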
126
+ ## Dataset & Evaluation
127
+
128
+ - Dataset
129
+ 🤗 <a href="https://huggingface.co/datasets/FreedomIntelligence/ApolloMoEDataset" target="_blank">ApolloMoEDataset</a>
130
+
131
+ <details><summary>Click to expand</summary>
132
+
133
+ ![ApolloMoE](assets/Dataset.png)
134
+
135
+ - [Data category](https://huggingface.co/datasets/FreedomIntelligence/ApolloCorpus/tree/main/train)
136
+
137
+
138
+ </details>
139
+
140
+ - Evaluation
141
+ 🤗 <a href="https://huggingface.co/datasets/FreedomIntelligence/ApolloMoEBench" target="_blank">ApolloMoEBench</a>
142
+
143
+ <details><summary>Click to expand</summary>
144
+
145
+ - EN:
146
+ - [MedQA-USMLE](https://huggingface.co/datasets/GBaker/MedQA-USMLE-4-options)
147
+ - [MedMCQA](https://huggingface.co/datasets/medmcqa/viewer/default/test)
148
+ - [PubMedQA](https://huggingface.co/datasets/pubmed_qa): Not used in the paper because the results fluctuated too much.
149
+ - [MMLU-Medical](https://huggingface.co/datasets/cais/mmlu)
150
+ - Clinical knowledge, Medical genetics, Anatomy, Professional medicine, College biology, College medicine
151
+ - ZH:
152
+ - [MedQA-MCMLE](https://huggingface.co/datasets/bigbio/med_qa/viewer/med_qa_zh_4options_bigbio_qa/test)
153
+ - [CMB-single](https://huggingface.co/datasets/FreedomIntelligence/CMB): Not used in the paper
154
+ - A random sample of 2,000 single-answer multiple-choice questions.
155
+ - [CMMLU-Medical](https://huggingface.co/datasets/haonan-li/cmmlu)
156
+ - Anatomy, Clinical_knowledge, College_medicine, Genetics, Nutrition, Traditional_chinese_medicine, Virology
157
+ - [CMExam](https://github.com/williamliujl/CMExam): Not used in the paper
158
+ - A random sample of 2,000 multiple-choice questions.
159
+
160
+
161
+ - ES: [Head_qa](https://huggingface.co/datasets/head_qa)
162
+ - FR:
163
+ - [Frenchmedmcqa](https://github.com/qanastek/FrenchMedMCQA)
164
+ - [MMLU_FR]
165
+ - Clinical knowledge, Medical genetics, Anatomy, Professional medicine, College biology, College medicine
166
+ - HI: [MMLU_HI](https://huggingface.co/datasets/FreedomIntelligence/MMLU_Hindi)
167
+ - Clinical knowledge, Medical genetics, Anatomy, Professional medicine, College biology, College medicine
168
+ - AR: [MMLU_AR](https://huggingface.co/datasets/FreedomIntelligence/MMLU_Arabic)
169
+ - Clinical knowledge, Medical genetics, Anatomy, Professional medicine, College biology, College medicine
170
+ - JA: [IgakuQA](https://github.com/jungokasai/IgakuQA)
171
+ - KO: [KorMedMCQA](https://huggingface.co/datasets/sean0042/KorMedMCQA)
172
+ - IT:
173
+ - [MedExpQA](https://huggingface.co/datasets/HiTZ/MedExpQA)
174
+ - [MMLU_IT]
175
+ - Clinical knowledge, Medical genetics, Anatomy, Professional medicine, College biology, College medicine
176
+ - DE: [BioInstructQA](https://huggingface.co/datasets/BioMistral/BioInstructQA): German part
177
+ - PT: [BioInstructQA](https://huggingface.co/datasets/BioMistral/BioInstructQA): Portuguese part
178
+ - RU: [RuMedBench](https://github.com/sb-ai-lab/MedBench)
179
+
180
+
181
+
182
+
183
+
184
+ </details>
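Two of the ZH benchmarks listed above are subsampled to 2,000 multiple-choice questions. The README does not state how the sample is drawn, so the following is only a sketch of one reproducible way to do it: the function name `sample_questions` and the fixed seed are assumptions, not the authors' procedure.

```python
import random


def sample_questions(questions, k=2000, seed=0):
    """Draw k questions without replacement using a fixed seed.

    If fewer than k questions are available, all of them are returned.
    The seed value is an arbitrary assumption for reproducibility.
    """
    questions = list(questions)
    if len(questions) <= k:
        return questions
    return random.Random(seed).sample(questions, k)


if __name__ == "__main__":
    pool = [{"id": i} for i in range(5000)]  # stand-in for a loaded benchmark
    subset = sample_questions(pool)
    print(len(subset))
```

Using a dedicated `random.Random(seed)` instance keeps the draw independent of any other use of the global random state in the evaluation code.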
185
+
186
+
187
+ ## Results reproduction
188
+ <details><summary>Click to expand</summary>
189
+
190
+
191
+ We take Gemma-2b as an example.
192
+ 1. Download the dataset for the project:
193
+
194
+ ```
195
+ bash 0.download_data.sh
196
+ ```
197
+
198
+ 2. Prepare the test and dev data for a specific model:
199
+
200
+
201
+ - Create test data with the model's special tokens. You can use ./util/check.ipynb to check a model's special tokens.
202
+
203
+ ```
204
+ bash "1.data_process_test&dev.sh"
205
+ ```
206
+
207
+ 3. Prepare the training data for a specific model (tokenized in advance):
208
+
209
+
210
+ - You can adjust the data training order and the number of training epochs in this step.
211
+
212
+ ```
213
+ bash 2.data_process_train.sh
214
+ ```
215
+
216
+ 4. Train the model
217
+
218
+
219
+ - To train on multiple nodes, refer to ./scripts/multi_node_train_*.sh
220
+
221
+
222
+
223
+
224
+ ```
225
+ bash 3.single_node_train_gemma.sh
226
+ ```
227
+
228
+
229
+ 5. Evaluate your model: generate scores for the benchmark
230
+
231
+ ```
232
+ bash 4.eval.sh
233
+ ```
234
+
235
+ 6. Evaluate your model: chat with your checkpoints from the command line
236
+
237
+ ```
238
+ python ./src/evaluate/cli_demo.py --model_name='./ckpts/your/path/tfmr'
239
+ ```
240
+
241
+ </details>
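Steps 1 to 5 above can also be chained from a small driver script. This is a hedged sketch rather than part of the repository: it assumes the five scripts sit in the current working directory and take no arguments, and the names `STEPS`, `build_commands`, and `run_pipeline` are hypothetical. Passing each command as an argument list means no shell interprets the `&` in the step 2 filename.

```python
import subprocess

# Script names as listed in the reproduction steps; their location is an assumption.
STEPS = [
    "0.download_data.sh",
    "1.data_process_test&dev.sh",
    "2.data_process_train.sh",
    "3.single_node_train_gemma.sh",
    "4.eval.sh",
]


def build_commands(steps=STEPS):
    """Return one argument list per step, so filenames need no shell quoting."""
    return [["bash", script] for script in steps]


def run_pipeline(steps=STEPS, dry_run=False):
    """Run the steps in order and stop at the first failure."""
    for cmd in build_commands(steps):
        print("running:", " ".join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError on a non-zero exit code.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run_pipeline(dry_run=True)  # switch to dry_run=False to actually execute
```

The dry run only prints the commands, which is a cheap way to confirm the order before launching a long training job.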
242
+
243
+
244
+
245
+ ## Citation
246
+ Please use the following citation if you intend to use our dataset for training or evaluation:
247
+
248
+ ```
249
+ @misc{zheng2024efficientlydemocratizingmedicalllms,
250
+ title={Efficiently Democratizing Medical LLMs for 50 Languages via a Mixture of Language Family Experts},
251
+ author={Guorui Zheng and Xidong Wang and Juhao Liang and Nuo Chen and Yuping Zheng and Benyou Wang},
252
+ year={2024},
253
+ eprint={2410.10626},
254
+ archivePrefix={arXiv},
255
+ primaryClass={cs.CL},
256
+ url={https://arxiv.org/abs/2410.10626},
257
+ }
258
+ ```
259
+