Xidong committed on
Commit
e00fe0b
•
1 Parent(s): 795388f

Upload README.md

  1. README.md +260 -0
README.md ADDED
@@ -0,0 +1,260 @@
1
+ ---
2
+ license: apache-2.0
3
+ datasets:
4
+ - FreedomIntelligence/ApolloMoEDataset
5
+ language:
6
+ - ar
7
+ - en
8
+ - zh
9
+ - ko
10
+ - ja
11
+ - mn
12
+ - th
13
+ - vi
14
+ - lo
15
+ - mg
16
+ - de
17
+ - pt
18
+ - es
19
+ - fr
20
+ - ru
21
+ - it
22
+ - hr
23
+ - gl
24
+ - cs
25
+ - co
26
+ - la
27
+ - uk
28
+ - bs
29
+ - bg
30
+ - eo
31
+ - sq
32
+ - da
33
+ - sa
34
+ - 'no'
35
+ - gn
36
+ - sr
37
+ - sk
38
+ - gd
39
+ - lb
40
+ - hi
41
+ - ku
42
+ - mt
43
+ - he
44
+ - ln
45
+ - bm
46
+ - sw
47
+ - ig
48
+ - rw
49
+ - ha
50
+ metrics:
51
+ - accuracy
52
+ base_model:
53
+ - google/gemma-2-9b
54
+ pipeline_tag: question-answering
55
+ tags:
56
+ - biology
57
+ - medical
58
+ ---
59
+ # Democratizing Medical LLMs For Much More Languages
60
+
61
+ Covers 12 major languages (English, Chinese, French, Hindi, Spanish, Arabic, Russian, Japanese, Korean, German, Italian, and Portuguese) and 38 minor languages so far.
62
+ <center>
63
+
64
+
65
+
66
+ <p align="center">
67
+ 📃 <a href="https://arxiv.org/abs/2410.10626" target="_blank">Paper</a> • 🌐 <a href="" target="_blank">Demo</a> • 🤗 <a href="https://huggingface.co/datasets/FreedomIntelligence/ApolloMoEDataset" target="_blank">ApolloMoEDataset</a> • 🤗 <a href="https://huggingface.co/datasets/FreedomIntelligence/ApolloMoEBench" target="_blank">ApolloMoEBench</a> • 🤗 <a href="https://huggingface.co/collections/FreedomIntelligence/apollomoe-and-apollo2-670ddebe3bb1ba1aebabbf2c" target="_blank">Models</a> • 🌐 <a href="https://github.com/FreedomIntelligence/Apollo" target="_blank">Apollo</a> • 🌐 <a href="https://github.com/FreedomIntelligence/ApolloMoE" target="_blank">ApolloMoE</a>
68
+ </p>
69
+
70
+
71
+
72
+ ![Apollo](assets/apollo_medium_final.png)
73
+
74
+
75
+ ## 🌈 Update
76
+
77
+ * **[2024.10.15]** The ApolloMoE repo is published! 🎉
78
+
79
+
80
+ ## Architecture
81
+
82
+ <details>
83
+ <summary>Click to view the MoE routing image</summary>
84
+
85
+ ![ApolloMoE](/assets/hybrid_routing.png)
86
+
87
+ </details>
88
+
89
+ ## Results
90
+
91
+ ### Dense
92
+ 🤗 <a href="https://huggingface.co/FreedomIntelligence/Apollo2-0.5B" target="_blank">Apollo2-0.5B</a> • 🤗 <a href="https://huggingface.co/FreedomIntelligence/Apollo2-1.5B" target="_blank">Apollo2-1.5B</a> • 🤗 <a href="https://huggingface.co/FreedomIntelligence/Apollo2-2B" target="_blank">Apollo2-2B</a> • 🤗 <a href="https://huggingface.co/FreedomIntelligence/Apollo2-3.8B" target="_blank">Apollo2-3.8B</a> • 🤗 <a href="https://huggingface.co/FreedomIntelligence/Apollo2-7B" target="_blank">Apollo2-7B</a> • 🤗 <a href="https://huggingface.co/FreedomIntelligence/Apollo2-9B" target="_blank">Apollo2-9B</a>
93
+
94
+ <details>
95
+ <summary>Click to view the Dense Models Results</summary>
96
+
97
+ ![ApolloMoE](assets/dense_results.png)
98
+
99
+ </details>
100
+
101
+ ### Post-MoE
102
+ 🤗 <a href="https://huggingface.co/FreedomIntelligence/Apollo-MoE-0.5B" target="_blank">Apollo-MoE-0.5B</a> • 🤗 <a href="https://huggingface.co/FreedomIntelligence/Apollo-MoE-1.5B" target="_blank">Apollo-MoE-1.5B</a> • 🤗 <a href="https://huggingface.co/FreedomIntelligence/Apollo-MoE-7B" target="_blank">Apollo-MoE-7B</a>
103
+
104
+ <details>
105
+ <summary>Click to view the Post-MoE Models Results</summary>
106
+
107
+ ![ApolloMoE](assets/post_moe_results.png)
108
+
109
+ </details>
110
+
111
+
112
+
113
+
114
+
115
+
116
+
117
+ ## Usage Format
118
+ #### Apollo2
119
+ - 0.5B, 1.5B, 7B: User:{query}\nAssistant:{response}<|endoftext|>
120
+ - 2B, 9B: User:{query}\nAssistant:{response}\<eos\>
121
+ - 3.8B: <|user|>\n{query}<|end|><|assistant|>\n{response}<|end|>
122
+
123
+ #### Apollo-MoE
124
+ - 0.5B, 1.5B, 7B: User:{query}\nAssistant:{response}<|endoftext|>
125
+
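The templates listed above can be assembled programmatically. The following is a minimal sketch, not code from the ApolloMoE repository: `PROMPT_PREFIX`, `END_TOKEN`, `build_prompt`, and `format_example` are hypothetical names, and the size-to-template mapping simply mirrors the list above. It assumes the tag after `<|end|>` in the 3.8B template is spelled `<|assistant|>`.

```python
# Hypothetical helpers mirroring the "Usage Format" list; not an official API.

# Text that precedes the model's response, keyed by model size.
PROMPT_PREFIX = {
    "0.5B": "User:{query}\nAssistant:",
    "1.5B": "User:{query}\nAssistant:",
    "7B": "User:{query}\nAssistant:",
    "2B": "User:{query}\nAssistant:",
    "9B": "User:{query}\nAssistant:",
    # Assumption: the assistant tag is spelled <|assistant|>.
    "3.8B": "<|user|>\n{query}<|end|><|assistant|>\n",
}

# Token that terminates the response, keyed by model size.
END_TOKEN = {
    "0.5B": "<|endoftext|>",
    "1.5B": "<|endoftext|>",
    "7B": "<|endoftext|>",
    "2B": "<eos>",
    "9B": "<eos>",
    "3.8B": "<|end|>",
}


def build_prompt(size: str, query: str) -> str:
    """Return the inference prompt, ending where the response should start."""
    if size not in PROMPT_PREFIX:
        raise ValueError(f"Unknown model size: {size!r}")
    return PROMPT_PREFIX[size].format(query=query)


def format_example(size: str, query: str, response: str) -> str:
    """Return a full query/response sequence including the end token."""
    return build_prompt(size, query) + response + END_TOKEN[size]
```

For inference, pass the result of `build_prompt` to the tokenizer and let the model generate until it emits the matching end token. `format_example` shows what a complete training-style sequence would look like under the same assumptions.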
126
+ ## Dataset & Evaluation
127
+
128
+ - Dataset
129
+ 🤗 <a href="https://huggingface.co/datasets/FreedomIntelligence/ApolloMoEDataset" target="_blank">ApolloMoEDataset</a>
130
+
131
+ <details><summary>Click to expand</summary>
132
+
133
+ ![ApolloMoE](assets/Dataset.png)
134
+
135
+ - [Data category](https://huggingface.co/datasets/FreedomIntelligence/ApolloCorpus/tree/main/train)
136
+
137
+
138
+ </details>
139
+
140
+ - Evaluation
141
+ 🤗 <a href="https://huggingface.co/datasets/FreedomIntelligence/ApolloMoEBench" target="_blank">ApolloMoEBench</a>
142
+
143
+ <details><summary>Click to expand</summary>
144
+
145
+ - EN:
146
+ - [MedQA-USMLE](https://huggingface.co/datasets/GBaker/MedQA-USMLE-4-options)
147
+ - [MedMCQA](https://huggingface.co/datasets/medmcqa/viewer/default/test)
148
+ - [PubMedQA](https://huggingface.co/datasets/pubmed_qa): Not used in the paper because the results fluctuated too much.
149
+ - [MMLU-Medical](https://huggingface.co/datasets/cais/mmlu)
150
+ - Clinical knowledge, Medical genetics, Anatomy, Professional medicine, College biology, College medicine
151
+ - ZH:
152
+ - [MedQA-MCMLE](https://huggingface.co/datasets/bigbio/med_qa/viewer/med_qa_zh_4options_bigbio_qa/test)
153
+ - [CMB-single](https://huggingface.co/datasets/FreedomIntelligence/CMB): Not used in the paper
154
+ - Randomly sampled 2,000 single-answer multiple-choice questions.
155
+ - [CMMLU-Medical](https://huggingface.co/datasets/haonan-li/cmmlu)
156
+ - Anatomy, Clinical_knowledge, College_medicine, Genetics, Nutrition, Traditional_chinese_medicine, Virology
157
+ - [CMExam](https://github.com/williamliujl/CMExam): Not used in the paper
158
+ - Randomly sampled 2,000 multiple-choice questions.
159
+
160
+
161
+ - ES: [Head_qa](https://huggingface.co/datasets/head_qa)
162
+ - FR:
163
+ - [FrenchMedMCQA](https://github.com/qanastek/FrenchMedMCQA)
164
+ - [MMLU_FR]
165
+ - Clinical knowledge, Medical genetics, Anatomy, Professional medicine, College biology, College medicine
166
+ - HI: [MMLU_HI](https://huggingface.co/datasets/FreedomIntelligence/MMLU_Hindi)
167
+ - Clinical knowledge, Medical genetics, Anatomy, Professional medicine, College biology, College medicine
168
+ - AR: [MMLU_AR](https://huggingface.co/datasets/FreedomIntelligence/MMLU_Arabic)
169
+ - Clinical knowledge, Medical genetics, Anatomy, Professional medicine, College biology, College medicine
170
+ - JA: [IgakuQA](https://github.com/jungokasai/IgakuQA)
171
+ - KO: [KorMedMCQA](https://huggingface.co/datasets/sean0042/KorMedMCQA)
172
+ - IT:
173
+ - [MedExpQA](https://huggingface.co/datasets/HiTZ/MedExpQA)
174
+ - [MMLU_IT]
175
+ - Clinical knowledge, Medical genetics, Anatomy, Professional medicine, College biology, College medicine
176
+ - DE: [BioInstructQA](https://huggingface.co/datasets/BioMistral/BioInstructQA): German part
177
+ - PT: [BioInstructQA](https://huggingface.co/datasets/BioMistral/BioInstructQA): Portuguese part
178
+ - RU: [RuMedBench](https://github.com/sb-ai-lab/MedBench)
179
+
180
+
181
+
182
+
183
+
184
+
185
+ </details>
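Two of the Chinese benchmarks listed above are subsampled to 2,000 multiple-choice questions. The sketch below shows one reproducible way to draw such a subset. It is an illustration under stated assumptions, not the authors' script: the `sample_questions` helper, the `answer` field holding the correct option letters, and the fixed seed are all hypothetical.

```python
import random


def sample_questions(questions, k=2000, seed=0, single_answer_only=False):
    """Draw up to k questions without replacement, reproducibly.

    questions: list of dicts; the "answer" key (hypothetical) holds the
    correct option letters, e.g. "A" or "AB".
    """
    if single_answer_only:
        pool = [q for q in questions if len(q["answer"]) == 1]
    else:
        pool = list(questions)
    if len(pool) <= k:
        return pool
    # A dedicated Random instance keeps the draw independent of global state.
    return random.Random(seed).sample(pool, k)
```

Fixing the seed means the same 2,000 questions are selected on every run, which keeps scores comparable across models.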
186
+
187
+
188
+ ## Results reproduction
189
+ <details><summary>Click to expand</summary>
190
+
191
+
192
+ We take Gemma-2b as an example.
193
+ 1. Download the dataset for the project:
194
+
195
+ ```
196
+ bash 0.download_data.sh
197
+ ```
198
+
199
+ 2. Prepare the test and dev sets for a specific model:
200
+
201
+
202
+ - Create test data with the model's special tokens. You can use ./util/check.ipynb to check a model's special tokens.
203
+
204
+ ```
205
+ bash "1.data_process_test&dev.sh"
206
+ ```
207
+
208
+ 3. Prepare the training data for a specific model (create tokenized data in advance):
209
+
210
+
211
+ - You can adjust the order of the training data and the number of training epochs in this step.
212
+
213
+ ```
214
+ bash 2.data_process_train.sh
215
+ ```
216
+
217
+ 4. Train the model
218
+
219
+
220
+ - To train on multiple nodes, refer to ./scripts/multi_node_train_*.sh
221
+
222
+
223
+
224
+
225
+ ```
226
+ bash 3.single_node_train_gemma.sh
227
+ ```
228
+
229
+
230
+ 5. Evaluate your model: generate benchmark scores
231
+
232
+ ```
233
+ bash 4.eval.sh
234
+ ```
235
+
236
+ 6. Evaluate your model: chat with your checkpoints from the command line
237
+
238
+ ```
239
+ python ./src/evaluate/cli_demo.py --model_name='./ckpts/your/path/tfmr'
240
+ ```
241
+
242
+ </details>
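Steps 1 to 5 above can also be driven from a single script. The following is a minimal sketch that assumes the numbered scripts sit in the current working directory and take no arguments, as in the commands above. `STEPS`, `build_commands`, and `run_pipeline` are hypothetical names and are not part of the repository. Passing each script name as a separate list element to `subprocess.run` avoids shell interpretation of the `&` in `1.data_process_test&dev.sh`.

```python
import subprocess

# Script names taken from the steps above, in execution order.
STEPS = [
    "0.download_data.sh",
    "1.data_process_test&dev.sh",
    "2.data_process_train.sh",
    "3.single_node_train_gemma.sh",
    "4.eval.sh",
]


def build_commands(steps=STEPS):
    """Return one argv list per step; no shell is involved."""
    return [["bash", script] for script in steps]


def run_pipeline(steps=STEPS, dry_run=False):
    """Run each step in order and stop at the first failure."""
    for cmd in build_commands(steps):
        print("Running:", " ".join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError on a non-zero exit code.
            subprocess.run(cmd, check=True)
```

Calling `run_pipeline(dry_run=True)` prints the commands without executing them, which is a cheap way to confirm the order before starting a long training run.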
243
+
244
+
245
+
246
+ ## Citation
247
+ Please use the following citation if you intend to use our dataset for training or evaluation:
248
+
249
+ ```
250
+ @misc{zheng2024efficientlydemocratizingmedicalllms,
251
+ title={Efficiently Democratizing Medical LLMs for 50 Languages via a Mixture of Language Family Experts},
252
+ author={Guorui Zheng and Xidong Wang and Juhao Liang and Nuo Chen and Yuping Zheng and Benyou Wang},
253
+ year={2024},
254
+ eprint={2410.10626},
255
+ archivePrefix={arXiv},
256
+ primaryClass={cs.CL},
257
+ url={https://arxiv.org/abs/2410.10626},
258
+ }
259
+ ```
260
+