Xidong committed on
Commit
66b777b
1 Parent(s): 8ee3f71

Create README.md

Files changed (1)
  1. README.md +256 -0
README.md ADDED
@@ -0,0 +1,256 @@
+ ---
+ license: apache-2.0
+ datasets:
+ - FreedomIntelligence/ApolloMoEDataset
+ language:
+ - ar
+ - en
+ - zh
+ - ko
+ - ja
+ - mn
+ - th
+ - vi
+ - lo
+ - mg
+ - de
+ - pt
+ - es
+ - fr
+ - ru
+ - it
+ - hr
+ - gl
+ - cs
+ - co
+ - la
+ - uk
+ - bs
+ - bg
+ - eo
+ - sq
+ - da
+ - sa
+ - 'no'
+ - gn
+ - sr
+ - sk
+ - gd
+ - lb
+ - hi
+ - ku
+ - mt
+ - he
+ - ln
+ - bm
+ - sw
+ - ig
+ - rw
+ - ha
+ metrics:
+ - accuracy
+ base_model:
+ - google/gemma-2-9b
+ pipeline_tag: question-answering
+ tags:
+ - biology
+ - medical
+ ---
+ # Democratizing Medical LLMs for Many More Languages
+
+ Covers 12 major languages (English, Chinese, French, Hindi, Spanish, Arabic, Russian, Japanese, Korean, German, Italian, and Portuguese) and 38 minor languages so far.
+ <center>
+
+
+
+ <p align="center">
+ 📃 <a href="https://arxiv.org/abs/2410.10626" target="_blank">Paper</a> • 🌐 <a href="" target="_blank">Demo</a> • 🤗 <a href="https://huggingface.co/datasets/FreedomIntelligence/ApolloMoEDataset" target="_blank">ApolloMoEDataset</a> • 🤗 <a href="https://huggingface.co/datasets/FreedomIntelligence/ApolloMoEBench" target="_blank">ApolloMoEBench</a> • 🤗 <a href="https://huggingface.co/collections/FreedomIntelligence/apollomoe-and-apollo2-670ddebe3bb1ba1aebabbf2c" target="_blank">Models</a> • 🌐 <a href="https://github.com/FreedomIntelligence/Apollo" target="_blank">Apollo</a>
+ </p>
+
+ ![Apollo](assets/apollo_medium_final.png)
+
+ ## 🌈 Update
+
+ * **[2024.10.15]** ApolloMoE repo is published!🎉
+
+
+ ## Architecture
+
+ <details>
+ <summary>Click to view the MoE routing image</summary>
+
+ ![ApolloMoE](/assets/hybrid_routing.png)
+
+ </details>
+
+ ## Results
+
+ ### Dense
+ 🤗 <a href="https://huggingface.co/FreedomIntelligence/Apollo2-0.5B" target="_blank">Apollo2-0.5B</a> • 🤗 <a href="https://huggingface.co/FreedomIntelligence/Apollo2-1.5B" target="_blank">Apollo2-1.5B</a> • 🤗 <a href="https://huggingface.co/FreedomIntelligence/Apollo2-2B" target="_blank">Apollo2-2B</a> • 🤗 <a href="https://huggingface.co/FreedomIntelligence/Apollo2-3.8B" target="_blank">Apollo2-3.8B</a> • 🤗 <a href="https://huggingface.co/FreedomIntelligence/Apollo2-7B" target="_blank">Apollo2-7B</a> • 🤗 <a href="https://huggingface.co/FreedomIntelligence/Apollo2-9B" target="_blank">Apollo2-9B</a>
+
+ <details>
+ <summary>Click to view the Dense Models Results</summary>
+
+ ![ApolloMoE](assets/dense_results.png)
+
+ </details>
+
+ ### Post-MoE
+ 🤗 <a href="https://huggingface.co/FreedomIntelligence/Apollo-MoE-0.5B" target="_blank">Apollo-MoE-0.5B</a> • 🤗 <a href="https://huggingface.co/FreedomIntelligence/Apollo-MoE-1.5B" target="_blank">Apollo-MoE-1.5B</a> • 🤗 <a href="https://huggingface.co/FreedomIntelligence/Apollo-MoE-7B" target="_blank">Apollo-MoE-7B</a>
+
+ <details>
+ <summary>Click to view the Post-MoE Models Results</summary>
+
+ ![ApolloMoE](assets/post_moe_results.png)
+
+ </details>
+
+
+
+
+
+
+
+ ## Usage Format
+ #### Apollo2
+ - 0.5B, 1.5B, 7B: User:{query}\nAssistant:{response}<|endoftext|>
+ - 2B, 9B: User:{query}\nAssistant:{response}\<eos\>
+ - 3.8B: <|user|>\n{query}<|end|><|assistant|>\n{response}<|end|>
+
+ #### Apollo-MoE
+ - 0.5B, 1.5B, 7B: User:{query}\nAssistant:{response}<|endoftext|>
+
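The formats in the Usage Format list can be turned into a small prompt builder. The following is a minimal Python sketch, not code from the Apollo repository: the names `PROMPT_PREFIX`, `END_TOKEN`, `build_prompt`, and `build_example` are hypothetical, and the 3.8B entry assumes the intended token spelling is `<|assistant|>`.

```python
# Prompt formats transcribed from the "Usage Format" list.
# All names in this block are illustrative, not part of the Apollo codebase.
PLAIN_PREFIX = "User:{query}\nAssistant:"
# Assumption: the 3.8B format uses the token spelled <|assistant|>.
CHAT_PREFIX = "<|user|>\n{query}<|end|><|assistant|>\n"

PROMPT_PREFIX = {
    ("Apollo2", "0.5B"): PLAIN_PREFIX,
    ("Apollo2", "1.5B"): PLAIN_PREFIX,
    ("Apollo2", "7B"): PLAIN_PREFIX,
    ("Apollo2", "2B"): PLAIN_PREFIX,
    ("Apollo2", "9B"): PLAIN_PREFIX,
    ("Apollo2", "3.8B"): CHAT_PREFIX,
    ("Apollo-MoE", "0.5B"): PLAIN_PREFIX,
    ("Apollo-MoE", "1.5B"): PLAIN_PREFIX,
    ("Apollo-MoE", "7B"): PLAIN_PREFIX,
}

# Token that closes a response for each model.
END_TOKEN = {
    ("Apollo2", "0.5B"): "<|endoftext|>",
    ("Apollo2", "1.5B"): "<|endoftext|>",
    ("Apollo2", "7B"): "<|endoftext|>",
    ("Apollo2", "2B"): "<eos>",
    ("Apollo2", "9B"): "<eos>",
    ("Apollo2", "3.8B"): "<|end|>",
    ("Apollo-MoE", "0.5B"): "<|endoftext|>",
    ("Apollo-MoE", "1.5B"): "<|endoftext|>",
    ("Apollo-MoE", "7B"): "<|endoftext|>",
}


def build_prompt(family: str, size: str, query: str) -> str:
    """Return the text up to the point where the model's response begins."""
    key = (family, size)
    if key not in PROMPT_PREFIX:
        raise KeyError(f"no prompt format listed for {family} {size}")
    return PROMPT_PREFIX[key].format(query=query)


def build_example(family: str, size: str, query: str, response: str) -> str:
    """Return a full query/response string, including the end token."""
    prompt = build_prompt(family, size, query)
    return prompt + response + END_TOKEN[(family, size)]
```

At inference time only `build_prompt` would be needed; `build_example` shows how a complete training-style string is laid out.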
+ ## Dataset & Evaluation
+
+ - Dataset
+ 🤗 <a href="https://huggingface.co/datasets/FreedomIntelligence/ApolloMoEDataset" target="_blank">ApolloMoEDataset</a>
+
+ <details><summary>Click to expand</summary>
+
+ ![ApolloMoE](assets/Dataset.png)
+
+ - [Data category](https://huggingface.co/datasets/FreedomIntelligence/ApolloCorpus/tree/main/train)
+
+
+ </details>
+
+ - Evaluation
+ 🤗 <a href="https://huggingface.co/datasets/FreedomIntelligence/ApolloMoEBench" target="_blank">ApolloMoEBench</a>
+
+ <details><summary>Click to expand</summary>
+
+ - EN:
+   - [MedQA-USMLE](https://huggingface.co/datasets/GBaker/MedQA-USMLE-4-options)
+   - [MedMCQA](https://huggingface.co/datasets/medmcqa/viewer/default/test)
+   - [PubMedQA](https://huggingface.co/datasets/pubmed_qa): Not used in the paper because the results fluctuated too much.
+   - [MMLU-Medical](https://huggingface.co/datasets/cais/mmlu)
+     - Clinical knowledge, Medical genetics, Anatomy, Professional medicine, College biology, College medicine
+ - ZH:
+   - [MedQA-MCMLE](https://huggingface.co/datasets/bigbio/med_qa/viewer/med_qa_zh_4options_bigbio_qa/test)
+   - [CMB-single](https://huggingface.co/datasets/FreedomIntelligence/CMB): Not used in the paper
+     - 2,000 randomly sampled single-answer multiple-choice questions
+   - [CMMLU-Medical](https://huggingface.co/datasets/haonan-li/cmmlu)
+     - Anatomy, Clinical_knowledge, College_medicine, Genetics, Nutrition, Traditional_chinese_medicine, Virology
+   - [CMExam](https://github.com/williamliujl/CMExam): Not used in the paper
+     - 2,000 randomly sampled multiple-choice questions
+
+
+ - ES: [Head_qa](https://huggingface.co/datasets/head_qa)
+ - FR:
+   - [FrenchMedMCQA](https://github.com/qanastek/FrenchMedMCQA)
+   - [MMLU_FR]
+     - Clinical knowledge, Medical genetics, Anatomy, Professional medicine, College biology, College medicine
+ - HI: [MMLU_HI](https://huggingface.co/datasets/FreedomIntelligence/MMLU_Hindi)
+   - Clinical knowledge, Medical genetics, Anatomy, Professional medicine, College biology, College medicine
+ - AR: [MMLU_AR](https://huggingface.co/datasets/FreedomIntelligence/MMLU_Arabic)
+   - Clinical knowledge, Medical genetics, Anatomy, Professional medicine, College biology, College medicine
+ - JA: [IgakuQA](https://github.com/jungokasai/IgakuQA)
+ - KO: [KorMedMCQA](https://huggingface.co/datasets/sean0042/KorMedMCQA)
+ - IT:
+   - [MedExpQA](https://huggingface.co/datasets/HiTZ/MedExpQA)
+   - [MMLU_IT]
+     - Clinical knowledge, Medical genetics, Anatomy, Professional medicine, College biology, College medicine
+ - DE: [BioInstructQA](https://huggingface.co/datasets/BioMistral/BioInstructQA): German part
+ - PT: [BioInstructQA](https://huggingface.co/datasets/BioMistral/BioInstructQA): Portuguese part
+ - RU: [RuMedBench](https://github.com/sb-ai-lab/MedBench)
+
+
+
+
+
+ </details>
+
+
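Two operations recur in the evaluation setup above: drawing a fixed-size random subset of multiple-choice questions (2,000 for CMB-single and CMExam) and scoring with accuracy, the metric named in the model card. The sketch below illustrates both under stated assumptions: the seed value, the function names, and the question representation are hypothetical, and the actual sampling procedure used for the benchmark is not described in this README.

```python
import random


def sample_questions(questions, k=2000, seed=0):
    """Return up to k questions drawn without replacement.

    A fixed seed (an assumed value here) makes the subset reproducible.
    """
    questions = list(questions)
    if len(questions) <= k:
        return questions
    rng = random.Random(seed)
    return rng.sample(questions, k)


def accuracy(predictions, answers):
    """Fraction of predictions that exactly match the reference answers."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        return 0.0
    correct = sum(p == a for p, a in zip(predictions, answers))
    return correct / len(answers)
```

Exact-match accuracy suits single-answer multiple-choice items, where each prediction is one option label such as `"A"`.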
+ ## Results reproduction
+ <details><summary>Click to expand</summary>
+
+
+ We use Gemma-2b as an example.
+ 1. Download the dataset for the project:
+
+ ```
+ bash 0.download_data.sh
+ ```
+
+ 2. Prepare the test and dev data for a specific model:
+
+
+ - Create test data with the model's special tokens; you can use ./util/check.ipynb to check a model's special tokens.
+
+ ```
+ bash "1.data_process_test&dev.sh"
+ ```
+
+ 3. Prepare the training data for a specific model (tokenize the data in advance):
+
+
+ - You can adjust the training data order and the number of training epochs in this step.
+
+ ```
+ bash 2.data_process_train.sh
+ ```
+
+ 4. Train the model:
+
+
+ - To train on multiple nodes, refer to ./scripts/multi_node_train_*.sh.
+
+
+
+
+ ```
+ bash 3.single_node_train_gemma.sh
+ ```
+
+
+ 5. Evaluate your model: generate benchmark scores:
+
+ ```
+ bash 4.eval.sh
+ ```
+
+ 6. Evaluate your model: chat with your checkpoints from the command line:
+
+ ```
+ python ./src/evaluate/cli_demo.py --model_name='./ckpts/your/path/tfmr'
+ ```
+
+ </details>
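The scripted steps 1 to 5 above can also be chained from Python. This is a minimal sketch that assumes the scripts sit in the current working directory, as they do in the Apollo repository layout; `STEPS` and `run_pipeline` are hypothetical names. Passing each command as an argument list avoids shell parsing, so the `&` in the second script name is treated as part of the file name rather than as a background operator.

```python
import subprocess

# Script names taken from the reproduction steps; the list itself is illustrative.
STEPS = [
    ["bash", "0.download_data.sh"],
    ["bash", "1.data_process_test&dev.sh"],
    ["bash", "2.data_process_train.sh"],
    ["bash", "3.single_node_train_gemma.sh"],
    ["bash", "4.eval.sh"],
]


def run_pipeline(steps=STEPS, dry_run=True):
    """Run each step in order and stop at the first failure.

    With dry_run=True (the default in this sketch) the commands are only
    printed, which is useful for checking the order before a long run.
    """
    executed = []
    for cmd in steps:
        print("step:", " ".join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError on a non-zero exit code.
            subprocess.run(cmd, check=True)
        executed.append(cmd)
    return executed
```

Calling `run_pipeline(dry_run=False)` from the repository root would execute the scripts for real; training and evaluation settings still come from the scripts themselves.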
+
+
+
+ ## Citation
+ Please use the following citation if you intend to use our dataset for training or evaluation:
+
+ ```
+ @misc{zheng2024efficientlydemocratizingmedicalllms,
+       title={Efficiently Democratizing Medical LLMs for 50 Languages via a Mixture of Language Family Experts},
+       author={Guorui Zheng and Xidong Wang and Juhao Liang and Nuo Chen and Yuping Zheng and Benyou Wang},
+       year={2024},
+       eprint={2410.10626},
+       archivePrefix={arXiv},
+       primaryClass={cs.CL},
+       url={https://arxiv.org/abs/2410.10626},
+ }
+ ```
+