---
language:
- en
license: mit
datasets:
- anon8231489123/ShareGPT_Vicuna_unfiltered
model-index:
- name: yi6B_Vicuna
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: AI2 Reasoning Challenge (25-Shot)
      type: ai2_arc
      config: ARC-Challenge
      split: test
      args:
        num_few_shot: 25
    metrics:
    - type: acc_norm
      value: 46.16
      name: normalized accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=lorinma/yi6B_Vicuna
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: HellaSwag (10-Shot)
      type: hellaswag
      split: validation
      args:
        num_few_shot: 10
    metrics:
    - type: acc_norm
      value: 69.3
      name: normalized accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=lorinma/yi6B_Vicuna
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MMLU (5-Shot)
      type: cais/mmlu
      config: all
      split: test
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 58.43
      name: accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=lorinma/yi6B_Vicuna
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: TruthfulQA (0-shot)
      type: truthful_qa
      config: multiple_choice
      split: validation
      args:
        num_few_shot: 0
    metrics:
    - type: mc2
      value: 48.11
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=lorinma/yi6B_Vicuna
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: Winogrande (5-shot)
      type: winogrande
      config: winogrande_xl
      split: validation
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 65.67
      name: accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=lorinma/yi6B_Vicuna
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: GSM8k (5-shot)
      type: gsm8k
      config: main
      split: test
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 18.42
      name: accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=lorinma/yi6B_Vicuna
      name: Open LLM Leaderboard
---

**Bug**: There is an unresolved issue with the tokenizer that I am still investigating. In the meantime, you can use the original Yi tokenizer configuration.

This model reproduces Vicuna, using Yi-6B as the base model. The training data was ShareGPT_V3_unfiltered_cleaned_split_no_imsorry.json.

Training used the MedicalGPT framework (https://github.com/shibing624/MedicalGPT) with the following shell command:

```
CUDA_VISIBLE_DEVICES=0,1,2,3,5 torchrun --nproc_per_node 5 ../supervised_finetuning.py \
    --model_type auto \
    --model_name_or_path /data/llm/models/Pretrained/yi-6B/01ai/Yi-6B \
    --tokenizer_name_or_path /data/llm/models/Pretrained/yi-6B/01ai/Yi-6B \
    --train_file_dir ../data/finetune/vicuna/ \
    --per_device_train_batch_size 2 \
    --do_train \
    --max_train_samples -1 \
    --num_train_epochs 3 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --bf16 \
    --use_peft False \
    --logging_strategy steps \
    --logging_steps 10 \
    --save_strategy epoch \
    --save_total_limit 5 \
    --gradient_accumulation_steps 1 \
    --preprocessing_num_workers 8 \
    --output_dir ../outputs/20240106_yi6B_vicuna \
    --overwrite_output_dir \
    --ddp_timeout 30000 \
    --logging_first_step True \
    --torch_dtype bfloat16 \
    --device_map auto \
    --report_to tensorboard \
    --ddp_find_unused_parameters False \
    --gradient_checkpointing True \
    --cache_dir ./cache \
    --model_max_length 4096 \
    --deepspeed ../deepspeed_zero_stage2_config_no16.json \
    --template_name yi
```

Training ran for 3 epochs on 5 A800 GPUs:

```
***** train metrics *****
  epoch                    =        3.0
  train_loss               =     0.3785
  train_runtime            = 1 day, 10:01:13.95
  train_samples            =      93204
  train_samples_per_second =       2.24
  train_steps_per_second   =      0.224
```

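
As a quick sanity check on these numbers, the sketch below recomputes the effective batch size from the launch flags (5 processes, `--per_device_train_batch_size 2`, `--gradient_accumulation_steps 1`) and relates it to the reported throughput. The variable names are illustrative, and the GPU-hours figure is a derived estimate rather than a logged value.

```python
# Values copied from the launch command and the train metrics above.
num_gpus = 5                   # --nproc_per_node 5
per_device_batch_size = 2      # --per_device_train_batch_size 2
grad_accum_steps = 1           # --gradient_accumulation_steps 1
steps_per_second = 0.224       # train_steps_per_second
samples_per_second = 2.24      # train_samples_per_second
runtime_seconds = (24 + 10) * 3600 + 1 * 60 + 13.95  # 1 day, 10:01:13.95

# Samples consumed per optimizer step across all GPUs.
effective_batch_size = num_gpus * per_device_batch_size * grad_accum_steps

# Throughput implied by the step rate; it should match the logged samples/s.
implied_samples_per_second = steps_per_second * effective_batch_size

# Wall-clock hours and total GPU-hours (derived, not logged).
runtime_hours = runtime_seconds / 3600
gpu_hours = runtime_hours * num_gpus

print(f"effective batch size: {effective_batch_size}")
print(f"implied samples/s:    {implied_samples_per_second:.2f}")
print(f"runtime:              {runtime_hours:.2f} h ({gpu_hours:.1f} GPU-hours)")
```
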
Inference after training also uses this repository:

```
# Gradio web demo, loading both the model and the tokenizer from the fine-tuned output directory
CUDA_VISIBLE_DEVICES=4 python gradio_demo.py --model_type auto --base_model /data/mn/shibing624/MedicalGPT-1.6.3-231215/outputs/20240106_yi6B_vicuna --tokenizer_path /data/mn/shibing624/MedicalGPT-1.6.3-231215/outputs/20240106_yi6B_vicuna --template_name yi --gpus 4
# Interactive command-line inference, loading the tokenizer from the original Yi-6B directory
CUDA_VISIBLE_DEVICES=6 python inference.py --model_type auto --base_model /data/mn/shibing624/MedicalGPT-1.6.3-231215/outputs/20240106_yi6B_vicuna --template_name yi --gpus 6 --interactive --tokenizer_path /data/llm/models/Pretrained/yi-6B/01ai/Yi-6B
```

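
For use outside the MedicalGPT scripts, the following is a minimal sketch with the `transformers` library, not the author's tested setup. It assumes the weights are published on the Hugging Face Hub as `lorinma/yi6B_Vicuna` (the ID that appears in the leaderboard links) and that the original tokenizer is available as `01-ai/Yi-6B`. Like the second command above, it pairs the fine-tuned weights with the original Yi tokenizer. The function name `generate_reply` and the generation settings are illustrative, and the MedicalGPT `yi` prompt template is not reproduced here, so the prompt is passed through as-is.

```python
# Assumed Hub IDs (not confirmed by this card); replace with local paths if needed.
MODEL_ID = "lorinma/yi6B_Vicuna"
TOKENIZER_ID = "01-ai/Yi-6B"  # original Yi tokenizer, following the bug note


def generate_reply(prompt: str, max_new_tokens: int = 256) -> str:
    """Generate a completion for a raw prompt (no chat template applied)."""
    # Imported lazily so that defining this function downloads nothing.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)
    model.eval()

    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)

    # Drop the prompt tokens and decode only the newly generated part.
    new_tokens = output_ids[0, inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)


# Example call (downloads the weights on first use):
# print(generate_reply("Explain gradient checkpointing in one paragraph."))
```
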
Preliminary results show that the conversation is natural and informative, as expected.

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6413d7be996b2e426f230fb7/WfQYyyLxtXA2KlePmIPQJ.png)

The unfiltered training data also appears to have the intended effect. **Heads up**: some examples are unsafe and inappropriate. This model is entirely for research purposes, to test how alignment-filtered SFT data affects an LLM's final output.

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6413d7be996b2e426f230fb7/pklSsljCRN34QuL2ZF2zU.png)

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6413d7be996b2e426f230fb7/22pTSVkBCVlQ5N8A8JBkF.png)

**Update:** Evaluation on the Open LLM Leaderboard:

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6413d7be996b2e426f230fb7/Xp11HLQqwh0HMSJgpr19n.png)

# [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)
Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_lorinma__yi6B_Vicuna)

|             Metric              |Value|
|---------------------------------|----:|
|Avg.                             |51.02|
|AI2 Reasoning Challenge (25-Shot)|46.16|
|HellaSwag (10-Shot)              |69.30|
|MMLU (5-Shot)                    |58.43|
|TruthfulQA (0-shot)              |48.11|
|Winogrande (5-shot)              |65.67|
|GSM8k (5-shot)                   |18.42|

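
The `Avg.` row can be checked against the six benchmark scores. The sketch below takes their unweighted mean using exact decimal arithmetic. The rounded table values give 51.015, which is consistent with the reported 51.02 if the leaderboard averages unrounded scores (an assumption, not something this card states).

```python
from decimal import Decimal

# Benchmark scores copied from the table above.
scores = {
    "AI2 Reasoning Challenge (25-Shot)": Decimal("46.16"),
    "HellaSwag (10-Shot)": Decimal("69.30"),
    "MMLU (5-Shot)": Decimal("58.43"),
    "TruthfulQA (0-shot)": Decimal("48.11"),
    "Winogrande (5-shot)": Decimal("65.67"),
    "GSM8k (5-shot)": Decimal("18.42"),
}

# Unweighted mean over the six benchmarks.
average = sum(scores.values()) / len(scores)
print(f"mean of the six scores: {average}")
```
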