metadata
license: apache-2.0
datasets:
- abacusai/MetaMathFewshot
- shahules786/orca-chat
- anon8231489123/ShareGPT_Vicuna_unfiltered
base_model: mistralai/Mistral-7B-v0.1
model-index:
- name: Fewshot-Metamath-OrcaVicuna-Mistral
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: AI2 Reasoning Challenge (25-Shot)
      type: ai2_arc
      config: ARC-Challenge
      split: test
      args:
        num_few_shot: 25
    metrics:
    - type: acc_norm
      value: 59.64
      name: normalized accuracy
    source:
      url: >-
        https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=abacusai/Fewshot-Metamath-OrcaVicuna-Mistral
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: HellaSwag (10-Shot)
      type: hellaswag
      split: validation
      args:
        num_few_shot: 10
    metrics:
    - type: acc_norm
      value: 81.82
      name: normalized accuracy
    source:
      url: >-
        https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=abacusai/Fewshot-Metamath-OrcaVicuna-Mistral
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MMLU (5-Shot)
      type: cais/mmlu
      config: all
      split: test
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 61.69
      name: accuracy
    source:
      url: >-
        https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=abacusai/Fewshot-Metamath-OrcaVicuna-Mistral
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: TruthfulQA (0-shot)
      type: truthful_qa
      config: multiple_choice
      split: validation
      args:
        num_few_shot: 0
    metrics:
    - type: mc2
      value: 53.23
    source:
      url: >-
        https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=abacusai/Fewshot-Metamath-OrcaVicuna-Mistral
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: Winogrande (5-shot)
      type: winogrande
      config: winogrande_xl
      split: validation
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 78.45
      name: accuracy
    source:
      url: >-
        https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=abacusai/Fewshot-Metamath-OrcaVicuna-Mistral
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: GSM8k (5-shot)
      type: gsm8k
      config: main
      split: test
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 69.14
      name: accuracy
    source:
      url: >-
        https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=abacusai/Fewshot-Metamath-OrcaVicuna-Mistral
      name: Open LLM Leaderboard
This model was trained on our MetamathFewshot dataset, together with the Vicuna and OrcaChat datasets.
It was fine-tuned from the base Mistral 7B model (mistralai/Mistral-7B-v0.1).
Usage
This model uses a specific prompt format which is encoded as a chat template. To apply this, you can use the tokenizer.apply_chat_template() method of the attached tokenizer:
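from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "abacusai/Fewshot-Metamath-OrcaVicuna-Mistral"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
messages = [
    {"role": "user", "content": "What is the capital of Spain?"},
    {"role": "assistant", "content": "The capital of Spain is Madrid."}
]
# return_dict=True yields input_ids and attention_mask, which generate() accepts as keyword arguments
gen_input = tokenizer.apply_chat_template(messages, return_tensors="pt", return_dict=True)
model.generate(**gen_input)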
Evaluation Results
HuggingFace Leaderboard
Average | ARC | HellaSwag | MMLU | TruthfulQA | Winogrande | GSM8K |
---|---|---|---|---|---|---|
67.33 | 59.64 | 81.82 | 61.69 | 53.23 | 78.45 | 69.14 |
For comparison, the original metamath/MetaMath-Mistral-7B scored 68.84 on GSM8K,
and its average score was 65.78.
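The Average column is consistent with an unweighted mean of the six benchmark scores. The short check below is only a sketch under that assumption; the variable names are illustrative, and the numbers are copied from the table and the comparison sentence above.

```python
# Scores copied from the leaderboard table above.
scores = {
    "ARC": 59.64,
    "HellaSwag": 81.82,
    "MMLU": 61.69,
    "TruthfulQA": 53.23,
    "Winogrande": 78.45,
    "GSM8K": 69.14,
}

# Assumption: the leaderboard average is a plain, unweighted mean.
average = round(sum(scores.values()) / len(scores), 2)

# Reference figures quoted above for metamath/MetaMath-Mistral-7B.
baseline_gsm8k = 68.84
baseline_average = 65.78

gsm8k_gap = round(scores["GSM8K"] - baseline_gsm8k, 2)
average_gap = round(average - baseline_average, 2)

print(f"average={average}, GSM8K gap={gsm8k_gap}, average gap={average_gap}")
```

Under that assumption, the mean reproduces the reported 67.33, and the model is ahead of the MetaMath baseline by roughly 0.3 points on GSM8K and about 1.5 points on average.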
MT-Bench
Turn 1 | Turn 2 | Average |
---|---|---|
6.90 | 6.52 | 6.71 |
Training Details
Instruction tuned with the following parameters:
- LoRA, Rank 8, Alpha 16, Dropout 0.05, all modules (QKV and MLP)
- 3 epochs
- Micro Batch Size 32 over 4xH100, gradient accumulation steps = 1
- AdamW with learning rate 5e-5
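To give a feel for what these LoRA settings imply, here is a rough, dependency-free estimate of the number of trainable adapter weights. It is a sketch, not the training configuration itself: the layer shapes are the published Mistral-7B-v0.1 dimensions rather than anything stated in this card, the module names (`q_proj`, `k_proj`, `v_proj`, `gate_proj`, `up_proj`, `down_proj`) are an assumed reading of "QKV and MLP", and whether the attention output projection was also adapted is not stated.

```python
# Hyperparameters quoted in the list above.
RANK = 8
ALPHA = 16

# Assumed Mistral-7B-v0.1 shapes (not stated in this card).
HIDDEN = 4096         # model width
KV_DIM = 1024         # 8 key/value heads x 128 dims (grouped-query attention)
INTERMEDIATE = 14336  # MLP inner width
NUM_LAYERS = 32

# (in_features, out_features) for each adapted linear layer.
# Assumption: "QKV and MLP" maps to exactly these six projections.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """A LoRA adapter adds A (rank x in) and B (out x rank) matrices."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in TARGET_SHAPES.values())
total_trainable = per_layer * NUM_LAYERS

# The low-rank update is scaled by alpha / rank before being added.
scaling = ALPHA / RANK

print(f"per layer: {per_layer:,}  total: {total_trainable:,}  scaling: {scaling}")
```

Under these assumptions the adapters hold about 18.9 million trainable weights, a small fraction of the roughly 7 billion parameters in the base model, and the update is applied with a scaling factor of 2.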
Bias, Risks, and Limitations
The model has not been evaluated for safety and is only intended for research and experiments.
Open LLM Leaderboard Evaluation Results
Detailed results can be found here
Metric | Value |
---|---|
Avg. | 67.33 |
AI2 Reasoning Challenge (25-Shot) | 59.64 |
HellaSwag (10-Shot) | 81.82 |
MMLU (5-Shot) | 61.69 |
TruthfulQA (0-shot) | 53.23 |
Winogrande (5-shot) | 78.45 |
GSM8k (5-shot) | 69.14 |