MMLU-Pro-NoMath

Community Article Published July 11, 2024

MMLU-Pro-NoMath and MMLU-Pro-NoMath-Sml are subsets of MMLU-Pro with the questions requiring multi-step calculation removed (these make up 43% of the original test set). We used claude-3.5-sonnet as the classifier. Questions were also capped at an upper length limit to make logprobs evals faster and less likely to run out of memory (OOM). It's fast: evaluating gemma-2-9b-it with the Eleuther harness takes about 20 minutes for NoMath and 7 minutes for NoMath-Sml.

🤔 Why do this?

In short, because we wanted a quick-to-run MMLU-Pro subset that is friendly to logprobs eval and primarily assesses knowledge & reasoning. One could simply run MMLU-Pro excluding the categories that have a heavy math component, but (a) all categories except history have some amount of math, and (b) the math-heavy categories have a lot of great non-math questions in areas we would like to assess!

MMLU-Pro was developed to address some shortcomings of the aging (in LLM timescales) MMLU benchmark. It uses 10 multiple-choice options instead of MMLU's 4, which lowers the random baseline from 0.25 to 0.1 and so increases the effective scoring range. It also ramps up the difficulty, adding some much-needed headroom to future-proof the test.
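To make the "effective scoring range" point concrete, here is a minimal sketch that computes the gap between random guessing and a perfect score, plus a chance-adjusted score. The function names are illustrative and are not part of any MMLU or MMLU-Pro tooling.

```python
def scoring_range(num_options: int) -> float:
    """Width of the band between random guessing and a perfect score."""
    return 1.0 - 1.0 / num_options


def chance_adjusted(accuracy: float, num_options: int) -> float:
    """Rescale accuracy so random guessing maps to 0 and a perfect score to 1."""
    baseline = 1.0 / num_options
    return (accuracy - baseline) / (1.0 - baseline)


mmlu_range = scoring_range(4)       # 4 options: 1 - 0.25
mmlu_pro_range = scoring_range(10)  # 10 options: 1 - 0.1
```

With 4 options the usable band is 0.75 wide; with 10 options it widens to 0.9.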

Of the 12032 items in MMLU-Pro, 5122 (43%) are applied math problems requiring multiple calculation steps to solve. This larger multi-step math component is a primary source of the extra difficulty of MMLU-Pro over the original MMLU.

One of the reasons the original MMLU was useful & widely used was that it primarily tested multi-domain knowledge and reasoning. It had a light math component but was formulated to be answerable without chain-of-thought (CoT) generative evaluations. We created a subset of MMLU-Pro to get the best of both worlds: More headroom, knowledge & reasoning focus, and friendly to logprobs evals.

🔍 NoMath Subset Details

Questions containing a math component were identified by presenting each test item to Claude-3.5-sonnet and asking it whether the question requires multi-step calculation to solve. The three options were "Y", "N" and "S", where "S" denoted simple math content that could typically be solved in one's head without multiple steps. In our subset, we allowed "N" and "S" classifications, as our aim is to include as many of the applied knowledge & reasoning questions as possible while filtering out questions that rely on CoT & complex calculation.
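The filtering step can be sketched as follows: drop every item the classifier flagged as needing multi-step calculation and keep the rest. This is an illustration rather than the script used to build the subset; the `question_id` field name and the `labels` mapping are assumptions made for the example.

```python
from typing import Dict, List

# Labels kept in the subset: "N" (no math) and "S" (simple mental math).
# "Y" (multi-step calculation required) is filtered out.
KEEP_LABELS = {"N", "S"}


def filter_nomath(items: List[Dict], labels: Dict[int, str]) -> List[Dict]:
    """Keep items whose classifier label is in KEEP_LABELS.

    `labels` maps a (hypothetical) question_id to "Y", "N" or "S".
    Items without a label are dropped rather than guessed at.
    """
    return [item for item in items if labels.get(item["question_id"]) in KEEP_LABELS]
```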

We also created a smaller version of the subset (MMLU-Pro-NoMath-Sml), which has a balanced distribution of items per category.
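One possible way to draw a category-balanced sample is shown below. How the Sml subset was actually sampled is not described here, so treat this as a sketch; the `category` field name, the per-category quota and the seed are all assumptions.

```python
import random
from collections import defaultdict
from typing import Dict, List


def balanced_sample(items: List[Dict], per_category: int, seed: int = 0) -> List[Dict]:
    """Draw up to `per_category` items from each category, without replacement."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    by_category = defaultdict(list)
    for item in items:
        by_category[item["category"]].append(item)

    sample = []
    for category in sorted(by_category):
        group = by_category[category]
        # Small categories contribute everything they have.
        k = min(per_category, len(group))
        sample.extend(rng.sample(group, k))
    return sample
```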

One other quality-of-life change is that we constrained the length of questions. Around 1.5% of items had question lengths of 1400 to 4700 characters. By removing the outliers in this range, we keep question lengths more consistent, which makes parallel logprobs evals faster and less likely to OOM.
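A length cap like this amounts to a one-line predicate. The sketch below assumes the cutoff sits at 1400 characters (the lower end of the outlier range quoted above) and that length is measured on the `question` field alone; both are assumptions.

```python
from typing import Dict

MAX_QUESTION_CHARS = 1400  # assumed cutoff, taken from the outlier range above


def within_length_limit(item: Dict, max_chars: int = MAX_QUESTION_CHARS) -> bool:
    """True if the question text is shorter than the cutoff."""
    return len(item["question"]) < max_chars
```

Keeping prompt lengths in a narrow band matters for batched evaluation because each batch is padded to its longest sequence, so a single very long question inflates the memory use of the whole batch.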

One of the stated purposes of creating MMLU-Pro was to increase the headroom of the original MMLU, which was starting to saturate at the top of the ability range. Models typically score higher on these NoMath subsets than on the full MMLU-Pro set, but we still retain most of the difficulty benefit of MMLU-Pro over MMLU. The current top open-source model (Qwen-2-72B) scores 82.3 on original MMLU, 64.4 on MMLU-Pro, and 68.1 on MMLU-Pro-NoMath. One key distinction is that with the NoMath subsets, all of that headroom is knowledge/reasoning rather than being gated by math ability.


🧮 What does logprobs evaluation mean?

Logprobs evaluation is a method for evaluating language models on multiple-choice tests. Instead of having the model generate its answer as text, it uses the probabilities the model assigns to output tokens to determine the model's answer. Here's how it works:

For each answer choice, the model calculates the log probability of generating that choice given the question and context. The log probabilities are typically calculated by summing the log probabilities of each token in the answer choice. The answer choice with the highest log probability is selected as the model's prediction. This prediction is then compared to the correct answer to determine if the model got the question right.
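The selection step described above can be sketched in a few lines. This is not the Eleuther harness implementation: it assumes the per-token log probabilities for each answer choice have already been obtained from the model, and the numbers in the toy example are made up.

```python
from typing import List, Sequence


def choice_score(token_logprobs: Sequence[float]) -> float:
    """Log probability of a whole answer choice: the sum over its tokens."""
    return sum(token_logprobs)


def predict(choice_token_logprobs: List[Sequence[float]]) -> int:
    """Index of the answer choice with the highest summed log probability."""
    scores = [choice_score(lp) for lp in choice_token_logprobs]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy example: four single-token choices (e.g. "A".."D") with invented values.
toy_logprobs = [[-2.3], [-0.4], [-1.9], [-3.1]]
predicted = predict(toy_logprobs)
is_correct = predicted == 1  # pretend the gold answer is the second choice
```

Because each choice needs only a scoring pass over the prompt rather than token-by-token generation, this is where the speed advantage comes from.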

Key advantages of logprobs evaluation:

  • Speed: It's typically 5-10 times faster than generative methods, as it doesn't require the model to generate full text responses.
  • Consistency: It's less sensitive to changes in experimental setup, like differences in prompts or sampling methods.
  • Simplicity: It doesn't rely on being able to parse the generated answer, which can sometimes be ambiguous or incorrectly formatted.

However, on some tests (like MMLU-Pro!) logprobs evaluation can result in lower scores compared to generative methods with CoT prompting, as it doesn't allow the model to "show its work" or reason through the problem step-by-step.

❓ What's wrong with math & CoT?

  • The math gatekeeps the knowledge/reasoning evaluation. If the model can't accurately work through the math, it will get the question wrong even if it understood the knowledge component.
  • It confounds the result. If we're testing math ability, knowledge and reasoning all together -- some categories have a lot of each type -- it can be hard to interpret why a model scores how it does. Is it scoring low because of weak math, weak knowledge, or both? We already have benchmarks that evaluate just math, so we don't need MMLU to do this.
  • The math questions in MMLU-Pro are formulated for generative CoT evaluation, which makes them less accessible to logprobs evaluation.
  • Results from logprobs evals are typically significantly worse than generative CoT evals, so results aren't very comparable.

We could run MMLU-Pro excluding the math-heavy categories, but most of those categories have a significant non-math (knowledge or reasoning) component, which we would lose if we excluded them. Instead, we categorise each item as either requiring multi-step math working or not, and remove the items that do. This way, we keep all the knowledge & reasoning based questions in every category.

🏃 Run with Eleuther LM-Eval

(5-shot logprobs evaluation -- same config as Open LLM Leaderboard)

git clone https://github.com/sam-paech/lm-evaluation-harness.git -b mmlu-pro-irt
cd lm-evaluation-harness
pip install -e .
pip install git+https://github.com/huggingface/transformers.git

huggingface-cli login --token <mytoken>
export HF_HUB_ENABLE_HF_TRANSFER=1
lm_eval --model hf \
--model_args pretrained=google/gemma-2-9b-it,device_map=auto,max_length=4096,dtype=bfloat16 \
--tasks mmlu-pro-nomath,mmlu-pro-nomath-sml --device auto --batch_size auto

MMLU-Pro-NoMath -- gemma-2-9b-it

  • Runtime: 0:20:27
  • Accuracy: 0.5343
  • acc_stderr: 0.0060

MMLU-Pro-NoMath-Sml -- gemma-2-9b-it

  • Runtime: 0:06:50
  • Accuracy: 0.5301
  • acc_stderr: 0.0097
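The larger acc_stderr on the Sml subset is what one would expect from a smaller number of items. As a rough sketch, assuming each item is an independent right/wrong trial (the harness's exact estimator may differ slightly), the standard error of an accuracy follows the binomial formula:

```python
import math


def binomial_stderr(accuracy: float, n_items: int) -> float:
    """Standard error of a mean accuracy over n independent right/wrong items."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_items)


def implied_item_count(accuracy: float, stderr: float) -> float:
    """Invert the formula: roughly how many items a reported stderr corresponds to."""
    return accuracy * (1.0 - accuracy) / stderr ** 2
```

Since the error shrinks with the square root of the item count, halving the standard error takes roughly four times as many items.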

🚀 Run with TIGER-AI-Lab/MMLU-Pro via VLLM

(5-shot generative evaluation with CoT)

git clone https://github.com/EQ-Bench/MMLU-Pro.git
cd MMLU-Pro
pip install -r requirements.txt
pip install git+https://github.com/huggingface/transformers.git
pip install git+https://github.com/vllm-project/vllm.git
# for gemma-2 compatibility:
# export VLLM_ATTENTION_BACKEND=FLASHINFER
# Note: You might also have to add ", enforce_eager=True" to the `llm = LLM(...)` line in evaluate_from_local.py if you are short on vram.

python evaluate_from_local.py --save_dir eval_results --model "google/gemma-2-9b-it" --gpu_util 0.94 --dataset sam-paech/mmlu-pro-nomath-sml
  • Model: google/gemma-2-9b-it
  • Runtime: 0:35:15
  • Accuracy: 0.5908

🦙 Run with TIGER-AI-Lab/MMLU-Pro via llama.cpp

(5-shot generative evaluation with CoT)

screen
cd ~
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
make LLAMA_CUDA=1
./llama-server -m gemma-2-9b-it-Q8_0.gguf --ctx-size 4096 --n-gpu-layers 200 --chat-template gemma2
# press ctrl-a then d to detach the screen session

cd ~
git clone https://github.com/EQ-Bench/MMLU-Pro.git
cd MMLU-Pro
pip install -r requirements.txt
python evaluate_from_llama.cpp.py --dataset sam-paech/mmlu-pro-nomath-sml
  • Model: bartowski/gemma-2-9b-it-GGUF
  • Runtime: 1:06:43
  • Accuracy: 0.5646

🐳 Run with chigkim/Ollama-MMLU-Pro

(5-shot generative evaluation with CoT)

git clone https://github.com/EQ-Bench/Ollama-MMLU-Pro.git
[see the notebook for example]
  • Model: google/gemma-2-9b-it
  • Runtime:
  • Accuracy:
Score distribution comparison

📚 References

Credit to the MMLU-Pro test set for providing the source questions that this subset was derived from:

https://github.com/TIGER-AI-Lab/MMLU-Pro

@misc{wang2024mmlupro,
      title={MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark}, 
      author={Yubo Wang and Xueguang Ma and Ge Zhang and Yuansheng Ni and Abhranil Chandra and Shiguang Guo and Weiming Ren and Aaran Arulraj and Xuan He and Ziyan Jiang and Tianle Li and Max Ku and Kai Wang and Alex Zhuang and Rongqi Fan and Xiang Yue and Wenhu Chen},
      year={2024},
      eprint={2406.01574},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

And also to the original MMLU which MMLU-Pro heavily draws from:

https://github.com/hendrycks/test

@article{hendryckstest2021,
  title={Measuring Massive Multitask Language Understanding},
  author={Dan Hendrycks and Collin Burns and Steven Basart and Andy Zou and Mantas Mazeika and Dawn Song and Jacob Steinhardt},
  journal={Proceedings of the International Conference on Learning Representations (ICLR)},
  year={2021}
}
@article{hendrycks2021ethics,
  title={Aligning AI With Shared Human Values},
  author={Dan Hendrycks and Collin Burns and Steven Basart and Andrew Critch and Jerry Li and Dawn Song and Jacob Steinhardt},
  journal={Proceedings of the International Conference on Learning Representations (ICLR)},
  year={2021}
}