English Benchmarking

#4 opened by msalhab96

I'm trying to replicate the results reported in the paper using lm-evaluation-harness (lm-harness) under the zero-shot setting, but the values in the paper do not match the values I get for the foundation model. Is it the same model that was benchmarked in the paper?

Yes, it's the same model. Please share the results that you got and the metric you are trying to compare.

I tried HellaSwag, ARC-Challenge, and MMLU, zero-shot as in the paper, using https://github.com/EleutherAI/lm-evaluation-harness

For arc_challenge, here is how my command looks:

python main.py --model hf-causal --model_args pretrained=inception-mbzuai/jais-13b,trust_remote_code=True --tasks arc_challenge --num_fewshot 0

Thanks. Could you please share the results (the actual numbers you got) and also which metric you are looking at, "acc" or "acc_norm"?
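
For context: in the harness, "acc" picks the answer option with the highest total log-likelihood, while "acc_norm" first normalizes each option's log-likelihood by its byte length. A minimal illustrative sketch of the distinction (not the harness's actual code; the helper name is made up):

import numpy as np

def pick_answer(option_loglikelihoods, option_texts, normalize=False):
    # "acc" ranks options by raw summed log-likelihood;
    # "acc_norm" divides by each option's byte length first,
    # so longer completions are not penalized just for length.
    scores = [
        ll / len(text.encode("utf-8")) if normalize else ll
        for ll, text in zip(option_loglikelihoods, option_texts)
    ]
    return int(np.argmax(scores))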

Here are the results for hellaswag; the number reported in the paper is 71.8, while the acc_norm I get is 43.75:

{
  "results": {
    "hellaswag": {
      "acc": 0.37134037044413465,
      "acc_stderr": 0.0048217577341567374,
      "acc_norm": 0.43756223859788884,
      "acc_norm_stderr": 0.004950723480149755
    }
  },
  "versions": {
    "hellaswag": 0
  },
  "config": {
    "model": "hf-causal",
    "model_args": "pretrained=inception-mbzuai/jais-13b,trust_remote_code=True",
    "num_fewshot": 0,
    "batch_size": null,
    "batch_sizes": [],
    "device": null,
    "no_cache": false,
    "limit": null,
    "bootstrap_iters": 100000,
    "description_dict": {}
  }
}

The command used:

python main.py --model hf-causal --model_args pretrained=inception-mbzuai/jais-13b,trust_remote_code=True --tasks hellaswag --num_fewshot 0

Thank you for the message!

We have verified the results for the above-mentioned tasks and obtained the same numbers as reported in our paper. The reported results are reproducible when we load the model in its original precision, i.e. float32.
However, when we load the model in float16 we get the results you noted above, so the lower precision appears to be degrading the model's performance.
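
A minimal sketch of the difference, using the standard transformers API; the torch_dtype argument is the only thing that changes between the two setups:

import torch
from transformers import AutoModelForCausalLM

# Original precision: reproduces the paper's numbers (needs ~60 GB of memory).
model_fp32 = AutoModelForCausalLM.from_pretrained(
    "inception-mbzuai/jais-13b",
    trust_remote_code=True,
    torch_dtype=torch.float32,
)

# Half precision: halves the memory footprint but, as noted above,
# lowers the benchmark scores.
# model_fp16 = AutoModelForCausalLM.from_pretrained(
#     "inception-mbzuai/jais-13b",
#     trust_remote_code=True,
#     torch_dtype=torch.float16,
# )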

We suggest loading the model in its original float32 precision to reproduce the results.
You can do this by reserving enough GPU/CPU memory, around 60 GB, on a single GPU or across multiple GPUs (13B parameters × 4 bytes per float32 parameter is already ≈ 52 GB for the weights alone). The following command can be used to run the evaluations:

python main.py --model hf-causal-experimental --model_args use_accelerate=True,pretrained=inception-mbzuai/jais-13b,trust_remote_code=True --tasks hellaswag --num_fewshot 0

The use_accelerate=True flag will load the model over multiple GPUs in an efficient manner.
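
Equivalently, if you are loading the model yourself rather than through the harness, accelerate's device_map="auto" shards the full-precision weights across whatever GPU/CPU memory is available; a sketch under that assumption (requires the accelerate package):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Shard the float32 weights across all visible GPUs,
# spilling to CPU RAM if the GPUs alone are not enough.
model = AutoModelForCausalLM.from_pretrained(
    "inception-mbzuai/jais-13b",
    trust_remote_code=True,
    torch_dtype=torch.float32,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("inception-mbzuai/jais-13b")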

msalhab96 changed discussion status to closed
