GSM8K results replication

#9
by sam-paech - opened

Hi!

I'm trying to replicate the GSM8K result using lighteval:

lighteval accelerate \
    --model_args "pretrained=HuggingFaceTB/SmolLM2-1.7B-Instruct,max_gen_toks=800,max_length=2000,dtype=bfloat16" \
    --tasks "lighteval|gsm8k|5|1" \
    --override_batch_size 1 \
    --output_dir="./evals/"

It doesn't seem to be working for me. These are the results:

{
  "config_general": {
    "lighteval_sha": "?",
    "num_fewshot_seeds": 1,
    "override_batch_size": 1,
    "max_samples": null,
    "job_id": "",
    "start_time": 11280063.201784864,
    "end_time": 11281750.641923392,
    "total_evaluation_time_secondes": "1687.4401385281235",
    "model_name": "HuggingFaceTB/SmolLM2-1.7B-Instruct",
    "model_sha": "7eb5a4069bde2ddf31c4303463d32e445d3e7d45",
    "model_dtype": "torch.bfloat16",
    "model_size": "3.19 GB"
  },
  "results": {
    "lighteval|gsm8k|5": {
      "maj@8": 0.001516300227445034,
      "maj@8_stderr": 0.0010717793485492638,
      "qem": 0.0,
      "qem_stderr": 0.0
    },
    "all": {
      "maj@8": 0.001516300227445034,
      "maj@8_stderr": 0.0010717793485492638,
      "qem": 0.0,
      "qem_stderr": 0.0
    }
  }
}

Any idea what I should be doing instead?

Hugging Face TB Research org

Hello @sam-paech , with the default PyTorch batching in lighteval, some of the longer 5-shot prompts nearly fill a model length of 2000 tokens, so the generations for most batches get cut off after a single token.
We're using vLLM's dynamic batching to get around this: https://github.com/huggingface/smollm/blob/main/evaluation/README.md
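
Roughly, a vLLM-backed run looks something like the sketch below (the exact subcommand and flag names vary between lighteval versions, so treat this as a sketch and follow the README above for the exact command):

lighteval vllm \
    --model_args "pretrained=HuggingFaceTB/SmolLM2-1.7B-Instruct,dtype=bfloat16" \
    --tasks "lighteval|gsm8k|5|1" \
    --output_dir="./evals/"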

Great, thanks Anton, will try this.
