Batch Inference causes degraded performance

#43
by tanliboy - opened

I want to bring attention to this issue: batch inference with Gemma-2-9b-it on lm-evaluation-harness leads to significantly degraded performance.

1st run with auto batch size (batch size = 1 after auto detection)

| Tasks  | Version | Filter | n-shot | Metric                  | Value  | Stderr   |
|--------|---------|--------|--------|-------------------------|--------|----------|
| ifeval | 2       | none   | 0      | inst_level_loose_acc    | 0.7674 | ± N/A    |
|        |         | none   | 0      | inst_level_strict_acc   | 0.7554 | ± N/A    |
|        |         | none   | 0      | prompt_level_loose_acc  | 0.6784 | ± 0.0201 |
|        |         | none   | 0      | prompt_level_strict_acc | 0.6636 | ± 0.0203 |

2nd run with batch size = 32

| Tasks  | Version | Filter | n-shot | Metric                  | Value  | Stderr   |
|--------|---------|--------|--------|-------------------------|--------|----------|
| ifeval | 2       | none   | 0      | inst_level_loose_acc    | 0.0528 | ± N/A    |
|        |         | none   | 0      | inst_level_strict_acc   | 0.0528 | ± N/A    |
|        |         | none   | 0      | prompt_level_loose_acc  | 0.0462 | ± 0.0090 |
|        |         | none   | 0      | prompt_level_strict_acc | 0.0462 | ± 0.0090 |
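For reference, the two runs above can be reproduced with lm-evaluation-harness invocations along these lines. The exact command line is not given in the thread, so the model id and flags below are reconstructed from the run descriptions, not the author's verbatim command:

```shell
# 1st run: automatic batch-size detection (resolved to batch size 1)
lm_eval --model hf \
  --model_args pretrained=google/gemma-2-9b-it \
  --tasks ifeval \
  --batch_size auto

# 2nd run: fixed batch size of 32, which produced the degraded scores
lm_eval --model hf \
  --model_args pretrained=google/gemma-2-9b-it \
  --tasks ifeval \
  --batch_size 32
```

The only difference between the two runs is the `--batch_size` value, which is what isolates batching as the source of the score drop.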

It is likely related to the sliding-window attention issue.

Google org

Hi @tanliboy, I hope this issue has been resolved on GitHub (where the same issue was reported). Could you please let me know if you have any remaining concerns, or feel free to close this issue. Thank you.

Thanks! Yes, it is resolved.

tanliboy changed discussion status to closed
