Evaluation

by huangyt - opened Jan 26

Jan 26

I hope this message finds you well! I'm am very interested in your work on evaluating TMMLU+, DRCD, Table, and MMLU.
I noticed that you've used a modified version of EleutherAI's lm-evaluation-harness for these evaluations. Would you be willing to share the changes you made to the code? I'm particularly curious about how you adapted it to assess these specific datasets and models.

Thanks for taking the time to read and respond.

Splend1dchan

MediaTek Research org Jan 30

Hi huangyt,

Thank you for your interest in our evaluation. We have plans to release our evaluation code.
For the time being, we use the perplexity method for all multiple-choice questions.
We frame all few-shot problems as a multi-round chat dialogue when using chat models to evaluate, where one round corresponds to one shot.

Jeff

Jupiter-Y

Feb 22

Hi Splend1dchan,

I'm also very interested in your work on evaluation.
Could you please inform me about the type of machine being used for the evaluation, and how long it is expected to take?

Thanks

Splend1dchan

MediaTek Research org Feb 22

Hi Jupiter-Y,

The evaluation takes ~2hrs for TMMLU + MMLU + table understanding, using 8*H100

Jeff

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment