Evaluation

#4
by huangyt - opened

I hope this message finds you well! I'm am very interested in your work on evaluating TMMLU+, DRCD, Table, and MMLU.
I noticed that you've used a modified version of EleutherAI's lm-evaluation-harness for these evaluations. Would you be willing to share the changes you made to the code? I'm particularly curious about how you adapted it to assess these specific datasets and models.

Thanks for taking the time to read and respond.

MediaTek Research org

Hi huangyt,

Thank you for your interest in our evaluation. We have plans to release our evaluation code.
For the time being, we use the perplexity method for all multiple-choice questions.
We frame all few-shot problems as a multi-round chat dialogue when using chat models to evaluate, where one round corresponds to one shot.

Jeff

Hi Splend1dchan,

I'm also very interested in your work on evaluation.
Could you please inform me about the type of machine being used for the evaluation, and how long it is expected to take?

Thanks

MediaTek Research org

Hi Jupiter-Y,

The evaluation takes ~2hrs for TMMLU + MMLU + table understanding, using 8*H100

Jeff

Sign up or log in to comment