Understanding raw result data files
I'm interested in parsing the detailed results files for each submission on the leaderboard (open-llm-leaderboard/details_*). It looks like each benchmark has its own format -- are there:
- Open source parsers for any of them? I've been rolling my own, but if there's something I can lean on that'd be better.
- Documentation on data semantics (what does each key mean?). Some key-values seem straightforward, but it'd be nice to get an authoritative answer.
Hi! You can find a way to download the details using the Hub API. We do not have an official way to do it in bulk, as we haven't needed that feature.
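For reference, here is a minimal sketch of what that can look like with `huggingface_hub`. The repo id below is a placeholder; actual details repos follow the `open-llm-leaderboard/details_*` pattern mentioned above:

```python
from huggingface_hub import HfApi, hf_hub_download

api = HfApi()

# Placeholder repo id: substitute the actual details repo for the
# submission you care about.
repo_id = "open-llm-leaderboard/details_<org>__<model>"

# List every file in the details dataset repo, then download each one.
files = api.list_repo_files(repo_id, repo_type="dataset")
for filename in files:
    local_path = hf_hub_download(repo_id=repo_id, filename=filename, repo_type="dataset")
    print(local_path)
```

`snapshot_download` from the same library can also fetch an entire repo in one call if you want everything locally.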
As for the meaning of the keys: we went through multiple iterations of this logging system, so they might differ a bit. However, we usually keep:
- For multiple-choice tasks:
  - the input of the model
  - the different available choices
  - the loglikelihood generated by the model for each of those choices
  - the tokenized versions of the input and of each choice
- For generative tasks:
  - the input, in text and tokenized form
  - the generated answer, in text and tokenized form
  - the target
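To make that concrete, here is a rough sketch of the two record shapes as plain Python dicts. All key names here are illustrative (as noted above, they changed across logging iterations); only the general structure is what you can rely on.

```python
# Illustrative only: actual key names vary between logging iterations.
multiple_choice_record = {
    "input": "Question: ...\nA. ...\nB. ...",    # what the model was shown
    "choices": ["A", "B", "C", "D"],             # the available choices
    "loglikelihoods": [-1.2, -0.4, -3.1, -2.8],  # one loglikelihood per choice
    "input_tokens": [...],                       # tokenized input
    "choices_tokens": [[...], [...], [...], [...]],  # tokenized choices
}

generative_record = {
    "input": "Translate to French: ...",  # input, text form
    "input_tokens": [...],                # input, tokenized form
    "prediction": "...",                  # generated answer, text form
    "prediction_tokens": [...],           # generated answer, tokenized form
    "target": "...",                      # reference answer
}
```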
Thanks!
> you can find a way to download the details using the Hub API

I started by using the datasets API, but I've found the Hub API to be a lot more straightforward (and reliable).

> we went through multiple iterations of this logging system, so they might differ a bit
One difference I'm particularly trying to get right is how metrics are stored. I've noticed three variations:
- As a dictionary in a `metrics` column, in which keys are the metric names and values are the corresponding results
- As metric-named columns (a column literally named `acc`, for example, alongside other columns for prompts and whatnot)
- As metric-named columns prefixed with `metric.` (a column named `metric.acc`, for example)
Are those the only variations I might find?
Yeah, those should be the only variations. The `metrics` dict will have different values inside, and those can change; for example, the `exact_match` metric might sometimes be called `em`.
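Putting those together, a normalizer for the three layouts might look like the sketch below. The metric names and the alias map are hypothetical seed lists based only on what was mentioned in this thread, not an exhaustive inventory:

```python
# Hypothetical alias map; extend it with aliases you actually encounter.
METRIC_ALIASES = {"em": "exact_match"}

# Hypothetical seed set of bare metric column names; extend as you find more.
KNOWN_METRICS = {"acc", "acc_norm", "exact_match", "em", "f1"}

def normalize_metrics(row: dict) -> dict:
    """Extract {metric_name: value} from a details row, covering the three
    layouts described above: a 'metrics' dict column, bare metric-named
    columns, and 'metric.'-prefixed columns."""
    metrics = {}
    # Variation 1: a "metrics" column holding a dict of name -> value.
    if isinstance(row.get("metrics"), dict):
        metrics.update(row["metrics"])
    for key, value in row.items():
        # Variation 3: columns prefixed with "metric.".
        if key.startswith("metric."):
            metrics[key[len("metric."):]] = value
        # Variation 2: bare columns named after a known metric.
        elif key in KNOWN_METRICS:
            metrics[key] = value
    # Fold aliases (e.g. "em") into their canonical names.
    return {METRIC_ALIASES.get(name, name): value
            for name, value in metrics.items()}
```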
Hi everyone!
It seems like we can close this discussion (at least for now). Please feel free to open a new one in case of any questions!