Understanding raw result data files
I'm interested in parsing the detailed results files for each submission on the leaderboard (open-llm-leaderboard/details_*). It looks like each benchmark has its own format -- are there:
- Open source parsers for any of them? I've been rolling my own, but if there's something I can lean on that'd be better.
- Documentation on data semantics (what does each key mean?). Some key-values seem straightforward, but it'd be nice to get an authoritative answer.
Hi! You can find a way to download the details using the Hub API. We do not have an official way to do it in bulk, as we haven't needed that feature.
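For reference, here is a minimal sketch of what that can look like with `huggingface_hub`. The repo id below is a placeholder; actual details repos follow the `open-llm-leaderboard/details_*` pattern mentioned above:

```python
from huggingface_hub import HfApi, hf_hub_download

api = HfApi()

# Placeholder repo id: substitute the actual details repo for the
# submission you care about.
repo_id = "open-llm-leaderboard/details_<org>__<model>"

# List every file in the details dataset repo, then download each one.
files = api.list_repo_files(repo_id, repo_type="dataset")
for filename in files:
    local_path = hf_hub_download(repo_id=repo_id, filename=filename, repo_type="dataset")
    print(local_path)
```

`snapshot_download` from the same library can also fetch an entire repo in one call if you want everything locally.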
As for the meaning of the keys: we went through multiple iterations of this logging system, so they might differ a bit. However, we usually keep:
- For multiple-choice tasks:
  - the input of the model
  - the different available choices
  - the loglikelihood generated by the model for each of those choices
  - the tokenized versions of the input and of each choice
- For generative tasks:
  - the input, in text and tokenized form
  - the generated answer, in text and tokenized form
  - the target
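To make that concrete, here is a rough sketch of the two record shapes as plain Python dicts. All key names here are illustrative (as noted above, they changed across logging iterations); only the general structure is what you can rely on.

```python
# Illustrative only: actual key names vary between logging iterations.
multiple_choice_record = {
    "input": "Question: ...\nA. ...\nB. ...",    # what the model was shown
    "choices": ["A", "B", "C", "D"],             # the available choices
    "loglikelihoods": [-1.2, -0.4, -3.1, -2.8],  # one loglikelihood per choice
    "input_tokens": [...],                       # tokenized input
    "choices_tokens": [[...], [...], [...], [...]],  # tokenized choices
}

generative_record = {
    "input": "Translate to French: ...",  # input, text form
    "input_tokens": [...],                # input, tokenized form
    "prediction": "...",                  # generated answer, text form
    "prediction_tokens": [...],           # generated answer, tokenized form
    "target": "...",                      # reference answer
}
```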
Thanks!
> you can find a way to download the details using the Hub API

I started by using the datasets API, but I've found the Hub API to be a lot more straightforward (and reliable).

> we went through multiple iterations of this logging system, so they might differ a bit
One difference I'm particularly trying to get right is how metrics are stored. I've noticed three variations:
- As a dictionary in a `metrics` column, in which keys are the metric names and values are the corresponding results
- As metric-named columns (a column literally named `acc`, for example, alongside other columns for prompts and whatnot)
- As metric-named columns prefixed with `metric.` (a column named `metric.acc`, for example)
Are those the only variations I might find?
Yeah, those should be the only variations. The `metrics` dict will have different values inside, and those can change; for example, the `exact_match` metric might sometimes be called `em`.
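Putting those together, a normalizer for the three layouts might look like the sketch below. The metric names and the alias map are hypothetical seed lists based only on what was mentioned in this thread, not an exhaustive inventory:

```python
# Hypothetical alias map; extend it with aliases you actually encounter.
METRIC_ALIASES = {"em": "exact_match"}

# Hypothetical seed set of bare metric column names; extend as you find more.
KNOWN_METRICS = {"acc", "acc_norm", "exact_match", "em", "f1"}

def normalize_metrics(row: dict) -> dict:
    """Extract {metric_name: value} from a details row, covering the three
    layouts described above: a 'metrics' dict column, bare metric-named
    columns, and 'metric.'-prefixed columns."""
    metrics = {}
    # Variation 1: a "metrics" column holding a dict of name -> value.
    if isinstance(row.get("metrics"), dict):
        metrics.update(row["metrics"])
    for key, value in row.items():
        # Variation 3: columns prefixed with "metric.".
        if key.startswith("metric."):
            metrics[key[len("metric."):]] = value
        # Variation 2: bare columns named after a known metric.
        elif key in KNOWN_METRICS:
            metrics[key] = value
    # Fold aliases (e.g. "em") into their canonical names.
    return {METRIC_ALIASES.get(name, name): value
            for name, value in metrics.items()}
```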
Hi everyone!
It seems like we can close this discussion (at least for now). Please feel free to open a new one in case of any questions!