Where is the test dataset?
Checking https://github.com/rohan598/ConTextual/blob/main/data/contextual_all.csv and https://huggingface.co/datasets/ucla-contextual/contextual_all respectively, I only found the full dataset, not the test dataset. Where is the test dataset?
The description is ambiguous about which split (train, test, or full) the leaderboard uses to report its evaluation results. It would help to state the evaluation dataset explicitly in the leaderboard's documentation.
Hi @zhiminy,
Apologies for the confusion.
We have two leaderboards: (val) and (test).
For the val leaderboard, please use contextual_val.csv.
For the test leaderboard, please use contextual_all.csv.
Note: this is an evaluation-only benchmark, so there are no training samples. The "train" split shown in that image is a naming convention of the platform (we will look into changing it).
The val leaderboard gives you a quick idea of how well your model might perform on the overall dataset and how well it understands these contextual tasks on text-rich images.
The test leaderboard is the final evaluation of your model's performance on all the samples in this dataset.
To prevent over-engineering of the benchmark, we release only part of the (image, instruction, response) triplets (100 out of 506) for validation, while keeping the rest hidden.
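For reference, here is a minimal sketch of loading the two splits mentioned above with pandas. The raw-file URLs are assumptions derived from the GitHub blob link earlier in this thread, and contextual_val.csv is assumed to sit alongside contextual_all.csv in the data/ directory.
```python
# Minimal sketch (assumption: the raw URLs below mirror the GitHub blob link
# shared in this thread, and contextual_val.csv lives in the same data/ folder).
import pandas as pd

BASE = "https://raw.githubusercontent.com/rohan598/ConTextual/main/data"

val = pd.read_csv(f"{BASE}/contextual_val.csv")   # 100 released samples -> val leaderboard
full = pd.read_csv(f"{BASE}/contextual_all.csv")  # all 506 samples -> test leaderboard

print(len(val), len(full))  # expected per the thread: 100 and 506
```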
Thanks for your explanation! Considering that "all" actually refers to "test," it would be beneficial to standardize the terminology to avoid any potential confusion among users.
Thanks for spotting this and for the suggestion. We have updated all ConTextual resources for consistency. If you still find something misaligned, feel free to reopen this issue!