hsaest committed on
Commit
7244d87
1 Parent(s): 7ebfa1f

Update content.py

Files changed (1)
  1. content.py +11 -0
content.py CHANGED
@@ -43,6 +43,17 @@ Please refer to [this](https://huggingface.co/datasets/osunlp/TravelPlanner/reso
 
 Submissions made by our team are labelled "TravelPlanner Team". Each submission is automatically evaluated and scored against the predefined metrics. You can then obtain the scores and download the detailed constraint pass rates after the evaluation.
 
+ ## ⚠️ Warnings
+
+ We release our evaluation scripts to foster innovation and aid the development of new methods. We encourage using evaluation feedback on the training set, for example as a reward signal for reinforcement learning, to enhance learning. However, we strictly prohibit any form of cheating on the validation and test sets, in order to uphold the fairness and reliability of the benchmark's evaluation process. We reserve the right to disqualify results if we find any of the following violations:
+
+ 1. Reverse engineering of our dataset, which includes, but is not limited to:
+    - Converting our natural language queries in the test set to structured formats (e.g., JSON) for optimization and unauthorized evaluation.
+    - Deriving data point entries from the hard rules of our data construction process without accessing the actual database.
+    - Other similar manipulations.
+ 2. Hard-coding or hand-writing evaluation cues into prompts, such as direct hints of common sense; this contradicts our goals, as it lacks generalizability and is limited to this specific benchmark.
+ 3. Any other human-interference strategies that are tailored specifically to this benchmark but lack generalization capability.
+
 ## Show Your Results on the Leaderboard
 If you are interested in showing your results on our leaderboard, we invite you to reach out to us. Please send an email to [us](mailto:[email protected]) including the following details: evaluation mode, foundation model, tool-use strategy, planning strategy, organization, and your paper link (if available), along with your submission files.
 """