Spaces: upstage/open-ko-llm-leaderboard

Sean Cho committed on Sep 8, 2023

Commit a86ccea • 1 Parent(s): ce5c604

Update texts

Files changed (2)

app.py +3 -3
src/assets/text_content.py +12 -80

app.py CHANGED

@@ -322,7 +322,7 @@ with demo:
                         )
                     with gr.Row():
                         deleted_models_visibility = gr.Checkbox(
-                            value=True, label="Show gated/private/deleted models", interactive=True
+                            value=True, label="👀 Also show deleted/private models", interactive=True
                         )
                 with gr.Column(min_width=320):
                     search_bar = gr.Textbox(
@@ -455,7 +455,7 @@ with demo:
                                 max_rows=5,
                             )
                     with gr.Accordion(
-                        f"🔄 Evaluation in-progress queue ({len(running_eval_queue_df)})",
+                        f"🔄 Evaluations in progress ({len(running_eval_queue_df)})",
                         open=False,
                     ):
                         with gr.Row():
@@ -467,7 +467,7 @@ with demo:
                             )
                     with gr.Accordion(
-                        f"⏳ Evaluation pending queue ({len(pending_eval_queue_df)})",
+                        f"⏳ Evaluations pending ({len(pending_eval_queue_df)})",
                         open=False,
                     ):
                         with gr.Row():
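
The accordion titles in this diff embed `len(...)` of a queue dataframe in an f-string. The sketch below shows, in plain Python, that such a label is a snapshot taken when the string is built. The list, the helper name `accordion_label`, and the label text are hypothetical stand-ins, not code from app.py, and whether the app refreshes these titles elsewhere is not visible in this diff.

```python
# Hypothetical stand-in for a queue dataframe such as running_eval_queue_df.
running_eval_queue = ["model-a", "model-b"]


def accordion_label(prefix: str, queue) -> str:
    """Build an accordion title that includes the current queue length."""
    return f"{prefix} ({len(queue)})"


# The f-string is evaluated once, at the moment the label is built.
label = accordion_label("🔄 Running", running_eval_queue)

# Growing the queue afterwards does not change the already-built string.
running_eval_queue.append("model-c")
refreshed = accordion_label("🔄 Running", running_eval_queue)

print(label)
print(refreshed)
```

If the count should follow the queue, the title has to be rebuilt after the queue changes, as the second call does.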

src/assets/text_content.py CHANGED

@@ -40,29 +40,18 @@ To evaluate models in a way that suits the LLM era, common sense, expert knowledge, reasoning, hallucination,
 GPUs used for evaluation were provided by KT.
-## Details and logs
-You can find:
+## More details
 - More detailed numerical results: https://huggingface.co/datasets/open-llm-leaderboard/results
 - Details on model inputs and outputs: https://huggingface.co/datasets/open-llm-leaderboard/details
 - Model evaluation queue and evaluation status: https://huggingface.co/datasets/open-llm-leaderboard/requests
-## Reproducibility
+## Reproducing results
 To reproduce the evaluation results, use the dataset from [this version](https://github.com/EleutherAI/lm-evaluation-harness/tree/b281b0921b636bc36ad05c0b0b0763bd6dd43463). (What follows covers the code and evaluation environment, so it is skipped for now.)
-The total batch size we get for models which fit on one A100 node is 16 (8 GPUs * 2). If you don't use parallelism, adapt your batch size to fit.
-*You can expect results to vary slightly for different batch sizes because of padding.*
-The tasks and few shots parameters are:
-- ARC: 25-shot, *arc-challenge* (`acc_norm`)
-- HellaSwag: 10-shot, *hellaswag* (`acc_norm`)
-- TruthfulQA: 0-shot, *truthfulqa-mc* (`mc2`)
-- MMLU: 5-shot, *hendrycksTest-abstract_algebra,hendrycksTest-anatomy,hendrycksTest-astronomy,hendrycksTest-business_ethics,hendrycksTest-clinical_knowledge,hendrycksTest-college_biology,hendrycksTest-college_chemistry,hendrycksTest-college_computer_science,hendrycksTest-college_mathematics,hendrycksTest-college_medicine,hendrycksTest-college_physics,hendrycksTest-computer_security,hendrycksTest-conceptual_physics,hendrycksTest-econometrics,hendrycksTest-electrical_engineering,hendrycksTest-elementary_mathematics,hendrycksTest-formal_logic,hendrycksTest-global_facts,hendrycksTest-high_school_biology,hendrycksTest-high_school_chemistry,hendrycksTest-high_school_computer_science,hendrycksTest-high_school_european_history,hendrycksTest-high_school_geography,hendrycksTest-high_school_government_and_politics,hendrycksTest-high_school_macroeconomics,hendrycksTest-high_school_mathematics,hendrycksTest-high_school_microeconomics,hendrycksTest-high_school_physics,hendrycksTest-high_school_psychology,hendrycksTest-high_school_statistics,hendrycksTest-high_school_us_history,hendrycksTest-high_school_world_history,hendrycksTest-human_aging,hendrycksTest-human_sexuality,hendrycksTest-international_law,hendrycksTest-jurisprudence,hendrycksTest-logical_fallacies,hendrycksTest-machine_learning,hendrycksTest-management,hendrycksTest-marketing,hendrycksTest-medical_genetics,hendrycksTest-miscellaneous,hendrycksTest-moral_disputes,hendrycksTest-moral_scenarios,hendrycksTest-nutrition,hendrycksTest-philosophy,hendrycksTest-prehistory,hendrycksTest-professional_accounting,hendrycksTest-professional_law,hendrycksTest-professional_medicine,hendrycksTest-professional_psychology,hendrycksTest-public_relations,hendrycksTest-security_studies,hendrycksTest-sociology,hendrycksTest-us_foreign_policy,hendrycksTest-virology,hendrycksTest-world_religions* (average of all the results `acc`)
-## Quantization
-To get more information about quantization, see:
-- 8 bits: [blog post](https://huggingface.co/blog/hf-bitsandbytes-integration), [paper](https://arxiv.org/abs/2208.07339)
-- 4 bits: [blog post](https://huggingface.co/blog/4bit-transformers-bitsandbytes), [paper](https://arxiv.org/abs/2305.14314)
+## More
+If you have questions, you can check the FAQ [here](https://huggingface.co/spaces/BearSean/leaderboard-test/discussions/1)!
+We have also gathered great resources from the community, other teams, and research labs here!
+If you still have questions, you can check our FAQ here! We also gather cool resources from the community, other teams, and other labs here!
 """
 EVALUATION_QUEUE_TEXT = f"""
@@ -98,68 +87,11 @@ safetensors is a new format for storing weights, and is much safer
 CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
 CITATION_BUTTON_TEXT = r"""
-@misc{open-llm-leaderboard,
-  author = {Edward Beeching, Clémentine Fourrier, Nathan Habib, Sheon Han, Nathan Lambert, Nazneen Rajani, Omar Sanseviero, Lewis Tunstall, Thomas Wolf},
-  title = {Open LLM Leaderboard},
+@misc{open-ko-llm-leaderboard,
+  author = {},
+  title = {Open Ko LLM Leaderboard},
   year = {2023},
-  publisher = {Hugging Face},
-  howpublished = "\url{https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard}"
-}
-@software{eval-harness,
-  author       = {Gao, Leo and
-                  Tow, Jonathan and
-                  Biderman, Stella and
-                  Black, Sid and
-                  DiPofi, Anthony and
-                  Foster, Charles and
-                  Golding, Laurence and
-                  Hsu, Jeffrey and
-                  McDonell, Kyle and
-                  Muennighoff, Niklas and
-                  Phang, Jason and
-                  Reynolds, Laria and
-                  Tang, Eric and
-                  Thite, Anish and
-                  Wang, Ben and
-                  Wang, Kevin and
-                  Zou, Andy},
-  title        = {A framework for few-shot language model evaluation},
-  month        = sep,
-  year         = 2021,
-  publisher    = {Zenodo},
-  version      = {v0.0.1},
-  doi          = {10.5281/zenodo.5371628},
-  url          = {https://doi.org/10.5281/zenodo.5371628}
+  publisher = {Upstage, National Information Society Agency},
+  howpublished = "\url{https://huggingface.co/spaces/BearSean/leaderboard-test/discussions/1}"
 }
-@misc{clark2018think,
-      title={Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge},
-      author={Peter Clark and Isaac Cowhey and Oren Etzioni and Tushar Khot and Ashish Sabharwal and Carissa Schoenick and Oyvind Tafjord},
-      year={2018},
-      eprint={1803.05457},
-      archivePrefix={arXiv},
-      primaryClass={cs.AI}
-}
-@misc{zellers2019hellaswag,
-      title={HellaSwag: Can a Machine Really Finish Your Sentence?},
-      author={Rowan Zellers and Ari Holtzman and Yonatan Bisk and Ali Farhadi and Yejin Choi},
-      year={2019},
-      eprint={1905.07830},
-      archivePrefix={arXiv},
-      primaryClass={cs.CL}
-}
-@misc{hendrycks2021measuring,
-      title={Measuring Massive Multitask Language Understanding},
-      author={Dan Hendrycks and Collin Burns and Steven Basart and Andy Zou and Mantas Mazeika and Dawn Song and Jacob Steinhardt},
-      year={2021},
-      eprint={2009.03300},
-      archivePrefix={arXiv},
-      primaryClass={cs.CY}
-}
-@misc{lin2022truthfulqa,
-      title={TruthfulQA: Measuring How Models Mimic Human Falsehoods},
-      author={Stephanie Lin and Jacob Hilton and Owain Evans},
-      year={2022},
-      eprint={2109.07958},
-      archivePrefix={arXiv},
-      primaryClass={cs.CL}
-}"""
+"""
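
The Reproducibility text removed by this commit describes the MMLU score as the average of `acc` over the `hendrycksTest-*` subtasks. A minimal sketch of that averaging follows, assuming per-task results arrive as a dictionary keyed by task name. The dictionary layout, the helper name `mmlu_average`, and the scores are illustrative assumptions, not the leaderboard's actual code or data.

```python
from statistics import mean

# Made-up scores for illustration only; these are not real results.
results = {
    "hendrycksTest-abstract_algebra": {"acc": 0.30},
    "hendrycksTest-anatomy": {"acc": 0.50},
    "hendrycksTest-astronomy": {"acc": 0.40},
    "arc-challenge": {"acc_norm": 0.55},  # not an MMLU subtask, so ignored
}


def mmlu_average(task_results: dict) -> float:
    """Unweighted mean of `acc` over every hendrycksTest-* subtask."""
    scores = [
        metrics["acc"]
        for task, metrics in task_results.items()
        if task.startswith("hendrycksTest-")
    ]
    if not scores:
        raise ValueError("no hendrycksTest-* results found")
    return mean(scores)


print(round(mmlu_average(results), 4))
```

Each subtask counts equally here regardless of how many questions it contains. That matches the "average of all the results" wording but remains an interpretation of it.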