Models Failure
Hi, it seems like the following models have failed. I would greatly appreciate it if you could let me know what is wrong with the models or relaunch them if there is nothing wrong. Thanks, and have a nice day.
Hi!
We have a new system for running our evaluations on the HF cluster, where leaderboard evaluations get cancelled automatically if a higher-priority job needs the resources.
The jobs eventually get relaunched automatically, but they are displayed as failed in the meantime. We'll try to improve our logging ASAP!
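Concretely, the relaunch amounts to something like the sketch below — a rough, hypothetical illustration only, assuming each eval request is a JSON file carrying a status field (PENDING / RUNNING / FINISHED / FAILED); the real backend may handle this differently.

```python
# Hypothetical sketch: reset preempted/failed eval requests so the backend
# picks them up again. Assumes request files are local JSON files with a
# "status" field; the actual leaderboard backend may differ.
import json
from pathlib import Path

def requeue_failed(requests_dir: str) -> None:
    for path in Path(requests_dir).rglob("*_eval_request_*.json"):
        request = json.loads(path.read_text())
        if request.get("status") == "FAILED":
            request["status"] = "PENDING"  # shown as pending again until rerun
            path.write_text(json.dumps(request, indent=4))

requeue_failed("./requests")  # hypothetical local clone of the requests dataset
```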
Side note: for model failure issues, we require users to point to the correct request file for each model, so we can find the relevant leaderboard logs faster (see this issue for a very good example).
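For example, a quick way to list a model's request files — assuming they are stored in the `open-llm-leaderboard/requests` dataset with paths like `Org/Model_eval_request_<...>.json` (the exact repo layout is an assumption here):

```python
# Hypothetical sketch: find the eval request files stored for a given model,
# assuming the leaderboard keeps them in "open-llm-leaderboard/requests".
from huggingface_hub import HfApi

model_id = "Weyaxi/Dolphin-Nebula-7B"  # example model from this thread
files = HfApi().list_repo_files("open-llm-leaderboard/requests", repo_type="dataset")
print([f for f in files if f.startswith(f"{model_id}_eval_request")])
```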
Hi, thanks for your interest and explanation. Have a nice day.
No problem! :)
(However, if you see that the models still haven't gone through properly in a week, feel free to ping us and we'll investigate in more detail to see if something else happened.)
Hi, it has been a week, so I'm pinging you @clefourrier
Here are the request files of the FAILED models. I would greatly appreciate it if you could let me know what is wrong with them or relaunch them if there is nothing wrong.
- Weyaxi/CollectiveCognition-v1.1-Nebula-7B_eval_request_False_float16_Original.json
- Weyaxi/Dolphin-Nebula-7B_eval_request_False_float16_Original.json
- Weyaxi/Mistral-11B-OpenOrcaPlatypus_eval_request_False_bfloat16_Original.json
- Weyaxi/OpenHermes-2.5-Nebula-v2-7B_eval_request_False_float16_Original.json
- Weyaxi/OpenOrca-Zephyr-7B_eval_request_False_bfloat16_Original.json
- Weyaxi/SynthIA-v1.3-Nebula-v2-7B_eval_request_False_float16_Original.json
- Weyaxi/zephyr-beta-Nebula-v2-7B_eval_request_False_float16_Original.json
- PulsarAI/Nebula-v2-7B_eval_request_False_float16_Original.json
Btw, this one is finished but has no results: (https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard/discussions/394#655c7b0a1585a0f15f3aa9fd)
Some models were cancelled and never relaunched; I'm adding them back to the queue:
- Weyaxi/Dolphin-Nebula-7B_eval_request_False_float16_Original.json
- Weyaxi/OpenHermes-2.5-Nebula-v2-7B_eval_request_False_float16_Original.json
- Weyaxi/OpenOrca-Zephyr-7B_eval_request_False_bfloat16_Original.json
- Weyaxi/SynthIA-v1.3-Nebula-v2-7B_eval_request_False_float16_Original.json
- PulsarAI/Nebula-v2-7B_eval_request_False_float16_Original.json

This model crashed because of a node failure, adding it back too:
- Weyaxi/zephyr-beta-Nebula-v2-7B_eval_request_False_float16_Original.json

I think this one was started before we updated the backend's transformers to a version that supports Mistral models (ping @SaylorTwift, can you check the transformers version in the backend?):
- Weyaxi/CollectiveCognition-v1.1-Nebula-7B_eval_request_False_float16_Original.json

Lastly, this model is faulty:
- Weyaxi/Mistral-11B-OpenOrcaPlatypus_eval_request_False_bfloat16_Original.json

It failed with:
File ".../python3.10/site-packages/transformers/models/auto/auto_factory.py", line 565, in from_pretrained
return model_class.from_pretrained(
File ".../python3.10/site-packages/transformers/modeling_utils.py", line 3307, in from_pretrained
) = cls._load_pretrained_model(
File ".../python3.10/site-packages/transformers/modeling_utils.py", line 3695, in _load_pretrained_model
new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
File ".../lib/python3.10/site-packages/transformers/modeling_utils.py", line 741, in _load_state_dict_into_meta_model
set_module_tensor_to_device(model, param_name, param_device, **set_module_kwargs)
File ".../lib/python3.10/site-packages/accelerate/utils/modeling.py", line 281, in set_module_tensor_to_device
raise ValueError(
ValueError: Trying to set a tensor of shape torch.Size([32002, 4096]) in "weight" (which has shape torch.Size([32000, 4096])), this look incorrect.
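For reference, this kind of shape mismatch usually means the checkpoint's embedding matrix was resized (e.g. to 32002 rows after adding special tokens) while the uploaded config.json still declares a vocab_size of 32000. A minimal sketch of how one might check, assuming that is the cause here:

```python
# Hypothetical sketch: compare the declared vocab size with the tokenizer's,
# assuming the 32002-vs-32000 mismatch comes from added special tokens whose
# embedding resize was not reflected in config.json.
from transformers import AutoConfig, AutoTokenizer

repo = "Weyaxi/Mistral-11B-OpenOrcaPlatypus"  # the model that failed above
config = AutoConfig.from_pretrained(repo)
tokenizer = AutoTokenizer.from_pretrained(repo)

print(config.vocab_size, len(tokenizer))
# If these disagree, updating config.vocab_size to match the checkpoint (or
# resizing the embeddings with model.resize_token_embeddings(len(tokenizer))
# before re-uploading) should let from_pretrained load the weights.
```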
Thank you very much for your patience and for linking the request files :)
Hi, thank you very much for relaunching. I will check the last model.
Closing, feel free to reopen if needed
Hi @clefourrier, the following models have failed. Could you please share what went wrong, or relaunch them?
Hi,
The new cluster is having serious connectivity problems. We are putting all evals on hold until it's fixed, and we'll relaunch all FAILED evals from the past two days.
We solved the connectivity issues and the models have been evaluated :)