Models Failure
Hi, it seems like the following models have failed. I would greatly appreciate it if you could let me know what is wrong with the models or relaunch them if there is nothing wrong. Thanks, and have a nice day.
Hi!
We have a new system for running our evaluations on the HF cluster, where leaderboard evaluations get cancelled automatically if a higher-priority job needs the resources.
The jobs eventually get relaunched automatically, but they are displayed as failed in the meantime. We'll try to improve our logging ASAP!
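Concretely, the relaunch amounts to something like the sketch below — a rough, hypothetical illustration only, assuming each eval request is a JSON file carrying a status field (PENDING / RUNNING / FINISHED / FAILED); the real backend may handle this differently.

```python
# Hypothetical sketch: reset preempted/failed eval requests so the backend
# picks them up again. Assumes request files are local JSON files with a
# "status" field; the actual leaderboard backend may differ.
import json
from pathlib import Path

def requeue_failed(requests_dir: str) -> None:
    for path in Path(requests_dir).rglob("*_eval_request_*.json"):
        request = json.loads(path.read_text())
        if request.get("status") == "FAILED":
            request["status"] = "PENDING"  # shown as pending again until rerun
            path.write_text(json.dumps(request, indent=4))

requeue_failed("./requests")  # hypothetical local clone of the requests dataset
```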
Side note: for model failure issues, we require users to point to the correct request file for each model, so we can find the relevant leaderboard logs faster (see this issue for a very good example).
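For example, a quick way to list a model's request files — assuming they are stored in the `open-llm-leaderboard/requests` dataset with paths like `Org/Model_eval_request_<...>.json` (the exact repo layout is an assumption here):

```python
# Hypothetical sketch: find the eval request files stored for a given model,
# assuming the leaderboard keeps them in "open-llm-leaderboard/requests".
from huggingface_hub import HfApi

model_id = "Weyaxi/Dolphin-Nebula-7B"  # example model from this thread
files = HfApi().list_repo_files("open-llm-leaderboard/requests", repo_type="dataset")
print([f for f in files if f.startswith(f"{model_id}_eval_request")])
```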
Hi, thanks for your interest and explanation. Have a nice day.
No problem! :)
(However, if you see that the models still haven't gone through properly in a week, feel free to ping us and we'll investigate in more detail to see if something else happened.)
Hi, it has been a week, so I'm pinging you @clefourrier
Here are the request files of the FAILED models. I would greatly appreciate it if you could let me know what is wrong with them or relaunch them if there is nothing wrong.
- Weyaxi/CollectiveCognition-v1.1-Nebula-7B_eval_request_False_float16_Original.json
- Weyaxi/Dolphin-Nebula-7B_eval_request_False_float16_Original.json
- Weyaxi/Mistral-11B-OpenOrcaPlatypus_eval_request_False_bfloat16_Original.json
- Weyaxi/OpenHermes-2.5-Nebula-v2-7B_eval_request_False_float16_Original.json
- Weyaxi/OpenOrca-Zephyr-7B_eval_request_False_bfloat16_Original.json
- Weyaxi/SynthIA-v1.3-Nebula-v2-7B_eval_request_False_float16_Original.json
- Weyaxi/zephyr-beta-Nebula-v2-7B_eval_request_False_float16_Original.json
- PulsarAI/Nebula-v2-7B_eval_request_False_float16_Original.json
Btw, this one is finished but has no results: (https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard/discussions/394#655c7b0a1585a0f15f3aa9fd)
Some models were cancelled and never relaunched; I'm adding them back to the queue:
- Weyaxi/Dolphin-Nebula-7B_eval_request_False_float16_Original.json
- Weyaxi/OpenHermes-2.5-Nebula-v2-7B_eval_request_False_float16_Original.json
- Weyaxi/OpenOrca-Zephyr-7B_eval_request_False_bfloat16_Original.json
- Weyaxi/SynthIA-v1.3-Nebula-v2-7B_eval_request_False_float16_Original.json
- PulsarAI/Nebula-v2-7B_eval_request_False_float16_Original.json

This model crashed because of a node failure, adding it back too:
- Weyaxi/zephyr-beta-Nebula-v2-7B_eval_request_False_float16_Original.json

I think this one was started before we updated the backend's transformers to a version that supports Mistral models (ping @SaylorTwift, can you check the transformers version in the backend?):
- Weyaxi/CollectiveCognition-v1.1-Nebula-7B_eval_request_False_float16_Original.json

Lastly, this model is faulty:
- Weyaxi/Mistral-11B-OpenOrcaPlatypus_eval_request_False_bfloat16_Original.json

It failed with:
File ".../python3.10/site-packages/transformers/models/auto/auto_factory.py", line 565, in from_pretrained
return model_class.from_pretrained(
File ".../python3.10/site-packages/transformers/modeling_utils.py", line 3307, in from_pretrained
) = cls._load_pretrained_model(
File ".../python3.10/site-packages/transformers/modeling_utils.py", line 3695, in _load_pretrained_model
new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
File ".../lib/python3.10/site-packages/transformers/modeling_utils.py", line 741, in _load_state_dict_into_meta_model
set_module_tensor_to_device(model, param_name, param_device, **set_module_kwargs)
File ".../lib/python3.10/site-packages/accelerate/utils/modeling.py", line 281, in set_module_tensor_to_device
raise ValueError(
ValueError: Trying to set a tensor of shape torch.Size([32002, 4096]) in "weight" (which has shape torch.Size([32000, 4096])), this look incorrect.
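For reference, this kind of shape mismatch usually means the checkpoint's embedding matrix was resized (e.g. to 32002 rows after adding special tokens) while the uploaded config.json still declares a vocab_size of 32000. A minimal sketch of how one might check, assuming that is the cause here:

```python
# Hypothetical sketch: compare the declared vocab size with the tokenizer's,
# assuming the 32002-vs-32000 mismatch comes from added special tokens whose
# embedding resize was not reflected in config.json.
from transformers import AutoConfig, AutoTokenizer

repo = "Weyaxi/Mistral-11B-OpenOrcaPlatypus"  # the model that failed above
config = AutoConfig.from_pretrained(repo)
tokenizer = AutoTokenizer.from_pretrained(repo)

print(config.vocab_size, len(tokenizer))
# If these disagree, updating config.vocab_size to match the checkpoint (or
# resizing the embeddings with model.resize_token_embeddings(len(tokenizer))
# before re-uploading) should let from_pretrained load the weights.
```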
Thank you very much for your patience and for linking the request files :)
Hi, thank you very much for relaunching. I will check the last model.
Closing, feel free to reopen if needed
Hi @clefourrier, the following models have failed. Could you please share what went wrong, or relaunch them?
Hi,
The new cluster is having serious connectivity problems. We are putting all evals on hold until it's fixed, and we'll relaunch all FAILED evals from the past two days.
We solved the connectivity issues and the models have been evaluated :)