ALL Jamba models failing
So far every Jamba model submitted to the leaderboard has failed, including the base model. Any clue what's causing this to happen?
Hi!
Is the architecture integrated into a stable release of transformers?
Could you point us to some of the request files, as indicated in the About, so we can investigate?
One of the fine-tunes: https://huggingface.co/datasets/open-llm-leaderboard/requests/blob/main/mlabonne/Jambatypus-v0.1_eval_request_False_float16_Original.json
EDIT: As it turns out, one of the models has not failed yet! This makes me think it might be a datatype issue? (Once again, logs would be helpful for diagnosis)
Here are all of the failed request URLs so far, for good measure (see the sketch after the list for pulling one down programmatically):
https://huggingface.co/datasets/open-llm-leaderboard/requests/raw/main/ai21labs/Jamba-v0.1_eval_request_False_float16_Original.json
https://huggingface.co/datasets/open-llm-leaderboard/requests/raw/main/KnutJaegersberg/jamba-bagel-4bit_eval_request_False_float16_Original.json
https://huggingface.co/datasets/open-llm-leaderboard/requests/raw/main/mlabonne/Jambalpaca-v0.1_eval_request_False_float16_Original.json
https://huggingface.co/datasets/open-llm-leaderboard/requests/raw/main/mlabonne/Jambatypus-v0.1_eval_request_False_float16_Original.json
https://huggingface.co/datasets/open-llm-leaderboard/requests/raw/main/Severian/Jamba-Hercules_eval_request_False_float16_Original.json
https://huggingface.co/datasets/open-llm-leaderboard/requests/raw/main/Severian/Jamba-Nexus-4xMoE_eval_request_False_float16_Original.json
https://huggingface.co/datasets/open-llm-leaderboard/requests/raw/main/isemmanuelolowe/jamba_chat_4MoE_8k_eval_request_False_float16_Adapter.json
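If it helps, something like this can pull any of these request files down for local inspection (a minimal sketch assuming `huggingface_hub` is installed; the filename is just the first fine-tune from the list above):

```python
# Sketch: fetch one of the leaderboard request files locally and dump its contents.
import json
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="open-llm-leaderboard/requests",
    repo_type="dataset",
    filename="mlabonne/Jambatypus-v0.1_eval_request_False_float16_Original.json",
)
with open(path) as f:
    print(json.dumps(json.load(f), indent=2))  # shows the submitted settings (precision, status, ...)
```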
Hi!
Thanks a lot for the exhaustiveness! Apart from the first model, which seems to have an inherent problem (I provided the log for that one below), it would seem that all the other ones are failing because we updated our bitsandbytes version, and they made breaking changes in their lib around how configs are built and launched. We'll update and relaunch everything.
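For context, here is a rough sketch of the kind of loading path I mean, assuming the breaking change is around how quantization configs are constructed and passed (the model name is just one of the failing 4-bit submissions above):

```python
# Sketch only: pass an explicit BitsAndBytesConfig via quantization_config instead of
# bare load_in_4bit/load_in_8bit kwargs, which newer transformers versions deprecate.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    "KnutJaegersberg/jamba-bagel-4bit",  # hypothetical choice: one of the failing 4-bit repos
    quantization_config=bnb_config,
    device_map="auto",
)
```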
CC @SaylorTwift for the backend and @alozowski for the relaunches once it's fixed.
We'll do this ASAP; hopefully we should be good by this evening.
Other failure:
The fast path is not available because on of `(selective_state_update, selective_scan_fn, causal_conv1d_fn, causal_conv1d_update, mamba_inner_fn)` is None. To install follow https://github.com/state-spaces/mamba/#installation and https://github.com/Dao-AILab/causal-conv1d. If you want to use the naive implementation, set `use_mamba_kernels=False` in the model config
[E ProcessGroupNCCL.cpp:474] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=3000000) ran for 3000063 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:474] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=3000000) ran for 3000057 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:474] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=3000000) ran for 3000072 milliseconds before timing out.
...
[E ProcessGroupNCCL.cpp:488] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:494] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:488] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:494] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:488] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:494] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:915] [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=3000000) ran for 3000063 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:915] [Rank 2] NCCL watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=3000000) ran for 3000072 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:915] [Rank 3] NCCL watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=3000000) ran for 3000057 milliseconds before timing out.
[2024-04-19 01:09:34,012] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 2081115 closing signal SIGTERM
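For reference, the naive-implementation fallback that the first warning points at would look roughly like this (just a sketch assuming the model is loaded through transformers; not the exact harness code):

```python
# Sketch: load Jamba with the naive Mamba path so the fused kernels
# (mamba-ssm / causal-conv1d) aren't required; slower, but avoids the fast-path warning.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "ai21labs/Jamba-v0.1",
    torch_dtype=torch.float16,
    use_mamba_kernels=False,  # forwarded to the Jamba config
    device_map="auto",
)
```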
jondurbin/bagel-jamba-v05_eval_request_False_float16_Original.json has finally failed... I'm guessing it's the same error as all the others, but since it lasted so much longer than the rest, I'm not sure.
Re: the "inherent" error on the ai21labs/Jamba-v0.1 eval
The warning message that begins "The fast path is not available" is in the custom code from that model repo, but I couldn't find the message anywhere in the transformers-library implementation of JambaForCausalLM.
Is it possible that the model was somehow erroneously run with remote code?
EDIT: I was wrong about this; the message is in the HF Jamba code, it just didn't come up in GitHub's search tool the first time.
@devingulliver Normally no, as our production environment is pinned rather than tracking new releases - but it's possible we made a mistake. We'll relaunch them all at the same time anyway.
Hi! Our prod was fixed last week and I relaunched all of the above models - feel free to reopen if you need :)
The models are failing again :/
If it's not bitsandbytes, I'm guessing they're all encountering similar failures to the one you posted earlier?
Yep, same error message.
I'm a bit at a loss, and we have more pressing priorities at the moment, so I'll put this on hold - but I'm reopening so we keep track of the issue.
Perhaps installing the fast Mamba kernels would solve the issue? Provided it doesn't affect the reproducibility of the rest of the environment, of course.
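If that route is taken, a quick sanity check inside the eval image could confirm the kernels are actually importable (package import names taken from the installation links in the warning above):

```python
# Check whether the fused Mamba kernel packages are importable; if either is missing,
# transformers falls back to the slow path (or warns, as in the log above).
import importlib.util

for pkg in ("mamba_ssm", "causal_conv1d"):
    found = importlib.util.find_spec(pkg) is not None
    print(f"{pkg}: {'available' if found else 'missing'}")
```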
It's been a week, so I wanted to follow up on this.
Hi @devingulliver!
We're still in the process of investigating the error; we'll come back when we have an answer.
If it helps at all, I've done some pretraining on a small Jamba model; you could test with that: https://hf.co/pszemraj/jamba-900M-v0.13-KIx2
Closed without comment... were you able to resolve the issue?