ALL Jamba models failing
So far every Jamba model submitted to the leaderboard has failed, including the base model. Any clue what's causing this to happen?
Hi!
Is the architecture integrated into a stable release of transformers?
Could you point us to some of the request files, as indicated in the About, so we can investigate?
One of the fine-tunes: https://huggingface.co/datasets/open-llm-leaderboard/requests/blob/main/mlabonne/Jambatypus-v0.1_eval_request_False_float16_Original.json
EDIT: As it turns out, one of the models has not failed yet! This makes me think it might be a datatype issue? (Once again, logs would be helpful for diagnosis)
Here are all of the failed request URLs so far, for good measure (see the sketch after the list for pulling one down programmatically):
https://huggingface.co/datasets/open-llm-leaderboard/requests/raw/main/ai21labs/Jamba-v0.1_eval_request_False_float16_Original.json
https://huggingface.co/datasets/open-llm-leaderboard/requests/raw/main/KnutJaegersberg/jamba-bagel-4bit_eval_request_False_float16_Original.json
https://huggingface.co/datasets/open-llm-leaderboard/requests/raw/main/mlabonne/Jambalpaca-v0.1_eval_request_False_float16_Original.json
https://huggingface.co/datasets/open-llm-leaderboard/requests/raw/main/mlabonne/Jambatypus-v0.1_eval_request_False_float16_Original.json
https://huggingface.co/datasets/open-llm-leaderboard/requests/raw/main/Severian/Jamba-Hercules_eval_request_False_float16_Original.json
https://huggingface.co/datasets/open-llm-leaderboard/requests/raw/main/Severian/Jamba-Nexus-4xMoE_eval_request_False_float16_Original.json
https://huggingface.co/datasets/open-llm-leaderboard/requests/raw/main/isemmanuelolowe/jamba_chat_4MoE_8k_eval_request_False_float16_Adapter.json
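If it helps, something like this can pull any of these request files down for local inspection (a minimal sketch assuming `huggingface_hub` is installed; the filename is just the first fine-tune from the list above):

```python
# Sketch: fetch one of the leaderboard request files locally and dump its contents.
import json
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="open-llm-leaderboard/requests",
    repo_type="dataset",
    filename="mlabonne/Jambatypus-v0.1_eval_request_False_float16_Original.json",
)
with open(path) as f:
    print(json.dumps(json.load(f), indent=2))  # shows the submitted settings (precision, status, ...)
```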
Hi!
Thanks a lot for the exhaustiveness! Apart from the first model, which seems to have an inherent problem (I provided the log for that one below), it would seem that all the other ones are failing because we updated our bitsandbytes version, and they made breaking changes in their lib around how configs are built and launched. We'll update and relaunch everything.
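For context, here is a rough sketch of the kind of loading path I mean, assuming the breaking change is around how quantization configs are constructed and passed (the model name is just one of the failing 4-bit submissions above):

```python
# Sketch only: pass an explicit BitsAndBytesConfig via quantization_config instead of
# bare load_in_4bit/load_in_8bit kwargs, which newer transformers versions deprecate.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    "KnutJaegersberg/jamba-bagel-4bit",  # hypothetical choice: one of the failing 4-bit repos
    quantization_config=bnb_config,
    device_map="auto",
)
```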
CC @SaylorTwift for the backend and @alozowski for the relaunches once it's fixed.
We'll do this ASAP; hopefully we should be good by this evening.
Other failure:
The fast path is not available because on of `(selective_state_update, selective_scan_fn, causal_conv1d_fn, causal_conv1d_update, mamba_inner_fn)` is None. To install follow https://github.com/state-spaces/mamba/#installation and https://github.com/Dao-AILab/causal-conv1d. If you want to use the naive implementation, set `use_mamba_kernels=False` in the model config
[E ProcessGroupNCCL.cpp:474] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=3000000) ran for 3000063 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:474] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=3000000) ran for 3000057 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:474] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=3000000) ran for 3000072 milliseconds before timing out.
...
[E ProcessGroupNCCL.cpp:488] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:494] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:488] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:494] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:488] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:494] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:915] [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=3000000) ran for 3000063 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:915] [Rank 2] NCCL watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=3000000) ran for 3000072 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:915] [Rank 3] NCCL watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=3000000) ran for 3000057 milliseconds before timing out.
[2024-04-19 01:09:34,012] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 2081115 closing signal SIGTERM
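For reference, the naive-implementation fallback that the first warning points at would look roughly like this (just a sketch assuming the model is loaded through transformers; not the exact harness code):

```python
# Sketch: load Jamba with the naive Mamba path so the fused kernels
# (mamba-ssm / causal-conv1d) aren't required; slower, but avoids the fast-path warning.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "ai21labs/Jamba-v0.1",
    torch_dtype=torch.float16,
    use_mamba_kernels=False,  # forwarded to the Jamba config
    device_map="auto",
)
```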
jondurbin/bagel-jamba-v05_eval_request_False_float16_Original.json has finally failed... I'm guessing it's the same error as all the others, but since it lasted so much longer than the rest, I'm not sure.
Re: the "inherent" error on the ai21labs/Jamba-v0.1 eval
The warning message that begins "The fast path is not available" is in the custom code from that model repo, but I couldn't find the message anywhere in the transformers-library implementation of JambaForCausalLM.
Is it possible that the model was somehow erroneously run with remote code?
EDIT: I was wrong about this; the message is in the HF Jamba code, it just didn't come up in GitHub's search tool the first time.
@devingulliver Normally no, as our production environment is pinned rather than tracking new releases - but it's possible we made a mistake. We'll relaunch them all at the same time anyway.
Hi! Our prod was fixed last week and I relaunched all of the above models - feel free to reopen if you need :)
The models are failing again :/
If it's not bitsandbytes, I'm guessing they're all encountering similar failures to the one you posted earlier?
Yep, same error message.
I'm a bit at a loss, and we have more pressing priorities at the moment, so I'll put this on hold - but I'm reopening so we keep track of the issue.
Perhaps installing the fast Mamba kernels would solve the issue? Provided it doesn't affect the reproducibility of the rest of the environment, of course.
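If that route is taken, a quick sanity check inside the eval image could confirm the kernels are actually importable (package import names taken from the installation links in the warning above):

```python
# Check whether the fused Mamba kernel packages are importable; if either is missing,
# transformers falls back to the slow path (or warns, as in the log above).
import importlib.util

for pkg in ("mamba_ssm", "causal_conv1d"):
    found = importlib.util.find_spec(pkg) is not None
    print(f"{pkg}: {'available' if found else 'missing'}")
```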
It's been a week, so I wanted to follow up on this.
Hi @devingulliver!
We're still in the process of investigating the error; we'll come back when we have an answer.
If it helps at all, I've done some pretraining on a small Jamba model; you could test with that: https://hf.co/pszemraj/jamba-900M-v0.13-KIx2
Closed without comment... were you able to resolve the issue?