ALL Jamba models failing

#690
by devingulliver - opened

So far every Jamba model submitted to the leaderboard has failed, including the base model. Any clue what's causing this to happen?

Open LLM Leaderboard org

Hi!
Is the architecture integrated in a stable release of transformers?

Open LLM Leaderboard org

Could you point to some of the request files as indicated in the About, so we can investigate?

Open LLM Leaderboard org

Hi!
Thanks a lot for the exhaustive report - apart from the first model, which seems to have an inherent problem (I've provided its log below), it looks like all the others are failing because we updated our bitsandbytes version and they made breaking changes in how configs are launched. We'll update and relaunch everything.
CC @SaylorTwift for the backend and @alozowski for the relaunches once it's fixed.

We'll do this ASAP - hopefully we'll be good by this evening.


Other failure:

The fast path is not available because on of `(selective_state_update, selective_scan_fn, causal_conv1d_fn, causal_conv1d_update, mamba_inner_fn)` is None. To install follow https://github.com/state-spaces/mamba/#installation and https://github.com/Dao-AILab/causal-conv1d. If you want to use the naive implementation, set `use_mamba_kernels=False` in the model config
[E ProcessGroupNCCL.cpp:474] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=3000000) ran for 3000063 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:474] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=3000000) ran for 3000057 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:474] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=3000000) ran for 3000072 milliseconds before timing out.
... 
[E ProcessGroupNCCL.cpp:488] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:494] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:488] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:494] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:488] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:494] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:915] [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=3000000) ran for 3000063 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:915] [Rank 2] NCCL watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=3000000) ran for 3000072 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:915] [Rank 3] NCCL watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=3000000) ran for 3000057 milliseconds before timing out.
[2024-04-19 01:09:34,012] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 2081115 closing signal SIGTERM
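
For reference, the warning at the top of that log points to a config-level fallback. A minimal sketch of forcing the naive implementation (using the base model id from this thread; the flag is passed through `from_pretrained` as a config override) might look like:

```python
from transformers import AutoModelForCausalLM

# Sketch: force the naive (non-kernel) Mamba path that the warning mentions,
# so the model loads even when mamba-ssm / causal-conv1d are not installed.
model = AutoModelForCausalLM.from_pretrained(
    "ai21labs/Jamba-v0.1",    # example model id from this thread
    use_mamba_kernels=False,  # config override: skip the fast path
)
```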

jondurbin/bagel-jamba-v05_eval_request_False_float16_Original.json has finally failed... I'm guessing it's the same issue as all the others, but since it ran for so long before failing, I'm not sure.

Re: the "inherent" error on the ai21labs/Jamba-v0.1 eval
The warning message that begins "The fast path is not available" is in the custom code from that model repo, but I couldn't find the message anywhere in the transformers-library implementation of JambaForCausalLM.
Is it possible that the model was somehow erroneously run with remote code?
EDIT - I was wrong about this, it's in the HF Jamba code. Just didn't come up in GitHub's search tool the first time.
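
For completeness, a minimal sketch of ruling out remote code on the loading side (assuming the backend goes through `from_pretrained`): with `trust_remote_code` left at its default of `False`, transformers only runs its own `JambaForCausalLM` and never the modeling code shipped in the model repo.

```python
from transformers import AutoModelForCausalLM

# With trust_remote_code=False (the default), transformers refuses to execute
# the repo's custom modeling code and uses its built-in JambaForCausalLM.
model = AutoModelForCausalLM.from_pretrained(
    "ai21labs/Jamba-v0.1",
    trust_remote_code=False,
)
```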

Open LLM Leaderboard org

@devingulliver Normally no, as our production environment is only updated with new releases - but it's possible we made a mistake. We'll relaunch them all at the same time anyway.

Open LLM Leaderboard org

Hi! Our prod was fixed last week and I've relaunched all of the above models - feel free to reopen if you need to :)

clefourrier changed discussion status to closed

The models are failing again :/
If it's not bitsandbytes, I'm guessing they're all encountering similar failures to the one you posted earlier?

Open LLM Leaderboard org
edited Apr 30

Yep, same error message.
I'm a bit at a loss, and we have more pressing priorities at the moment, so I'll put this on hold - but I'm reopening so we keep track of the issue.

clefourrier changed discussion status to open

Perhaps installing the fast Mamba kernels would solve the issue? Provided it doesn't affect the reproducibility of the rest of the environment, of course.
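
A rough post-install sanity check could be something like the sketch below, assuming `transformers` exposes the availability helpers it uses internally for the Mamba/Jamba fast path:

```python
# Sketch of a post-install check: confirm the optional fast-path dependencies
# would actually be picked up before relaunching the evals.
from transformers.utils import is_causal_conv1d_available, is_mamba_ssm_available

if is_mamba_ssm_available() and is_causal_conv1d_available():
    print("Fast Mamba kernels found - Jamba can use the fast path.")
else:
    print("Kernels missing - install mamba-ssm and causal-conv1d, "
          "or set use_mamba_kernels=False in the model config.")
```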

It's been a week - any updates on this?

Open LLM Leaderboard org

Hi @devingulliver !

We're still investigating the error - we'll come back when we have an answer.

If it helps at all, I've done some pretraining on a small Jamba model - you could test with that: https://hf.co/pszemraj/jamba-900M-v0.13-KIx2

clefourrier changed discussion status to closed

Closed without comment... were you able to resolve the issue?
