Error whilst running the dolly repo on Databricks
I'll investigate this tomorrow, but for now, here's the error I'm getting:
OSError: /local_disk0/dolly_training/dolly__2023-04-25T17:01:09 does not appear to have a file named config.json.
It happens at the following step:
from training.generate import generate_response, load_model_tokenizer_for_generate
model, tokenizer = load_model_tokenizer_for_generate(local_output_dir)
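Before calling the loader, it's worth checking whether training actually wrote a model to that directory. A minimal check, using the path from the error message (the loader presumably expects a standard Hugging Face `save_pretrained()` layout, which includes config.json):

```python
import os

# Path taken from the error message above.
local_output_dir = "/local_disk0/dolly_training/dolly__2023-04-25T17:01:09"

# A directory written by save_pretrained() should contain config.json,
# tokenizer files, and model weights; if training crashed before saving,
# the directory may be empty or missing entirely.
if not os.path.isdir(local_output_dir):
    print("output directory was never created")
else:
    print(sorted(os.listdir(local_output_dir)))
    has_config = os.path.isfile(os.path.join(local_output_dir, "config.json"))
    print("config.json present:", has_config)
```

If config.json is absent, the OSError above is a symptom, not the root cause: training never reached the save step.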
Can you show the contents of that directory? Did training complete successfully?
Ah, ok - the training error was below the fold in the previous block, so I didn't spot it.
Here's the entire log from the frame below the error above:
https://pastebin.com/uPwwqJbE
I'm using a g5.12xlarge instance, so 4x A10G GPUs with 24 GB each.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.103.01 Driver Version: 470.103.01 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A10G Off | 00000000:00:1B.0 Off | 0 |
| 0% 28C P0 57W / 300W | 7808MiB / 22731MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A10G Off | 00000000:00:1C.0 Off | 0 |
| 0% 29C P0 61W / 300W | 7926MiB / 22731MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA A10G Off | 00000000:00:1D.0 Off | 0 |
| 0% 29C P0 59W / 300W | 5812MiB / 22731MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA A10G Off | 00000000:00:1E.0 Off | 0 |
| 0% 28C P0 59W / 300W | 5546MiB / 22731MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
I'm trying Pythia 3B (or 2.8B, more specifically). Should I use a larger GPU with more contiguous memory instead, like an A100?
The error you show isn't actually an error; it's a spurious, ignorable message from the notebook (Databricks needs to fix that). Is there more below? Did the training show a problem in the actual cell output? My guess is it didn't finish, but we don't see that output.
4x A10 GPUs are fine for the smallest model, but did you see these instructions? https://github.com/databrickslabs/dolly#a10-gpus-1 You need to set the batch size to 3 or less.
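For reference, the knob the README points at is the per-device batch size; the global batch seen by the optimizer also scales with GPU count and gradient accumulation. A quick sketch of the arithmetic (the accumulation value here is illustrative, not a dolly default):

```python
# Effective global batch = per-device batch * n_gpus * grad accumulation.
per_device_train_batch_size = 3   # README's recommendation for 24 GB A10s
n_gpus = 4                        # g5.12xlarge has 4x A10G
gradient_accumulation_steps = 2   # illustrative; raise this to recover a
                                  # larger effective batch without using
                                  # more GPU memory per step
effective_batch = per_device_train_batch_size * n_gpus * gradient_accumulation_steps
print(effective_batch)  # 24
```

So dropping the per-device batch to 3 doesn't have to shrink the effective batch, only the per-step memory footprint.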
Thanks for the guidance. I made the change in this PR, and it worked: https://github.com/databrickslabs/dolly/pull/135
My thinking is that the missing datetime import resulted in the timestamped output directory not being created, hence my error.
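That failure mode is consistent with how a timestamped directory like dolly__2023-04-25T17:01:09 would typically be built. A hypothetical sketch (names are illustrative, not the repo's actual code) shows why a missing datetime import would stop the directory from ever being created:

```python
import os
import tempfile
from datetime import datetime

# Illustrative reconstruction, not the dolly repo's actual code: if
# `datetime` was never imported, the next line raises NameError and the
# cell aborts before os.makedirs() ever runs, so no directory exists.
timestamp = datetime.now().strftime("%Y-%m-%dT%H:%M:%S")

# On Databricks the base would be /local_disk0/dolly_training; a temp
# directory is used here so the sketch runs anywhere.
base = tempfile.mkdtemp()
local_output_dir = os.path.join(base, f"dolly__{timestamp}")
os.makedirs(local_output_dir, exist_ok=True)
print(local_output_dir)
```

Any later code that assumes the directory exists (such as loading the saved model from it) then fails with a misleading secondary error like the OSError above.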
I successfully trained a 3b model on Databricks with the above GPU configuration in 5.6 hours.
EDIT: the PR is moot. One of my cells didn't run, so datetime wasn't imported in an earlier cell.