Error whilst running the dolly repo on Databricks
I'll investigate this tomorrow, but for now, here's the error I'm getting:
OSError: /local_disk0/dolly_training/dolly__2023-04-25T17:01:09 does not appear to have a file named config.json.
It happens at the following step:
from training.generate import generate_response, load_model_tokenizer_for_generate
model, tokenizer = load_model_tokenizer_for_generate(local_output_dir)
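Before calling the loader, it's worth checking whether training actually wrote a model to that directory. A minimal check, using the path from the error message (the loader presumably expects a standard Hugging Face `save_pretrained()` layout, which includes config.json):

```python
import os

# Path taken from the error message above.
local_output_dir = "/local_disk0/dolly_training/dolly__2023-04-25T17:01:09"

# A directory written by save_pretrained() should contain config.json,
# tokenizer files, and model weights; if training crashed before saving,
# the directory may be empty or missing entirely.
if not os.path.isdir(local_output_dir):
    print("output directory was never created")
else:
    print(sorted(os.listdir(local_output_dir)))
    has_config = os.path.isfile(os.path.join(local_output_dir, "config.json"))
    print("config.json present:", has_config)
```

If config.json is absent, the OSError above is a symptom, not the root cause: training never reached the save step.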
Can you show the contents of that directory? Did training complete successfully?
Ah, ok - the training error was below the fold in the previous block, so I didn't spot it.
Here's the entire log from the frame below the error above:
https://pastebin.com/uPwwqJbE
I'm using a g5.12xlarge instance, so 4x A10G GPUs with 24 GB each.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.103.01 Driver Version: 470.103.01 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A10G Off | 00000000:00:1B.0 Off | 0 |
| 0% 28C P0 57W / 300W | 7808MiB / 22731MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A10G Off | 00000000:00:1C.0 Off | 0 |
| 0% 29C P0 61W / 300W | 7926MiB / 22731MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA A10G Off | 00000000:00:1D.0 Off | 0 |
| 0% 29C P0 59W / 300W | 5812MiB / 22731MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA A10G Off | 00000000:00:1E.0 Off | 0 |
| 0% 28C P0 59W / 300W | 5546MiB / 22731MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
I'm trying Pythia 3B (or 2.8B, more specifically). Should I use a larger GPU with more contiguous memory instead, like an A100?
The error you show isn't actually an error; it's a spurious, ignorable message from the notebook (Databricks needs to fix that). Is there more below? Did the training show a problem in the actual cell output? My guess is it didn't finish, but we don't see that output.
4x A10 GPUs are fine for the smallest model, but did you see these instructions? https://github.com/databrickslabs/dolly#a10-gpus-1 You need to set the batch size to 3 or less.
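For reference, the knob the README points at is the per-device batch size; the global batch seen by the optimizer also scales with GPU count and gradient accumulation. A quick sketch of the arithmetic (the accumulation value here is illustrative, not a dolly default):

```python
# Effective global batch = per-device batch * n_gpus * grad accumulation.
per_device_train_batch_size = 3   # README's recommendation for 24 GB A10s
n_gpus = 4                        # g5.12xlarge has 4x A10G
gradient_accumulation_steps = 2   # illustrative; raise this to recover a
                                  # larger effective batch without using
                                  # more GPU memory per step
effective_batch = per_device_train_batch_size * n_gpus * gradient_accumulation_steps
print(effective_batch)  # 24
```

So dropping the per-device batch to 3 doesn't have to shrink the effective batch, only the per-step memory footprint.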
Thanks for the guidance. I made the change in this PR, and it worked: https://github.com/databrickslabs/dolly/pull/135
My thinking is that the missing datetime import resulted in the timestamped output directory not being created, hence my error.
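That failure mode is consistent with how a timestamped directory like dolly__2023-04-25T17:01:09 would typically be built. A hypothetical sketch (names are illustrative, not the repo's actual code) shows why a missing datetime import would stop the directory from ever being created:

```python
import os
import tempfile
from datetime import datetime

# Illustrative reconstruction, not the dolly repo's actual code: if
# `datetime` was never imported, the next line raises NameError and the
# cell aborts before os.makedirs() ever runs, so no directory exists.
timestamp = datetime.now().strftime("%Y-%m-%dT%H:%M:%S")

# On Databricks the base would be /local_disk0/dolly_training; a temp
# directory is used here so the sketch runs anywhere.
base = tempfile.mkdtemp()
local_output_dir = os.path.join(base, f"dolly__{timestamp}")
os.makedirs(local_output_dir, exist_ok=True)
print(local_output_dir)
```

Any later code that assumes the directory exists (such as loading the saved model from it) then fails with a misleading secondary error like the OSError above.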
I successfully trained a 3b model on Databricks with the above GPU configuration in 5.6 hours.
EDIT: the PR is moot. One of my cells didn't run, so datetime wasn't imported in an earlier cell.