Trying to load the model on 8x A10 in 4-bit gives this error:
ValueError:
Some modules are dispatched on the CPU or the disk. Make sure you have enough GPU RAM to fit
the quantized model. If you want to dispatch the model on the CPU or the disk while keeping
these modules in 32-bit, you need to set load_in_8bit_fp32_cpu_offload=True and pass a custom
device_map to from_pretrained. Check
https://huggingface.co/docs/transformers/main/en/main_classes/quantization#offload-between-cpu-and-gpu
for more details.
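For reference, the offload route the error message points at looks roughly like the sketch below. This is only an illustration based on the linked docs, not what was ultimately used here: in recent transformers versions the flag lives on BitsAndBytesConfig as llm_int8_enable_fp32_cpu_offload (the error's load_in_8bit_fp32_cpu_offload is the older spelling), and the device map shown is a hypothetical one that keeps only lm_head on the CPU.

from modeling_grok import GrokForCausalLM
from transformers import BitsAndBytesConfig

# Quantize on GPU, but allow CPU-offloaded modules to stay in 32-bit.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    llm_int8_enable_fp32_cpu_offload=True,
)

# Hypothetical split: everything under "transformer" on GPU 0, lm_head on CPU.
offload_device_map = {
    "transformer": 0,
    "lm_head": "cpu",
}

model = GrokForCausalLM.from_pretrained(
    "/path/to/grok-1-hf",              # placeholder path
    quantization_config=quant_config,
    device_map=offload_device_map,
)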
@nbilla What is your code for loading the model? Don't set device_map="auto" or low_cpu_mem_usage, since bitsandbytes already sets device_map="auto" automatically.
It would help if I could see your loading code.
@nbilla
Just tried loading it on 8x A100 80GB and it was using 20GB of VRAM on each GPU.
In your case it may be that device_map="auto" miscalculated the memory usage and placed some of the modules on the CPU. Could you try again with this device map?
{'transformer.in_out_embed': 0, 'lm_head': 0, 'transformer.decoder_layer.0': 0, 'transformer.decoder_layer.1': 0, 'transformer.decoder_layer.2': 0, 'transformer.decoder_layer.3': 0, 'transformer.decoder_layer.4': 0, 'transformer.decoder_layer.5': 0,
 'transformer.decoder_layer.6': 1, 'transformer.decoder_layer.7': 1, 'transformer.decoder_layer.8': 1, 'transformer.decoder_layer.9': 1, 'transformer.decoder_layer.10': 1, 'transformer.decoder_layer.11': 1, 'transformer.decoder_layer.12': 1, 'transformer.decoder_layer.13': 1,
 'transformer.decoder_layer.14': 2, 'transformer.decoder_layer.15': 2, 'transformer.decoder_layer.16': 2, 'transformer.decoder_layer.17': 2, 'transformer.decoder_layer.18': 2, 'transformer.decoder_layer.19': 2, 'transformer.decoder_layer.20': 2, 'transformer.decoder_layer.21': 2,
 'transformer.decoder_layer.22': 3, 'transformer.decoder_layer.23': 3, 'transformer.decoder_layer.24': 3, 'transformer.decoder_layer.25': 3, 'transformer.decoder_layer.26': 3, 'transformer.decoder_layer.27': 3, 'transformer.decoder_layer.28': 3, 'transformer.decoder_layer.29': 3,
 'transformer.decoder_layer.30': 4, 'transformer.decoder_layer.31': 4, 'transformer.decoder_layer.32': 4, 'transformer.decoder_layer.33': 4, 'transformer.decoder_layer.34': 4, 'transformer.decoder_layer.35': 4, 'transformer.decoder_layer.36': 4, 'transformer.decoder_layer.37': 4,
 'transformer.decoder_layer.38': 5, 'transformer.decoder_layer.39': 5, 'transformer.decoder_layer.40': 5, 'transformer.decoder_layer.41': 5, 'transformer.decoder_layer.42': 5, 'transformer.decoder_layer.43': 5, 'transformer.decoder_layer.44': 5, 'transformer.decoder_layer.45': 5,
 'transformer.decoder_layer.46': 6, 'transformer.decoder_layer.47': 6, 'transformer.decoder_layer.48': 6, 'transformer.decoder_layer.49': 6, 'transformer.decoder_layer.50': 6, 'transformer.decoder_layer.51': 6, 'transformer.decoder_layer.52': 6, 'transformer.decoder_layer.53': 6,
 'transformer.decoder_layer.54': 7, 'transformer.decoder_layer.55': 7, 'transformer.decoder_layer.56': 7, 'transformer.decoder_layer.57': 7, 'transformer.decoder_layer.58': 7, 'transformer.decoder_layer.59': 7, 'transformer.decoder_layer.60': 7, 'transformer.decoder_layer.61': 7, 'transformer.decoder_layer.62': 7, 'transformer.decoder_layer.63': 7,
 'transformer.rms_norm': 7}
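For concreteness, a minimal sketch of passing that map explicitly, reusing the loading pattern that appears later in this thread (the model path is a placeholder and the dict is abbreviated here):

from modeling_grok import GrokForCausalLM
from transformers import BitsAndBytesConfig

# Use the full per-layer placement posted above as the device map.
custom_device_map = {
    'transformer.in_out_embed': 0,
    'lm_head': 0,
    # ... plus all 'transformer.decoder_layer.N' and 'transformer.rms_norm'
    # entries exactly as listed above ...
}

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
)

model = GrokForCausalLM.from_pretrained(
    "/path/to/grok-1-hf",            # placeholder path
    quantization_config=quant_config,
    device_map=custom_device_map,    # explicit map instead of device_map="auto"
)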
@v2ray
If it is using 20GB of VRAM on each GPU, shouldn't it be possible to load it on 4x A100 80GB? For me, however, it crashes with a CUDA out-of-memory error after loading about 50% of the weights in 4-bit.
The code I am using:
from modeling_grok import GrokForCausalLM
from transformers import BitsAndBytesConfig
from accelerate import Accelerator

double_quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
)

model = GrokForCausalLM.from_pretrained(
    "/path/to/grok-1-hf",
    quantization_config=double_quant_config,
    device_map={"": Accelerator().process_index},
)
I also tried your device map, with the same result.
@ruslandev Why are you using accelerate? Can you share your accelerate config?
Eventually I did it without accelerate: I loaded the model on 4x A100, and about 50% of the VRAM was used on each GPU.
With accelerate I tried to do distributed inference, but it looks like it was loading the entire model on each GPU instead of splitting it between processes (see the sketch after the config below).
Accelerate config:
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 4
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
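For anyone hitting the same thing, here is a minimal sketch of the non-accelerate route described above, assuming a single process with device_map="auto" so the quantized layers are sharded across all visible GPUs instead of one full copy being loaded per process (the model path is a placeholder):

from modeling_grok import GrokForCausalLM
from transformers import BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
)

# "auto" shards the layers across every GPU visible to this one process,
# rather than pinning a whole copy of the model to one process's GPU.
model = GrokForCausalLM.from_pretrained(
    "/path/to/grok-1-hf",            # placeholder path
    quantization_config=quant_config,
    device_map="auto",
)

Run it with plain python rather than accelerate launch, so only a single process loads the model.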