CUDA out of memory on RTX A5000 inference.

#57
by RoberyanL - opened

I am running the model on an RTX A5000 with 24 GB of memory, which should be enough, yet when I run the code it still throws a CUDA out-of-memory error. How should I fix this?

from transformers import AutoTokenizer, AutoModelForCausalLM, AutoConfig, pipeline
import torch

# Function to clean CUDA memory
def clean_cuda_memory():
    torch.cuda.empty_cache()
    torch.cuda.ipc_collect()

# Clean CUDA memory before starting
clean_cuda_memory()

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load the model configuration and modify the rope_scaling parameter
model_config = AutoConfig.from_pretrained(model_id)
model_config.rope_scaling = {"type": "linear", "factor": 8.0}  # Adjust to the required format

# Load the model with the modified configuration
model = AutoModelForCausalLM.from_pretrained(model_id, config=model_config, torch_dtype=torch.float32)

# Initialize the text generation pipeline
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    device=1
)

system_role = '''
MY DESIGN FOR SYSTEM ROLE
'''

user_input = '''
MY INPUT
'''

# Define the messages
messages = [
    {"role": "system", "content": system_role},
    {"role": "user", "content": user_input},
]

# Concatenate messages into a single prompt
prompt = ""
for message in messages:
    if message["role"] == "system":
        prompt += f"System: {message['content']}\n"
    elif message["role"] == "user":
        prompt += f"User: {message['content']}\n"

# Generate text based on the prompt
output = pipe(
    prompt,
    max_new_tokens=512,
    eos_token_id=tokenizer.eos_token_id,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)

# Extract the generated text
generated_text = output[0]["generated_text"]

# Print the response
print(generated_text)

The error message:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 224.00 MiB (GPU 1; 23.69 GiB total capacity; 22.86 GiB already allocated; 128.06 MiB free; 22.87 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Maybe try running the model in float16 or bfloat16, if you aren't already.
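Something like this should fit the 8B model on a 24 GB card. Just a sketch: device_map="auto" requires accelerate to be installed and replaces the device= argument you currently pass to pipeline.

from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
import torch

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)

# bfloat16 halves the weight footprint versus float32 (~16 GB instead of ~32 GB
# for the 8B model), which is what currently overflows the 24 GB card.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",  # lets accelerate place the weights; drop device= from pipeline()
)

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)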

Thanks bro, this helps. But I see that the 8B model can't generate well given long task instructions; does anyone have suggestions for better practices? I also see that the 70B model needs about 70 GB of VRAM to run in FP8, or 35 GB in INT4. Can I use four 24 GB GPUs for the FP8 version?

@RoberyanL No, joining 4 GPUs doesn't work any more like it used to; the memory is not shared among them. If you can handle the slow inference speeds, pick up 96 GB of RAM and run it from the CPU.

You can also consider using the Inference API to call the model without having to download it.

from huggingface_hub import InferenceClient

client = InferenceClient(
    "meta-llama/Meta-Llama-3.1-8B-Instruct",
    token="hf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx",
)

for message in client.chat_completion(
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    max_tokens=500,
    stream=True,
):
    print(message.choices[0].delta.content, end="")

Or if you are running it locally, use TGI
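For example, assuming you already have a TGI server running locally on port 8080, the same InferenceClient can point at its URL instead of a Hub model id:

from huggingface_hub import InferenceClient

# Point the client at a local TGI endpoint instead of the hosted Inference API.
# Assumes text-generation-inference is already serving the model on port 8080.
client = InferenceClient("http://localhost:8080")

for message in client.chat_completion(
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    max_tokens=500,
    stream=True,
):
    print(message.choices[0].delta.content, end="")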


@info-int do you have any links where I could read about this? I am also having the same problem @RoberyanL has had.

This link: https://huggingface.co/docs/accelerate/en/usage_guides/big_modeling suggests you can run a model that doesn't fit completely in CUDA memory.
And this link: https://huggingface.co/docs/accelerate/en/usage_guides/distributed_inference seems to suggest that with pipeline parallelism you can split a model across multiple GPUs.
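For what it's worth, here is a minimal sketch of what those docs describe, assuming accelerate and bitsandbytes are installed. It uses INT4 rather than FP8 (that's what bitsandbytes provides), and note the layers are split across the GPUs rather than the memory being pooled, so this is model parallelism, not a single big GPU.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Meta-Llama-3.1-70B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)

# INT4 weights are roughly 35 GB for the 70B model; device_map="auto" spreads
# the layers over every visible GPU, so 4 x 24 GB cards have room to spare.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
    device_map="auto",
)

inputs = tokenizer("What is the capital of France?", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))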

You can use TGI, which will shard the model across multiple devices and will be far faster than anything you can do with the transformers library.
