Request failed during generation: Server error: CUDA out of memory
Hi, I deployed an inference endpoint with 4x NVIDIA Tesla T4 GPUs, but I keep running into CUDA out-of-memory errors.
Any idea what might be going wrong?
My curl request:
curl 'https://rl6864*****82.us-east-1.aws.endpoints.huggingface.cloud/models/mistralai/Mistral-7B-Instruct-v0.1' \
-X POST \
-H 'Content-Type: application/json' \
-H 'Authorization: Bearer hf_IHtiMkkdFxaRX*****KxLtbEfISaXJY' \
-d '{"inputs":"<s>[INST] What is your favourite condiment? [/INST]\\nMy favorite condiment is ketchup. It'\''s versatile, tasty, and goes well with a variety of foods.</s>\\n[INST] And what do you think about it? [/INST]","parameters": { "max_new_tokens": 100, "return_full_text": false }}'
2024/02/22 11:59:09 ~ {"timestamp":"2024-02-22T10:59:09.500434Z","level":"ERROR","message":"Request failed during generation: Server error: CUDA out of memory. Tried to allocate 24.00 MiB. GPU 0 has a total capacty of 14.58 GiB of which 17.56 MiB is free. Process 107364 has 14.56 GiB memory in use. Of the allocated memory 14.18 GiB is allocated by PyTorch, and 161.77 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF","target":"text_generation_router::infer","filename":"router/src/infer.rs","line_number":705,"span":{"name":"send_error"},"spans":[{"compute_type":"Extension(ComputeType("4-tesla-t4"))","default_return_full_text":"false","name":"compat_generate"},{"parameters":"GenerateParameters { best_of: None, temperature: None, repetition_penalty: None, frequency_penalty: None, top_k: None, top_p: None, typical_p: None, do_sample: false, max_new_tokens: Some(100), return_full_text: Some(false), stop: [], truncate: None, watermark: false, details: false, decoder_input_details: false, seed: None, top_n_tokens: None, grammar: None }","name":"generate"},{"name":"generate"},{"name":"generate_stream"},{"name":"infer"},{"name":"send_error"}]}
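For context, here is a rough back-of-the-envelope estimate (a minimal Python sketch; my assumption is ~7.24B parameters in fp16 and it ignores the KV cache, activations and CUDA context). If it is roughly right, the weights alone nearly fill the 14.58 GiB that the log reports for GPU 0:

# Rough memory estimate for Mistral-7B weights in fp16.
# Assumptions: ~7.24e9 parameters, 2 bytes each; KV cache, activations
# and the CUDA context are ignored, so real usage is higher.
params = 7.24e9          # approximate parameter count of Mistral-7B
bytes_per_param = 2      # fp16 / bf16
weights_gib = params * bytes_per_param / 1024**3
t4_gib = 14.58           # total capacity reported for GPU 0 in the log above
print(f"fp16 weights: ~{weights_gib:.2f} GiB of {t4_gib} GiB on one T4")
# -> fp16 weights: ~13.49 GiB of 14.58 GiB on one T4

If that estimate holds, there is almost no headroom left on GPU 0 for the KV cache, which would explain why even a 24 MiB allocation fails. The log only mentions GPU 0, so I am also wondering whether the model is actually being sharded across all 4 T4s.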