Out of memory error on multiple runs
I'm trying to run a simple script to generate multiple images:
import torch
from diffusers import StableDiffusion3Pipeline

pipe = StableDiffusion3Pipeline.from_pretrained("stabilityai/stable-diffusion-3-medium-diffusers", torch_dtype=torch.float16)
pipe = pipe.to("cuda")
for i in range(10):
    image = pipe(
        my_prompt,
        negative_prompt="",
        num_inference_steps=28,
        guidance_scale=7.0,
    ).images[0]
    image.save(f"test{i}.png")
However, after a few iterations I get an out-of-memory error:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 32.00 MiB. GPU
Am I doing something wrong?
Moreover, the same code soon runs out of memory if I try to run it on an mps device on my MacBook M3.
A few ideas. First, add
del image
at the end of each loop iteration. Without going into details I don't fully understand, Python can't release an object's memory while a variable still refers to it, so deleting the variable lets the object be freed sooner. I suspect that on each iteration you are holding the entire model in RAM.
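To make the reference behaviour concrete, here is a small sketch that uses only the standard library. FakeImage and run_loop are hypothetical names standing in for the generated image and the generation loop; nothing here touches the GPU. It counts how many loop objects are still alive at the end, with and without del, assuming CPython's reference counting.

```python
import gc
import weakref


class FakeImage:
    """Hypothetical stand-in for a large object such as a generated image."""


def run_loop(use_del, iterations=3):
    """Create one object per iteration and return how many are still alive."""
    refs = []
    for _ in range(iterations):
        image = FakeImage()
        # A weak reference lets us observe the object without keeping it alive.
        refs.append(weakref.ref(image))
        if use_del:
            del image  # drop the only strong reference right away
    gc.collect()
    return sum(1 for ref in refs if ref() is not None)


alive_without_del = run_loop(use_del=False)
alive_with_del = run_loop(use_del=True)
print(alive_without_del, alive_with_del)
```

Without del, only the object from the last iteration is still referenced when the loop ends, because rebinding image releases the previous one. So del image mostly helps by releasing the last result earlier; it will not fix a leak that lives somewhere else.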
If that doesn't work, make a garbage collection function and call it after the del image.
Here is mine with print statements about memory allocated. I haven't reviewed this in a long time, so a lot of it is excess print statements.
import gc
import time

import torch

def reclaim_mem():
    # Report usage before cleanup.
    allocated_memory = torch.cuda.memory_allocated()
    cached_memory = torch.cuda.memory_reserved()
    print(f"Memory Allocated: {allocated_memory / 1024**2:.2f} MiB")
    print(f"Memory Cached: {cached_memory / 1024**2:.2f} MiB")
    # Collect unreachable Python objects first so their tensors can be freed.
    gc.collect()
    torch.cuda.empty_cache()
    torch.cuda.ipc_collect()
    torch.cuda.synchronize()
    time.sleep(0.01)
    # Query again so the second report shows the values after cleanup.
    allocated_memory = torch.cuda.memory_allocated()
    cached_memory = torch.cuda.memory_reserved()
    print(f"Memory Allocated after del: {allocated_memory / 1024**2:.2f} MiB")
    print(f"Memory Cached after del: {cached_memory / 1024**2:.2f} MiB")
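Since the question also mentions an mps device, here is a rough sketch of a cleanup helper that is not tied to CUDA. The name free_accelerator_cache and its return values are my own invention for illustration, and the import guard is only there so the snippet runs on a machine without PyTorch. It assumes a PyTorch version that provides torch.mps.empty_cache (2.0 or later, as far as I know).

```python
import gc

try:
    import torch
except ImportError:  # lets the sketch run where PyTorch is not installed
    torch = None


def free_accelerator_cache():
    """Collect garbage, then empty the cache of whichever backend is present.

    Returns "cuda", "mps", or "cpu" to say which path was taken.
    """
    gc.collect()
    if torch is None:
        return "cpu"
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        return "cuda"
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available() and hasattr(torch, "mps"):
        torch.mps.empty_cache()
        return "mps"
    return "cpu"


backend = free_accelerator_cache()
print(backend)
```

You would call it right after del image inside the loop. Emptying the cache only returns memory that PyTorch has already freed internally, so it will not help if something is still holding references to tensors.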
Finally, if the easy fixes above don't work, read up on TCMalloc: https://github.com/google/tcmalloc
Or just try this:
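import ctypes

# Python often doesn't release freed system memory back to the OS, which was leading to the system locking up.
# Linux/glibc only: libc.so.6 does not exist on macOS or Windows.
libc = ctypes.cdll.LoadLibrary("libc.so.6")
M_MMAP_THRESHOLD = -3  # mallopt parameter number from glibc's <malloc.h>
# Set the malloc mmap threshold to 1 MiB so larger allocations use mmap and go back to the OS when freed.
libc.mallopt(M_MMAP_THRESHOLD, 2**20)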
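Note that the snippet above only works on Linux with glibc, so it will fail on the MacBook mentioned in the question. A guarded variant could look like the sketch below; set_mmap_threshold is a hypothetical helper name, and treating a return value of 1 from mallopt as success follows the glibc documentation.

```python
import ctypes
import sys

M_MMAP_THRESHOLD = -3  # value of the constant in glibc's <malloc.h>


def set_mmap_threshold(threshold_bytes=2**20):
    """Try to set glibc's mmap threshold; return True only on success."""
    if not sys.platform.startswith("linux"):
        return False  # macOS and Windows do not ship libc.so.6
    try:
        libc = ctypes.CDLL("libc.so.6")
        mallopt = libc.mallopt
    except (OSError, AttributeError):
        return False  # non-glibc Linux, for example musl
    mallopt.argtypes = [ctypes.c_int, ctypes.c_int]
    mallopt.restype = ctypes.c_int
    # glibc's mallopt returns 1 on success and 0 on error.
    return mallopt(M_MMAP_THRESHOLD, threshold_bytes) == 1


applied = set_mmap_threshold()
print(applied)
```

This only affects host RAM handled by malloc. It does nothing for GPU memory, so it is worth trying only if system memory is what keeps growing.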