requirements?

#7
by vladmandic

What are the requirements for this model?

  • Running I2VGenXLPipeline in FP16 produces NaNs.
  • Running in FP32 works, but even at 640x352 (a quarter of the model's native resolution) and with decode_chunk_size=1, it pegs the GPU at ~24 GB of VRAM.

cc: @sayakpaul @patrickvonplaten
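
For reference, a minimal way to localize where FP16 breaks down is to register forward hooks on the UNet's submodules and report the first module whose output goes non-finite. This is an untested sketch: the probe logic itself is an assumption, only pipeline.unet (on a loaded I2VGenXLPipeline) is standard diffusers API.

import torch

seen = set()

def nan_probe(module, inputs, output):
    # Hypothetical probe: flag the first module class whose output
    # contains NaN/Inf during a forward pass.
    out = output[0] if isinstance(output, tuple) else output
    if torch.is_tensor(out) and not torch.isfinite(out).all():
        name = module.__class__.__name__
        if name not in seen:
            seen.add(name)
            print(f"first non-finite output in {name}")

for module in pipeline.unet.modules():
    module.register_forward_hook(nan_probe)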

Running I2VGenXLPipeline in FP16 produces NaNs.

Not sure if that's the case. More details below.

Running in FP32 works, but even at 640x352 (a quarter of the model's native resolution) and with decode_chunk_size=1, it pegs the GPU at ~24 GB of VRAM.

Again, not sure if that's the case. More details below.

My script:

import torch
from diffusers.utils import load_image, export_to_gif
from diffusers import I2VGenXLPipeline

def bytes_to_giga_bytes(num_bytes):
    # Convert a raw byte count into a human-readable GB string.
    return f"{(num_bytes / 1024 / 1024 / 1024):.3f}"

# Load the FP16 variant of the checkpoint and offload idle sub-models
# to the CPU to reduce peak VRAM usage.
pipeline = I2VGenXLPipeline.from_pretrained(
    "ali-vilab/i2vgen-xl", torch_dtype=torch.float16, variant="fp16"
)
pipeline.enable_model_cpu_offload()

image_url = "https://github.com/ali-vilab/i2vgen-xl/blob/main/data/test_images/img_0009.png?raw=true"
image = load_image(image_url).convert("RGB")

prompt = "Papers were floating in the air on a table in the library"
negative_prompt = "Distorted, discontinuous, Ugly, blurry, low resolution, motionless, static, disfigured, disconnected limbs, Ugly faces, incomplete arms"
generator = torch.manual_seed(8888)

# Generate the video; .frames is a list of videos, take the first.
frames = pipeline(
    prompt=prompt,
    image=image,
    negative_prompt=negative_prompt,
    generator=generator,
).frames[0]
video_path = export_to_gif(frames, "i2v.gif")

# Report the peak GPU memory allocated during the run.
memory = bytes_to_giga_bytes(torch.cuda.max_memory_allocated())
print(f"Memory: {memory}GB")

Prints 11.714 GB on an RTX 4090.
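
If the peak needs to come down further, the decode_chunk_size argument mentioned above is accepted by the pipeline call; here is a sketch of the same call with frame-by-frame VAE decoding (untested, otherwise identical to the script above):

frames = pipeline(
    prompt=prompt,
    image=image,
    negative_prompt=negative_prompt,
    generator=generator,
    decode_chunk_size=1,  # decode one frame at a time to lower peak VAE memory
).frames[0]

This trades some decoding speed for a lower memory peak.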

And the following is the output: https://huggingface.co/datasets/sayakpaul/sample-datasets/blob/main/i2v.gif
