External CLIP vs. internal; VRAM utilization question
The internal CLIP was baked at fp8 to reduce computational requirements for people with low VRAM, and the model's weights were merged in such a way as to allow generations at 4 steps.
As far as I know you cannot speed it up further; it only takes 4 seconds to generate 1024 x 1024 on 24GB VRAM.
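For anyone curious what a merge like that looks like mechanically: 4-step-capable Flux merges are usually made by blending flux1-dev with flux1-schnell (or a distillation LoRA). The exact formula used for this model isn't shared here, so treat the following as a rough sketch; the dev/schnell pairing and the 0.5 ratio are assumptions, not the author's recipe.

```python
# A plain linear merge of two Flux checkpoints -- a sketch, NOT the exact
# formula used for this model. The dev/schnell pairing and the 0.5 ratio
# are assumptions for illustration.
from safetensors.torch import load_file, save_file

ALPHA = 0.5  # blend ratio (hypothetical)

dev = load_file("flux1-dev.safetensors")
schnell = load_file("flux1-schnell.safetensors")

merged = {}
for key, w in dev.items():
    # Average matching tensors in fp32, then cast back to the source dtype.
    merged[key] = (ALPHA * w.float() + (1 - ALPHA) * schnell[key].float()).to(w.dtype)

save_file(merged, "flux1-merged.safetensors")
```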
Hate to post in a closed topic, but I'm not sure the T5 weights in this checkpoint are actually FP8. Transformer + CLIP + T5 + VAE checkpoints for Flux that are fully FP8 should be ~17GB; the extra ~4GB makes me think the T5 was saved as FP16 while the transformer was saved as FP8. See https://huggingface.co/Comfy-Org/flux1-dev/tree/main as an example.
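The ~17GB figure follows from rough parameter counts times bytes per weight (fp8 = 1 byte, fp16 = 2 bytes). The counts below are public approximations, not exact numbers from this checkpoint:

```python
# Back-of-the-envelope checkpoint sizes. Parameter counts are rough,
# public approximations, not numbers read from this specific file.
params = {
    "flux_transformer": 11.9e9,  # ~12B
    "t5xxl": 4.7e9,              # ~4.7B
    "clip_l": 0.12e9,
    "vae": 0.08e9,
}

def size_gb(n_params, bytes_per_weight):
    return n_params * bytes_per_weight / 1e9

# Everything at fp8 (1 byte per weight):
all_fp8 = sum(size_gb(n, 1) for n in params.values())

# Same, but T5 left at fp16: one extra byte per T5 weight.
t5_fp16 = all_fp8 + size_gb(params["t5xxl"], 1)

print(f"all fp8:    ~{all_fp8:.1f} GB")  # ~16.8 GB
print(f"T5 at fp16: ~{t5_fp16:.1f} GB")  # ~21.5 GB, i.e. about +4.7 GB
```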
OK, that corroborates my findings that generations were the same with the external fp16 CLIP and the internal one.
I'm confident I baked the t5xxl_fp8_e4m3fn text encoder into this model merge.
The resulting file size will depend on the type of merge (the formula) and the quant map.
Every merge will have a different formula and file size; I used the quantization formula provided by Kijai and Comfy-Org's tip on the 4-step merge.
Prior to merging, I quantized the models using Kijai's formula, which resulted in two 12 GB files.
Additionally, I baked the VAE and CLIP fp8 components into the model, which add an extra 4.5 GB + 319 MB.
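For anyone wanting to reproduce the general idea: the simplest way to get an fp8_e4m3fn checkpoint is a direct dtype cast over the state dict. To be clear, this is a naive sketch of that step, not Kijai's actual formula (which handles scaling more carefully):

```python
# Naive fp8 re-save of a checkpoint: a sketch of the general idea, not
# Kijai's actual script. Needs PyTorch >= 2.1 for the float8 dtypes.
import torch
from safetensors.torch import load_file, save_file

state = load_file("flux1-dev-fp16.safetensors")

out = {}
for key, w in state.items():
    # Cast only floating-point tensors; a direct cast clips anything
    # outside the e4m3fn range (roughly +/-448) and drops precision.
    if w.is_floating_point():
        out[key] = w.to(torch.float8_e4m3fn)
    else:
        out[key] = w

save_file(out, "flux1-dev-fp8_e4m3fn.safetensors")
```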
The T5 model in fp16 versus fp8 doesn't show a significant difference in output, so many people prefer the FP8 model for its greater computational efficiency during generation.
You could also run a test on the model to observe VRAM usage during inference, to assess its efficiency on an 8 GB card versus using an external fp16 T5 CLIP.
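Something like this works for measuring the peak; run_generation is a placeholder for whatever inference call you actually use (ComfyUI workflow, diffusers pipeline, etc.):

```python
# Quick peak-VRAM probe. run_generation() is a hypothetical stand-in for
# whatever inference call you actually use.
import torch

torch.cuda.reset_peak_memory_stats()
run_generation()  # e.g. one 1024x1024, 4-step generation

peak = torch.cuda.max_memory_allocated() / 1024**3
print(f"peak VRAM: {peak:.2f} GiB")
# Note: this only counts PyTorch's own allocations; check nvidia-smi for
# the process total if you want to compare against an 8 GB card's limit.
```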
Ah, I think I get it now. I didn't notice the merging tip part before.
I did not run that VRAM test.
@drbaph
Just a heads up, we need that merge in NF4... like... ASAP.
https://www.reddit.com/r/StableDiffusion/comments/1epcdov/bitsandbytes_guidelines_and_flux_6gb8gb_vram/
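For context, NF4 here refers to bitsandbytes' 4-bit NormalFloat quantization from the linked thread. A minimal round trip on a single tensor looks like this; quantizing a whole Flux checkpoint is the same call applied per weight tensor:

```python
# Minimal NF4 round trip with bitsandbytes (requires a CUDA build).
import torch
import bitsandbytes.functional as bnb

w = torch.randn(4096, 4096, dtype=torch.float16, device="cuda")

# 4-bit NormalFloat: two weights packed per byte, plus small per-block stats.
w_nf4, quant_state = bnb.quantize_4bit(w, quant_type="nf4")
w_back = bnb.dequantize_4bit(w_nf4, quant_state)

print(f"fp16 bytes: {w.numel() * 2}, nf4 bytes: {w_nf4.numel()}")
print(f"max abs error: {(w - w_back).abs().max().item():.4f}")
```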