Issue with CLIP2
There seems to be an inconsistency in the text encoder 2 weights. In sd_xl_base_1.0.safetensors, conditioner.embedders.1.model.text_projection is:
tensor([[-0.0348, 0.0218, -0.0025, ..., -0.0034, -0.0138, -0.0075],
[-0.0037, 0.0202, -0.0014, ..., -0.0082, -0.0011, 0.0170],
[ 0.0261, -0.0221, -0.0099, ..., -0.0233, -0.0178, -0.0061],
...,
[ 0.0042, -0.0068, 0.0086, ..., 0.0008, -0.0030, -0.0042],
[-0.0236, 0.0094, 0.0040, ..., -0.0098, 0.0330, 0.0147],
[ 0.0084, -0.0021, -0.0049, ..., 0.0026, -0.0055, -0.0294]],
dtype=torch.float16)
However, if I open text_encoder_2/model.fp16.safetensors, text_projection.weight is:
tensor([[-0.0348, -0.0037, 0.0261, ..., 0.0042, -0.0236, 0.0084],
[ 0.0218, 0.0202, -0.0221, ..., -0.0068, 0.0094, -0.0021],
[-0.0025, -0.0014, -0.0099, ..., 0.0086, 0.0040, -0.0049],
...,
[-0.0034, -0.0082, -0.0233, ..., 0.0008, -0.0098, 0.0026],
[-0.0138, -0.0011, -0.0178, ..., -0.0030, 0.0330, -0.0055],
[-0.0075, 0.0170, -0.0061, ..., -0.0042, 0.0147, -0.0294]],
dtype=torch.float16)
Which one is correct? Is one the original and the other fine-tuned? As far as I can tell, the two tensors are just transposes of each other, so the values themselves match.
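For reference, here is a minimal sketch of how the two tensors can be compared. The file paths are assumptions based on the layouts described above (the single-file SGM checkpoint vs. the diffusers text_encoder_2 subfolder) and may differ locally:

```python
import torch
from safetensors import safe_open

SGM_CKPT = "sd_xl_base_1.0.safetensors"                   # assumed local path
DIFFUSERS_CKPT = "text_encoder_2/model.fp16.safetensors"  # assumed local path

# Read only the projection tensor from each checkpoint.
with safe_open(SGM_CKPT, framework="pt", device="cpu") as f:
    proj_sgm = f.get_tensor("conditioner.embedders.1.model.text_projection")

with safe_open(DIFFUSERS_CKPT, framework="pt", device="cpu") as f:
    proj_diffusers = f.get_tensor("text_projection.weight")

print(proj_sgm.shape, proj_diffusers.shape)
# True if one tensor is exactly the transpose of the other.
print(torch.equal(proj_sgm, proj_diffusers.T))
```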