How to SFT this model? (prefix-LM attention-mask related)

#8 · opened by Coobiw

Hi, thanks for your great work! I've found that `self._merge_input_ids_with_image_features` generates a 4D full-attention mask (when there is no padding). When I SFT this model, the attention mask over the text input (instruction) should be fully bidirectional, while over the text output (response) it should be causal. PaliGemma's `forward` function does not support this: if I pass a 4D mask that is fully bidirectional over the image and instruction tokens and causal over the response tokens, `self._merge_input_ids_with_image_features` fails, because it expects a 2D mask.
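
For reference, here is a minimal sketch (plain PyTorch, a hypothetical helper that is not part of the PaliGemma code) of the prefix-LM mask I mean: full attention within the prefix (image + instruction tokens), causal attention over the response tokens:

```python
import torch

def build_prefix_lm_mask(prefix_len: int, total_len: int) -> torch.Tensor:
    """Return a (total_len, total_len) boolean mask; True = position may be attended to."""
    # Start from a standard causal (lower-triangular) mask ...
    mask = torch.tril(torch.ones(total_len, total_len, dtype=torch.bool))
    # ... then make the prefix block fully bidirectional.
    mask[:prefix_len, :prefix_len] = True
    return mask

# Example: 4 prefix tokens (image + instruction), 3 response tokens.
print(build_prefix_lm_mask(prefix_len=4, total_len=7).int())
```

Response rows still only attend to earlier positions (causal), while all prefix positions can see each other.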

I've found that my transformers version was 4.31. After updating it to 4.32, I found that the processor accepts a `suffix` input. This solves my problem. Thanks!
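
For anyone else hitting this, a minimal SFT-style sketch, assuming a transformers version whose PaliGemma processor supports `suffix` (the checkpoint name and image path below are placeholders): passing `suffix` makes the processor return `labels` with the prefix (image + instruction) masked out, plus `token_type_ids` that let the model build the prefix-LM mask internally (full attention over the prefix, causal over the suffix).

```python
from transformers import PaliGemmaForConditionalGeneration, PaliGemmaProcessor
from PIL import Image

model_id = "google/paligemma-3b-pt-224"  # example checkpoint
processor = PaliGemmaProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id)

img = Image.open("example.jpg")  # placeholder image path

inputs = processor(
    text="caption en",           # instruction / prefix (image-token handling may differ across versions)
    images=img,
    suffix="a cat on a sofa",    # response / target, used to build the labels
    return_tensors="pt",
)

outputs = model(**inputs)        # loss is computed only on the suffix tokens
print(outputs.loss)
```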
