Mixing text-only data into fine-tuning
I would like to add some text-only data into my fine-tuning dataset (which has images).
How can I mix my text-only data with the regular image-text data?
I know that Idefics2 can take text-only data as an input, but I want to create a mix on the batch level.
I'm currently using the below DataCollator for processing the usual image-text data:
class MyDataCollator:
    def __init__(self, processor):
        self.processor = processor
        self.image_token_id = self.processor.tokenizer.additional_special_tokens_ids[
            self.processor.tokenizer.additional_special_tokens.index("<image>")
        ]

    def __call__(self, examples):
        texts = []
        images = []
        for example in examples:
            image = example["images"][0]
            if image is None:
                continue
            for example_text in example["texts"]:
                question = example_text["user"]
                answer = example_text["assistant"]
                messages = [
                    {
                        "role": "user",
                        "content": [
                            {"type": "text", "text": "Answer briefly."},
                            {"type": "image"},
                            {"type": "text", "text": question},
                        ],
                    },
                    {
                        "role": "assistant",
                        "content": [
                            {"type": "text", "text": answer},
                        ],
                    },
                ]
                text = self.processor.apply_chat_template(messages, add_generation_prompt=False)
                # added for the base model
                text = text.replace("<end_of_utterance>", "")
                texts.append(text.strip())
                images.append([image])
        batch = self.processor(text=texts, images=images, return_tensors="pt", padding=True)
        labels = batch["input_ids"].clone()
        labels[labels == self.processor.tokenizer.pad_token_id] = self.image_token_id
        # labels[labels == self.processor.tokenizer.pad_token_id] = -100
        # labels[labels == self.image_token_id] = -100
        batch["labels"] = labels
        return batch
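As an aside on the commented-out lines above: the usual scheme is to set pad (and often image) positions in `labels` to `-100` so the cross-entropy loss ignores them. A minimal torch-free sketch of that masking logic on plain Python lists (the token ids `0` and `9` below are placeholder values for illustration, not Idefics2's real ids):

```python
def mask_labels(input_ids, pad_token_id, image_token_id, ignore_index=-100):
    """Copy input_ids into labels, masking pad and image positions.

    Positions equal to pad_token_id or image_token_id are set to
    ignore_index so the cross-entropy loss skips them.
    """
    return [
        ignore_index if tok in (pad_token_id, image_token_id) else tok
        for tok in input_ids
    ]

# toy ids: 0 = <pad>, 9 = <image> (placeholders for illustration)
print(mask_labels([5, 9, 9, 7, 3, 0, 0], pad_token_id=0, image_token_id=9))
# [5, -100, -100, 7, 3, -100, -100]
```

With tensors the equivalent is the boolean-indexing pattern already visible in the commented lines of the collator.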
I thought about adding `None`s into the `images` list, but it gives the error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[101], line 2
1 data_collator = MyDataCollatorTheCauldron(processor)
----> 2 collated_text = data_collator.__call__(sumjpn_data_50k[:10])
Cell In[100], line 67
62 images.append([image])
63 #if image is None:
64 #batch = self.processor(text=texts, images=images, return_tensors="pt", padding=True)
65 # batch = self.processor(text=texts, return_tensors="pt", padding=True)
66 #else:
---> 67 batch = self.processor(text=texts, images=images, return_tensors="pt", padding=True)
69 labels = batch["input_ids"].clone()
70 labels[labels == self.processor.tokenizer.pad_token_id] = self.image_token_id
File ~/miniconda3/envs/vdu/lib/python3.10/site-packages/transformers/models/idefics2/processing_idefics2.py:230, in Idefics2Processor.__call__(self, text, images, image_seq_len, padding, truncation, max_length, is_split_into_words, add_special_tokens, return_tensors)
225 raise ValueError(
226 f"The number of images in the text {n_images_in_text} and images {n_images_in_images} should be the same."
227 )
229 # Load images if they are URLs
--> 230 images = [[load_image(im) for im in sample] for sample in images]
231 image_inputs = self.image_processor(images, return_tensors=return_tensors)
232 inputs.update(image_inputs)
File ~/miniconda3/envs/vdu/lib/python3.10/site-packages/transformers/models/idefics2/processing_idefics2.py:230, in <listcomp>(.0)
225 raise ValueError(
226 f"The number of images in the text {n_images_in_text} and images {n_images_in_images} should be the same."
227 )
229 # Load images if they are URLs
--> 230 images = [[load_image(im) for im in sample] for sample in images]
231 image_inputs = self.image_processor(images, return_tensors=return_tensors)
232 inputs.update(image_inputs)
File ~/miniconda3/envs/vdu/lib/python3.10/site-packages/transformers/models/idefics2/processing_idefics2.py:230, in <listcomp>(.0)
225 raise ValueError(
226 f"The number of images in the text {n_images_in_text} and images {n_images_in_images} should be the same."
227 )
229 # Load images if they are URLs
--> 230 images = [[load_image(im) for im in sample] for sample in images]
231 image_inputs = self.image_processor(images, return_tensors=return_tensors)
232 inputs.update(image_inputs)
File ~/miniconda3/envs/vdu/lib/python3.10/site-packages/transformers/image_utils.py:332, in load_image(image, timeout)
330 image = image
331 else:
--> 332 raise ValueError(
333 "Incorrect format used for image. Should be an url linking to an image, a base64 string, a local path, or a PIL image."
334 )
335 image = PIL.ImageOps.exif_transpose(image)
336 image = image.convert("RGB")
ValueError: Incorrect format used for image. Should be an url linking to an image, a base64 string, a local path, or a PIL image.
Do you have any ideas?
Hi @bilibraker, can you say more about "adding text-only data"?
What you are showing is indeed fine-tuning under the dialogue format, so how about adding the text in the user input?
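One way to realize that suggestion: build the `messages` list conditionally, including the `{"type": "image"}` entry only for examples that actually carry an image. A minimal sketch of the message construction (pure Python, independent of the processor; whether the downstream `Idefics2Processor` image-count check then passes for the mixed batch still needs verifying):

```python
def build_messages(question, answer, has_image):
    """Build an Idefics2-style chat, inserting the image slot only
    when the example carries an image."""
    user_content = [{"type": "text", "text": "Answer briefly."}]
    if has_image:
        user_content.append({"type": "image"})
    user_content.append({"type": "text", "text": question})
    return [
        {"role": "user", "content": user_content},
        {"role": "assistant", "content": [{"type": "text", "text": answer}]},
    ]

msgs = build_messages("What is 2+2?", "4", has_image=False)
# text-only sample: no {"type": "image"} entry in the user turn
assert all(c["type"] == "text" for c in msgs[0]["content"])
```

The resulting list can be fed to `apply_chat_template` exactly as in the collator above.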
Let's say we have a batch of 5 data samples with 3 being image-text and 2 being text-only with the following schemas:
image-text data schema (as in The Cauldron)
{
    "images": [PIL.Image],
    "texts": [
        {
            "user": "Question: How many actions are depicted in the diagram?\nChoices:\nA. 6.\nB. 4.\nC. 8.\nD. 7.\nAnswer with the letter.",
            "assistant": "Answer: D",
            "source": "TQA"
        }
    ]
}
text-only data schema (the only difference is the "images" key)
{
    "images": None,
    "texts": [
        {
            "user": "Question: How many actions are depicted in the diagram?\nChoices:\nA. 6.\nB. 4.\nC. 8.\nD. 7.\nAnswer with the letter.",
            "assistant": "Answer: D",
            "source": "TQA"
        }
    ]
}
I would like to feed this mixed batch of text-only and image-text data to a DataCollator that can process both types of samples.
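One workaround that avoids the image-count mismatch entirely is to partition each batch by modality and run the processor twice, once with `images=...` for the image subset and once without for the text-only subset, then pad the two sub-batches to a common length before concatenating. The partition step is plain Python; a sketch (the double processor call downstream is assumed, not shown):

```python
def split_by_modality(examples):
    """Partition a mixed batch into image-text and text-only subsets.

    An example counts as text-only when its "images" field is None,
    empty, or contains only None entries.
    """
    with_image, text_only = [], []
    for ex in examples:
        imgs = ex.get("images")
        if imgs and any(im is not None for im in imgs):
            with_image.append(ex)
        else:
            text_only.append(ex)
    return with_image, text_only

batch = [
    {"images": ["<PIL.Image>"], "texts": [{"user": "q1", "assistant": "a1"}]},
    {"images": None, "texts": [{"user": "q2", "assistant": "a2"}]},
    {"images": [None], "texts": [{"user": "q3", "assistant": "a3"}]},
]
imgs, txt = split_by_modality(batch)
print(len(imgs), len(txt))  # 1 2
```

Note that concatenating the two sub-batches changes sample order within the batch, which is harmless for training but worth remembering if you track per-sample metadata.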
@VictorSanh
I temporarily solved the issue by adding empty images to the text-only instances, but I'm still curious about a more robust solution.
Also, how did you solve this issue when training Idefics2? Its training data also contains both text-only and image-text data.
Same question here: is passing an empty image actually valid?