HuggingFaceM4
/

idefics2-8b

Image-Text-to-Text

text-generation-inference

Inference Endpoints

Model card Files Files and versions Community

Comment on image splitting

#4

by HugoLaurencon - opened Apr 16

base: refs/heads/main

←

from: refs/pr/4

Discussion Files changed

Files changed (1) hide show

README.md +2 -0

README.md CHANGED Viewed

@@ -223,6 +223,8 @@ Given the high resolution supported, the vision part of the model can be memory
 - **deactivate the image splitting.** To do so, add `do_image_splitting=False` when initializing the processor (`AutoProcessor.from_pretrained`). There are no changes required on the model side. Note that only the sft model has been trained with image splitting.
 - **decrease the maximum image resolution.** To do so, add `size= {"longest_edge": 448, "shortest_edge": 378}` when initializing the processor (`AutoProcessor.from_pretrained`). In particular, the `longest_edge` value can be adapted to fit the need. We recommend using values that are multiples of 14. There are no changes required on the model side.
 **Using Flash-attention 2 to speed up generation**
 <details><summary>Click to expand.</summary>

 - **deactivate the image splitting.** To do so, add `do_image_splitting=False` when initializing the processor (`AutoProcessor.from_pretrained`). There are no changes required on the model side. Note that only the sft model has been trained with image splitting.
 - **decrease the maximum image resolution.** To do so, add `size= {"longest_edge": 448, "shortest_edge": 378}` when initializing the processor (`AutoProcessor.from_pretrained`). In particular, the `longest_edge` value can be adapted to fit the need. We recommend using values that are multiples of 14. There are no changes required on the model side.
+`do_image_splitting=True` is especially needed to boost performance on OCR tasks where a very large image is used as input. For the regular VQA or captioning tasks, this argument can be safely set to `False` with minimal impact on performance (see the evaluation table above).
 **Using Flash-attention 2 to speed up generation**
 <details><summary>Click to expand.</summary>