davanstrien/finetune_colpali_v1_2-ufo-4bit · Can you share the notebook pls

smjain

Sep 26

Is it possible to share the notebook for fine-tuning ColPali?

davanstrien

Owner Sep 26

edited Sep 26

Sure! I used this wonderful notebook from @tonywu71. I only made some very small changes to how the dataset was loaded, but otherwise everything is kept as in the original notebook :)

smjain

Sep 26

Thanks a lot. I also just saw it. I will revise the data loader for my dataset. Thanks

davanstrien

Owner Sep 26

Sounds good, @smjain. Excited to see what you train!

smjain

Sep 27

Sure, I will keep you posted as soon as I have taken a stab at it.

wirtsi

Oct 9

I would also be very interested in how you changed the data loading. The way I understand the notebook, the trainer plucks the query and answer from the dataset for training, right? How did you handle your case, where you have multiple queries and explanations in one row?

davanstrien

Owner Oct 9

@wirtsi for now, I only trained on one of the columns but I think it could be worth trying:

- randomly choosing one query type
- training on multiple queries per image

I am lazy so I would just convert the dataset up front to keep the training loop the same. For the UFO dataset something like this should work:

from datasets import Dataset, Image
import polars as pl

# Lazily scan the parquet file on the Hub
df = pl.scan_parquet("hf://datasets/davanstrien/ufo-ColPali/data/train-00000-of-00001.parquet")
# Keep only the rows flagged as parsed_into_json
df = df.filter(pl.col("parsed_into_json") == True)
df = df.drop("raw_queries")
# Go from one row per image to one row per (image, query) pair
# (unpivot was called melt in older polars releases)
unpivoted_df = df.unpivot(
    index=["image"],
    on=["broad_topical_query", "specific_detail_query", "visual_element_query"],
    variable_name="query_type",
    value_name="query",
)

# Drop the query_type column as it's no longer needed, then collect the results
result_df = unpivoted_df.drop("query_type").collect()
# Convert to a datasets.Dataset
ds = Dataset.from_polars(result_df)
# Cast the image column to an Image type
ds = ds.cast_column("image", Image())
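
For the other option listed above, randomly choosing one query type per image, a minimal sketch could look like the following. It is not taken from the notebook: `pick_random_query` and `QUERY_COLUMNS` are hypothetical names, the toy row is made up, and the sketch assumes each row is a plain dict that still has the three original query columns.

```python
import random

# Assumption: the three query columns from the UFO dataset above.
QUERY_COLUMNS = [
    "broad_topical_query",
    "specific_detail_query",
    "visual_element_query",
]


def pick_random_query(row, rng=random):
    """Return a copy of `row` with the query columns replaced by one random `query`."""
    # Ignore missing or empty queries so they are never selected.
    candidates = [row[col] for col in QUERY_COLUMNS if row.get(col)]
    if not candidates:
        raise ValueError("row has no non-empty query column")
    # Keep every other column (e.g. the image) untouched.
    out = {key: value for key, value in row.items() if key not in QUERY_COLUMNS}
    out["query"] = rng.choice(candidates)
    return out


# Toy row standing in for one dataset example (values are made up).
row = {
    "image": "page_001.png",
    "broad_topical_query": "reports of unidentified flying objects",
    "specific_detail_query": "date of the sighting described on this page",
    "visual_element_query": "hand-drawn sketch of a disc-shaped object",
}

rng = random.Random(0)  # seeded so the choice is reproducible
picked = pick_random_query(row, rng)
print(picked["query"])
```

On a `datasets.Dataset` that still has the wide columns, something like `ds.map(pick_random_query, remove_columns=QUERY_COLUMNS)` should apply the same idea, although that has not been run against the UFO dataset here. Note that `map` fixes the choice once up front; re-drawing a query every epoch would need the sampling to happen inside the data collator or an on-the-fly transform instead.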

wirtsi

Oct 10

Fantastic, thank you so much 😍

smjain

Oct 10

You can try this if it helps: https://colab.research.google.com/drive/1eYNzzjj9gwuLkjQzyme6If7aHFF0md3j?usp=sharing